Techie Details

Here are some FAQs I get about how my pricelist software works:

Where do you get the data from?

The data comes from newsgroups in the rec.games.trading-cards hierarchy on Usenet. Data is collected every Wednesday morning, starting around 6:00 AM (when my cron job to do this runs) and finishing around 9:00 or so, when the data is done downloading (check out our network situation to see why this takes so long). There is usually between 28 and 36 MB of raw data per week that has to be transferred.

How do you get the data off the news server?

I used to use shell scripts to automate the process described in About Cloister's Trading Card Price Lists. However, those were originally written on a DECStation 5000 running DEC Ultrix. As it turns out, many vendor-supplied versions of telnet don't like doing file-redirected I/O. When I graduated and had to move off of the University of Washington's computers and onto my Linux machine at home, my scripts didn't work properly anymore. I'd been thinking anyway about going one better than scripts and writing a C utility to talk to the news server for me, and it looked like the time had come to do that. But as luck would have it, some clever guy out there produced a set of utilities called "suck" (because they suck news off of a server, not because they are bad utilities) which happened to do just what I needed. So what I'm using now are some shell scripts built around suck to get the raw data and do some sorting and pre-processing on it before the actual lists get made.

How do you process the articles into usable data?

The raw article stream contains, as you would guess, a whole lot that isn't price data. The first thing I do (well, the above-mentioned shell script does it) is call a utility I wrote that sorts the raw articles into groups based on what game they're from. This is necessary because there are some cards from different games with identical names. If I processed everything at once to make the different lists simultaneously, the data for those particular cards would be wrong. After that, I process each group of files through grep to get all the lines that look like they have prices on them (defined as numbers with zero or more digits, a decimal point, and then two digits after the decimal). This reduces the junk to tolerable levels, such that most of what's left is actually useful data.
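To make the "looks like a price" rule concrete, here is a rough sketch in Python rather than grep. The regular expression is my reading of the definition above (zero or more digits, a decimal point, two digits), and the sample lines are made up for illustration; the real filter is a grep call inside a shell script.

```python
import re

# "Zero or more digits, a decimal point, then two digits", per the text above.
# This exact pattern is an assumption; the original grep pattern is not shown.
PRICE_RE = re.compile(r"\d*\.\d\d")

def price_lines(lines):
    """Keep only the lines that contain something shaped like a price."""
    return [line for line in lines if PRICE_RE.search(line)]

# Made-up sample input, not real newsgroup data.
sample = [
    "Shivan Dragon  12.50",
    "Will trade for anything, e-mail me",
    "Black Lotus .. make an offer",
    "Juzam Djinn $.75 or best offer",
]

for line in price_lines(sample):
    print(line)
```

A pattern this loose also lets through things like version numbers, which is why the filter only reduces the junk rather than eliminating it.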

How do you collate all that sorted data into actual pricelists?

I use custom utilities I wrote. If you as a human read a line of data, you can almost instantly find the part of the line that is the card name, and know which card that is. That's pretty amazing. In a split second, your brain is capable of comparing all the text on the line (which probably includes stuff besides the card name) against a very long list of strings (e.g. M:tG has over 1000 different cards in it now) and figuring out which one the text is talking about. That's a pretty horrific string matching problem for a computer, but your brain does it in a flash. Don't let that fool you into thinking it's easy. The simple solution for a programmer would be to iterate through all the card names for some game, and see if the name is present in the line of text. Those readers who have done any C programming will instantly realize what a slow process that would be.
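For the curious, the simple solution might look something like this in Python. The function name and the sample data are hypothetical; this is the approach I decided against, shown only to make the cost visible.

```python
def naive_match(line, card_names):
    """Check every card name against the line, one at a time."""
    lowered = line.lower()
    found = []
    for name in card_names:      # one substring search per card name
        if name in lowered:
            found.append(name)
    return found

# Made-up example: two lowercase card names and one line of data.
cards = ["shivan dragon", "serra angel"]
print(naive_match("FS: Serra Angel 4.00", cards))
```

Every line of data costs one substring search per card name, so the total work grows with both the size of the card list and the amount of data.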

Whatever solution I came up with, it had to satisfy these properties: a) be fast, b) not be dependent on the number of card names, since I knew that M:tG (the only trading card game around at the time) was going to get a lot bigger as time went on, and since I knew that the amount of data was going to increase as time went on too, c) be tolerant of errors in the data, because some people are terrible spellers, some people use horrible abbreviations of card names, and some are just too lazy to look up the real name of a card when they can't think of it so they guess.

What I did was to build a big tree structure containing the card names in such a way that input data could be compared against all the strings in the tree more or less simultaneously. So that reduced the problem from an O((number of card names)*(amount of data)) problem into an O(amount of data) problem. This was a big win. Once I had that working, I added some error-correcting heuristics into the algorithm so that it could catch most common errors (things like missng letters, trnasposed letters, incorrwct letters, and so on). That worked very well. Really amazingly remarkably well, in fact. That worked well enough that I released the first version of the M:tG pricelist on May Eighteenth, 1994.
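To give a feel for the general idea without giving away my actual code, here is a textbook prefix tree (trie) in Python. It is a generic sketch, not my algorithm: it has no error correction at all, and every name in it (END, build_trie, find_cards) is made up for this example.

```python
END = None  # hypothetical marker key: "a complete card name ends at this node"

def build_trie(card_names):
    """Store all card names in one nested-dict tree, one character per level."""
    root = {}
    for name in card_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name             # remember the properly capitalized name
    return root

def find_cards(line, trie):
    """Return every card name that appears somewhere in the line."""
    text = line.lower()
    found = []
    for start in range(len(text)):
        node = trie
        i = start
        # Walk the tree for as long as the text keeps matching some name.
        while i < len(text) and text[i] in node:
            node = node[text[i]]
            i += 1
            if END in node:
                found.append(node[END])
    return found

trie = build_trie(["Serra Angel", "Shivan Dragon", "Dragon Whelp"])
print(find_cards("WTS: shivan dragon 12.50, serra angel 4.00", trie))
```

The walk from each starting position can never go deeper than the longest card name, no matter how many names are stored, so the work per line depends on the length of the line and not on the size of the card list.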

After solving the "what card is this line of data talking about" problem, I had to write a routine to find the price information on the line of data and tally it up. That part was trivial in comparison, so I won't go into detail on it.

How long does it take to make the pricelists each week?

It used to take a long time (three hours or so of my time), but I and my software have gotten a lot better at it since then. If you count total process time, it takes about four and a half hours. About 3 of those hours are spent downloading the data over our slow PPP link. About a half hour is spent sorting the data into per-game groups and grepping it for lines of data. (This is on a 100 MHz 486 with 48 MB of RAM, running Linux 1.2.13. As a point of interest, having a machine like this all to myself is a lot faster than using a shared DEC 5000 at school.) The remaining hour is me running the data through the actual pricelist generator, doing housekeeping chores, and answering price list related e-mail. That the last part only takes about an hour is a credit to my string matching routine. At full tilt it can process around 560 lines of data per second. I'm pretty darned proud of that, considering how much processing has to happen for each line of data.

What are some of the major changes from your original pricelists to the current ones?

There have been four or five different pricelist formats since the first version. The major format changes include a) switching from simple averaging to more complex statistical analysis of the data, b) going from an all lowercase list to a list with properly capitalized and punctuated card names (this required some tweaking of my string matching software), c) introducing the "change" column to reflect the difference in prices since the previous list, and d) adding totals at the bottom of the list.

On the side of the algorithms behind the scenes, the first major change was to modify the format of the card name database I use so that it could store typical abbreviations and misspellings of cards that the error correcting heuristics couldn't catch. This is effectively like saying "this particular error is really not an error". Call it a hack if you will, but it works, and since the algorithm isn't sensitive to the number of names in the database, there's no speed penalty for adding "virtual" card names.
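As a loose illustration of the "virtual" card name idea, here is a Python sketch using a plain dictionary. The table contents and the names ALIASES and canonical_name are hypothetical, and my real database format is different; the point is only that a misspelling or abbreviation is stored as a name in its own right, pointing at the real card.

```python
# Hypothetical alias table: each "virtual" name points at the real card name.
ALIASES = {
    "serra angel": "Serra Angel",
    "sera angel": "Serra Angel",      # a misspelling, stored as a valid name
    "shivan dragon": "Shivan Dragon",
    "shivan": "Shivan Dragon",        # an abbreviation, stored the same way
}

def canonical_name(text):
    """Map a real or virtual card name to the real name, or None if unknown."""
    return ALIASES.get(text.strip().lower())

print(canonical_name("  Sera Angel "))
print(canonical_name("shivan"))
```

A dictionary lookup takes roughly the same time however many entries it holds, which is the same property that lets me keep adding virtual names without slowing anything down.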

The second major change came with the introduction of the statistical analysis of the data. Obviously, that changed the nature of the prices in the list, but for the better I think.

There was also a minor change along the way in how the program interprets lines with multiple price-like numbers on them. I used to just take the highest one, but now the algorithm for picking the price is tuned to the sorts of sale and auction formats that people really use in an attempt to find more actual prices and fewer buyout bid numbers.
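Here is a small Python sketch of the old "just take the highest one" rule, to show why it needed replacing. The regular expression and the sample line are assumptions made for illustration, and the tuned rule I use now is deliberately not shown.

```python
import re

# Same assumed price shape as the grep step: digits, a decimal point, two digits.
PRICE_RE = re.compile(r"\d*\.\d\d")

def highest_price(line):
    """Old rule: return the largest price-like number on the line, or None."""
    prices = [float(p) for p in PRICE_RE.findall(line)]
    return max(prices) if prices else None

# Made-up auction line with both a minimum bid and a buyout price.
print(highest_price("Shivan Dragon  min bid 8.00  buyout 15.00"))
```

On an auction line like the one above, the highest number is the buyout figure rather than anything a card actually sold for, which is exactly the sort of number the current rule tries to avoid.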

Why won't you be more specific about your string matching algorithm or release your software?

The short answer is "because I don't want to." The long answer used to be the following, until Shadis started totally ignoring any attempts I made to communicate with them: Because I now have a financial interest in my pricelists. Shadis magazine publishes my lists monthly now, and in doing so they pay for my network connection. I don't think I'd be doing myself, Shadis, or the card trading/selling community any favors by releasing my software or my algorithms. I wouldn't like to see a situation where a) I screwed the nice folks at Shadis magazine over, b) I had to pay for my own net connection (horrors!), or c) people with less of a commitment to providing the lists for free on the net were able to make them and charge for them. It is important to me that these lists be freely available to anyone with Usenet, WWW, or FTP access.

Am I selling out? Maybe. But I've worked hard to make a quality, free product that people find useful and I think I deserve to get a little something back for it. I think my lists are more realistic than those in Scrye magazine; as I understand it Scrye takes surveys of card shop owners to see what they're buying/selling cards for, while I directly sample what normal people all over the world are buying/selling them for. My lists are certainly more current than Scrye's lists. They're definitely cheaper than buying a copy of Scrye every month, and hopefully they're more accessible to people both inside and outside the U.S.

You can reach me at: cloister(at)hhhh(dot)org