Combined USCF and FIDE rating databases

Owing to a production error, I had to recreate the March 2021 combined database. The bad version has been replaced. Please download the new version (same link as always). I apologize for any inconvenience.

The April 2021 combined database is now available. The link to the latest version of the database will now return the April 2021 database. (There is also a direct link to this database: https://www.sudburyriverchess.com/ratings/uscf-fide_2021-04.zip.)

There is a SwissSys database configuration file for use with the combined database at https://www.sudburyriverchess.com/ratings/uscf-fide.dbp. If you haven’t already configured SwissSys to use the combined database, you can download this file, launch SwissSys, choose “Database Setup” under the “Database” menu, and click the “Load Settings…” button. Just change the “Database file” field in the “Database Setup” dialog to point to the directory where you have placed the uscf-fide.dbf file and click the “OK - Apply” button. If you’ve already configured SwissSys to work with the database, you don’t need the uscf-fide.dbp file.

With the death of Ken Ballou, as reported on the US Chess website earlier today, the status of this monthly posting is in doubt.

Is this something that is widely enough used that US Chess should consider some kind of support for it? I know the idea of a new supplement format has been kicked around for years, but a new format would probably require updates to WinTD and SwisSys, and there’s still the timing issue, because US Chess generates its ratings for the next month between the 15th and 21st of the previous month while FIDE doesn’t post its ratings for the next month until around the last day of the previous month.

I can say I would appreciate this with a resounding YES. At the very least if US Chess produced a file with the full player’s name, other normal fields (ID, state, etc.), FIDE ID, and all 6 ratings that would be enough. Then this can be combined with the main FIDE list at the end of the month. Also in the US Chess file it should either have the player’s status (or code) or filtered to exclude IDs that are no longer needed (duplicate, invalid, etc.)

Many people are perhaps still unaware that since last October we have been posting an ‘allratings’ file, which has the US Chess ID, player’s name, state, country, expiration date, status (deceased, duplicate, etc), FIDE ID, FIDE country, latest rating list date and all 6 OTB and online ratings.

This was done in large part to facilitate Ken’s combined USCF/FIDE databases.

That is awesome! I was unaware of this file. If desired, I can build the monthly database from this.

What do the different status codes mean? I’m presuming “A” is the normal active one, “D” for duplicate?, “I” for inactive? Is “E” deceased? What is “U”?

You got the others right, U is ‘undefined’, meaning that no membership has been processed, which could mean a membership was process but subsequently reversed or there was some kind of processing problem with a membership, but may also show up for some non-members who were exempted from the membership requirement. (We used to require non-members being coded as a Q or an X before those IDs could be used in a tournament, but I’m not sure if the new membership system does that.)

It sounds as though this file should serve most of the same purposes as Ken’s file. And if that file also contains a link to FIDE’s ratings file, that should serve most of the rest of that purpose.

I’ve spent the past 36 hours being bummed out about Ken’s untimely departure from our world. It seems he was an Atlas of U.S. Chess, supporting major portions of its internal structure – not only ratings, but also rules, ethics, elections, and more. It will now be difficult for the rest of us to try to rush in where both angels and fools have attempted to tread, some with fear and some without.

Can WinTD and SwissSys be upgraded to read this “allratings” file directly, instead of the current .DBF files? Reading between the lines of some previous posts, I get the feeling that there are strong disagreements between at least one person at U.S. Chess and the author of at least one of the pairing program authors, regarding the format of such files – JSON versus comma-delimited, or something like that – and that this disagreement has impeded progress. Perhaps now is the time to bite the bullet and get something worked out.

Bill Smythe

The allratings file is, as I recall, tab delimited, so it is neither JSON nor CSV. It would be up to the pairing program authors to decide whether and how to support it. I don’t know if they have a place to put the status information, and I think WinTD and SwisSys currently only support having information on two ratings systems.

That’s good to hear. It should be easy enough for the pairing programs to accommodate tab-delimited, and to allow the user to choose (if necessary) which two rating systems to import.

It sounds like the all-ratings file is sort of like the old Golden Database, which unfortunately had a slightly different format from the .DBF rating files used by the programs. Is it (or will it be) possible to issue the monthly supplements in the same format (i.e. with the same fields) as the all-ratings file? That way the pairing programs would be able to use a much smaller version, but still in the newer format.

Bill Smythe

Each month we currently post well over a dozen different ratings files. We need to be thinking about lowering that number, not increasing it.

There are no plans to add additional formats, and still no consensus on what fields to include or what format(s) to support, either.

DBF formats, in addition to being archaic and not well supported these days, are fixed field, so for every character you add to the length of a field, for example, the name field, you add over a million characters to the total file length. Variable-length formats don’t waste space like that, though JSON and XML use header tags in each record that can take up even more space.

I would think that using the format of the all-ratings file could be the first step in reducing the number of files each month to three:

  • One could be the whole monster (a la the current Golden Data file), including everything going back to when Noah was a boy.
  • Another could be an annual file, consisting of everything that has changed since the previous annual file.
  • The third could be a monthly file, just those records that have changed since the previous month.

All of these could be in the same format, i.e. that of the all-ratings file.

The pairing programs could take a looksee at the file type (monster, annual, or monthly) and file date (year and month) of the most recent file(s) loaded, and make a decision as to which file type(s) it needs to request.

Support of the .DBF files could end about 1 or 2 years after the pairing programs have implemented the above. That would give users of pairing programs plenty of time to update their versions.

Bill Smythe

XML is pointless as well as being —Microsoft Office uses it because they put much of the formatting information into the attributes. The US chess data is just straight data so JSON is fine, and is better supported since it’s simpler than XML. (I experimented with SQL, XML and JSON for the future format for WinTD files, and JSON was clearly superior—there just wasn’t enough value to the XML attributes, so XML ended up looking like JSON with closing tags).

I don’t know if you did it, or I did it, but a JSON version of the current GDB expands to about 50% larger than the DBF format and zips to about the same 50% larger. It could be a bit smaller in practice, since this had the numeric fields put in as strings (which is how they are represented in the DBF spec, though that doesn’t matter for DBF since all character fields are undelimited) while strings are “…” in JSON. Obviously if you add the extra ratings and lengthen the name field and add FIDE information… it will get longer, but it will get longer no matter what format you choose when you add fields. Probably 500Mb zipping to 50Mb as opposed to the current (with DBF) 170Mb zipping to 20Mb.

I wrote a program that built an XML version of the Golden Master database around 10 years ago, and then modified it to do JSON a couple years later. (The differences are minor on the surface level, though as you dig into the full capabilities of either format, they get larger.)

Depending on how verbose the XML/JSON token tags are and how many additional fields are added (for things like status information, duplicate ID pointers, FIDE information, other ratings systems, national master, OLM and norms-based title information), it ran anywhere from 100 to over 250 MB uncompressed. (By comparison, the Golden Master DBF format was about 30 MB uncompressed at the time.) It compressed a lot, of course.

Numeric tokens would save a lot of space, but then you lose the self-documenting aspects of either XML or JSON. A token that says is easy to understand, one that says <42> is not.

An idea I’ve seen in some of the JSON discussion forums is that the first use of any token has both a numeric and a text component but later ones have only the numeric token, so the first time it appears it would say <42 online regular rating> but after that it would only say <42>. I don’t know if any of the JSON validity checking programs support that idea.

XML has a ‘data dictionary’ format which could be used to document coded tokens, but even the XML advocates seemed less than thrilled about it.

I heard from a number of people who were concerned that US Chess might drop support for the existing formats, which would probably have forced TDs to buy an upgraded version of their pairing program(s). At the time we still had at least two TDs using a pre-2000 non-Windows version of SwisSys. (The programs treated dates in a way that made it easy to recognize that they were not the Windows versions.)

From this point forward I will use at most 3 rating files from the current ones produced - All ratings, Golden DB Blitz, and Golden DB Quick. Once WinTD recognizes a more inclusive file, I’ll stop using two of those three files.

With the FIDE ratings, the pairing programs should support 9 ratings fields. I have a few suggestions of additional ways to use those ratings, but that’s a different discussion.

The master FIDE ratings file is easily retrievable from the FIDE site. It’s easy to combine the All ratings file and this file at the end of the month. If people are interested, I’ll do this and make it publicly available. I’ll be doing it anyway for use on my own site.

As far as a good ratings file going forward, I completely agree. Forget XML. Either a tab-delimited or JSON file is much superior.

For the pairing programs it would be nice to support either the current All Ratings file or whatever new file we agree on plus an expanded version that includes the FIDE ratings that can be built at the end of the month. The expanded one can either be hosted by US Chess or externally.

FIDE is still the best place to get accurate FIDE ratings information. In case people aren’t aware of it, FIDE issues minor corrections to their monthly FIDE ratings file throughout the month. IMHO it would be better if the pairing programs had the ability to process FIDE ratings files as well as US Chess ones.

We know we have dozens of FIDE IDs that appear in more than one US Chess member record. Some of them are just duplicate records, but others are obviously not.

As I recall, there are also a few players in the FIDE system coded as ‘USA’ who don’t appear to be anyone active in US Chess.

Thank you for joining this conversation. You, as the WinTD programmer, are probably better equipped than anybody else to answer a few questions and propose some new ideas.

Besides SQL, XML, and JSON, what are the advantages and disadvantages of each of the following?

  • comma-delimited (CSV), with a header line
  • tab-delimited, with a header line
  • fixed-length, with a header record

I would think the first two would be efficient, and would allow for future changes (such as adding new fields) that would not require changes in the pairing software.

There’s something to be said for a fixed-length format, too. For one, it’s probably the easiest format to write a program around, and it would not need to be “translated” into fixed-length after loading, since it would already be that way.

If, in addition to fixed-length, all fields are stored as characters (e.g. a rating could be stored as “1234” rather than as two-byte binary), AND if all records ended with a CR/LF, then the file could be read (and understood) by anybody, just by viewing it with NotePad. (This might fit well with Mike Nolan’s “self-documenting” idea too.)

Bill Smythe

Tab-delimited and CSV are essentially the same, so there is no need to have both. Tab-delimited is better as you don’t need to escape fields as much. With CSV you’ll need extra quote characters surrounding the field if there are any commas in the field (such as the name field). This adds extra space.

The important thing is that any reader of the file must be flexible and not assume that the fields are in any particular order. The header line determines what is present and in what order. That way additional fields can be added over time without breaking older programs.

Fixed-length is useless. It not only adds a lot unnecessary space, but it is also inflexible. It is harder to add or change the order of fields as you need an outside spec to indicate how wide each field is. You could add an additional header line that indicates this, but just going with a delimited file is easier to begin with.

Yes, having the pairing programs also read the FIDE file is a better option. Of course, there are a lot more records in that file then we necessarily need if we’re just limiting to those matching in the US Chess file. In any case a cross reference between the two files is necessary. FIDE IDs present in more than one US Chess record isn’t that much of a problem if you go from the direction of the US Chess file. Start with the US Chess ID, then retrieve the ratings from the file with the listed FIDE ID. More than one US Chess record will end up using the same FIDE ratings, but that’s the way it goes when you have inaccurate information.

PHP has tools that can be for splitting out CSV, tab-delimited, JSON and XML files, though I found the JSON and XML tools a bit challenging to use because you still had to figure out which fields were absent. Tab-delimited is usually easier for me to use than CSV, though I have seen data sets with embedded tabs in fields.

When building a tab-delimited data file, I always make sure there aren’t any embedded tabs in fields, I’ve actually found a few in some name fields in our database, though I have no idea how someone would enter one. But one of the challenges with user-entered data is you have a lot less control over what you get. Recently I’ve been looking into an occasional webstore problem that appears to be related to non-English characters in data fields, like ñ, that cause problems with the response from our credit card processing vendor.

PHP still has some support for DBF files, though I think it is no longer part of the base package and could be dropped completely at some point. I think the main reason a DBF file was used for those files was that at the time US Chess was using a dBase/Clipper system, so it was the ‘native’ format.