It’s easy enough within WinTD (and I assume SwissSys) to verify USCF membership IDs, expiration dates and ratings for those who provide a number. Is there an easy way to check en masse a list of players who claim NOT to have a USCF ID?
I’m involved with several scholastic tournaments each year which offer non-rated sections for players who have not previously played rated games, and it’s not unusual to have close to a hundred or more we need to verify aren’t in the database. It’s way too tedious and time consuming to look them up one at a time!
If all you have is the name, matching it against our 800,000+ player database using MSA could get messy, both due to the potential number of matches and spelling issues. (Searching for ‘Michael Johnson’ won’t find ‘Mike Johnson’, or ‘Michael Jonson’ for example.)
One of my favorite examples is ‘Kevin Wang’. That matches 47 records, 15 of them in California.
The membership batch tool on TD/A can be used to check for multiple players and matches against address and birthdates, but it will take some time to enter multiple records into the form. (And it may not catch similar but not identical names.)
If you think they’re all in-state players, you could try generating a custom rating list for all the players in your state. (But we have over 24,000 current and former members in Ohio.)
We probably need better tools in this area, but creating them will not be a simple task.
The WinTD “Scan Supplement for IDs” (which applies to Master Files) will search the downloaded USCF database looking for “unambiguous” matches. It uses a fuzzy string comparison so that Alekhine, Alexander and Alekhine, Alex A. would probably be considered to be a match—it would be unambiguous if there weren’t another record which was also a relatively close match.
This basically punts on the Wang, Kevin situation where there are a large number of exact or near exact matches.
Assume for the moment that a tool could be created which would take a large number of names, possibly with some global parameters (like preference for those in one or more states, recent activity, age range, rating, etc.), and report back potential matching IDs for further research. (The report might have to be sent by email: a large number of names could take several minutes to research, depending upon how we dealt with the fuzzy name match issue, and a browser-based form might time out before getting a response. The search might also need to run in batch mode when the USCF server was relatively idle.)
I’m not saying such a tool is possible or likely to get written, though I did some work on a ‘nickname’ database a while back. ‘Robert Smith’ would match up with ‘Bob Smith’, ‘Bud Smith’, etc. (This is an interesting non-trivial data matching question in general. Should ‘Robert Smith’ match with ‘R Jonathan Smith’?)
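To make the nickname idea concrete, here is a minimal sketch in C of a lookup that maps a nickname to a canonical given name before comparing. The table contents, struct and function names are all hypothetical (only pairs mentioned in this thread are listed), and this is not the actual USCF nickname database. The comparison is case-sensitive for brevity.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical nickname table: each nickname maps to one canonical name. */
struct nickname {
    const char *nick;
    const char *canonical;
};

static const struct nickname NICKNAMES[] = {
    { "Bob",  "Robert"  },
    { "Bud",  "Robert"  },
    { "Dick", "Richard" },
    { "Mike", "Michael" },
};

/* Return the canonical form of a given name, or the name itself
   if it is not in the table. */
const char *canonical_first_name(const char *name)
{
    size_t count = sizeof NICKNAMES / sizeof NICKNAMES[0];
    for (size_t i = 0; i < count; i++) {
        if (strcmp(name, NICKNAMES[i].nick) == 0)
            return NICKNAMES[i].canonical;
    }
    return name;
}

/* Two given names are treated as equivalent if they share a canonical form. */
int first_names_equivalent(const char *a, const char *b)
{
    return strcmp(canonical_first_name(a), canonical_first_name(b)) == 0;
}
```

A one-to-one table like this cannot answer the ‘R Jonathan Smith’ question, and a real table would need nicknames that map to more than one full name.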
What format(s) would people be able to provide data in?
For submission, a combination of Excel or ODS (or I suppose CSV) would be most cool… If it could be passed via thin client (I’m assuming that Thin 3 is ID/Date only - never tried any other way,) with wildcards, I’d do it myself in VBA as I’ve done for ID searches when we host ICA events. (Though parsing the return would be a challenge when we don’t know how many entries would be returned.)
What I would like most to see is a checkbox option in Player lookup that would automatically include a reverse name search. (i.e. searching for Eri*,D* would give me Eric Darr in addition to Darren Erickson.) That way I could know that if I were searching for Brady Cooper that I’d also get to see Cooper Brady as well, in addition to names that I have no idea which is truly the surname. (Or the person who registered it didn’t know…) As it is I usually run two searches per name unless I think it would be insane - Trevor Wierzbowski, for instance.
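For what it’s worth, the reverse-search checkbox could be sketched roughly like this in C. It assumes the pattern is a last-name part and a first-name part, each optionally ending in a single trailing ‘*’, and that matching is case-insensitive; the function names are made up for illustration, and the real Player lookup may handle wildcards differently.

```c
#include <ctype.h>
#include <string.h>

/* Match a name against a pattern that may end in one trailing '*'
   (prefix match). Without '*', the whole name must match.
   Comparison is case-insensitive. */
static int match_prefix_pattern(const char *pattern, const char *name)
{
    size_t n = strlen(pattern);
    int wildcard = (n > 0 && pattern[n - 1] == '*');
    if (wildcard)
        n--;
    for (size_t i = 0; i < n; i++) {
        if (name[i] == '\0' ||
            tolower((unsigned char)pattern[i]) != tolower((unsigned char)name[i]))
            return 0;
    }
    return wildcard || name[n] == '\0';
}

/* Returns 1 for a match in the order given, 2 for a match found only by
   swapping the first and last name fields (when check_reverse is set),
   and 0 for no match. */
int match_name(const char *last_pat, const char *first_pat,
               const char *last, const char *first, int check_reverse)
{
    if (match_prefix_pattern(last_pat, last) &&
        match_prefix_pattern(first_pat, first))
        return 1;
    if (check_reverse &&
        match_prefix_pattern(last_pat, first) &&
        match_prefix_pattern(first_pat, last))
        return 2;
    return 0;
}
```

With check_reverse set, the pattern Eri*,D* matches Darren Erickson directly and Eric Darr through the swapped fields. Returning a different value for the swapped case lets the report flag which hits came from the reversed search.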
Some of the ‘cooler’ potential features, like address, ZIP code radius or birthdate/age matching, raise privacy issues. Whether putting that behind a password/login (e.g., on TD/A) meets various federal privacy guidelines may be a question for our legal counsel.
I’m not sure what ‘fuzzy matching’ algorithm Tom is using in WinTD, but there is a Levenshtein function in our database program. Looking for a first name of ‘Bill’ and allowing a Levenshtein distance of 2 or less would match any of the following (and this is just a partial list):
Bill
Billy
Phil
Phill
Will
Aila
Jill
Lili
Bilo
Keil
Nile
Bien
Yili
Bella
Bimal
Lila
Notice that ‘William’ does not match: it has a Levenshtein distance of 4 when compared to ‘Bill’.
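For anyone who wants to experiment with these distances, here is a minimal sketch in C of the textbook dynamic-programming Levenshtein distance. It is not the function in the USCF database program or anything from WinTD, whose details may differ (case handling, for instance); the name levenshtein is just a label used here.

```c
#include <stdlib.h>
#include <string.h>

/* Textbook Levenshtein distance using two rows of the DP table.
   Insertions, deletions and substitutions each cost 1.
   Case-sensitive. Returns -1 if memory allocation fails. */
int levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int *prev = malloc((lb + 1) * sizeof *prev);
    int *curr = malloc((lb + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Distance from the empty prefix of a to each prefix of b. */
    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j;

    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = prev[j - 1] + cost;      /* match or substitute */
            if (prev[j] + 1 < best)
                best = prev[j] + 1;             /* delete from a */
            if (curr[j - 1] + 1 < best)
                best = curr[j - 1] + 1;         /* insert into a */
            curr[j] = best;
        }
        int *tmp = prev;                        /* reuse the two rows */
        prev = curr;
        curr = tmp;
    }

    int result = prev[lb];
    free(prev);
    free(curr);
    return result;
}
```

With this version, levenshtein("Bill", "Billy") returns 1 and levenshtein("Bill", "William") returns 4: one substitution (B to W) plus three inserted letters.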
The basic algorithm WinTD uses for the fuzzy compare is described in:
“An O(ND) Difference Algorithm and its Variations”, Eugene Myers,
Algorithmica Vol. 1 No. 2, 1986, pp. 251-266;
see especially section 4.2, which describes the variation used below.
The basic algorithm was independently discovered as described in:
“Algorithms for Approximate String Matching”, E. Ukkonen,
Information and Control Vol. 64, 1985, pp. 100-118.
I put a heavy weight on a near-perfect match for the last name. I actually tried Nolan, Mike and got Nolan, Ricky before Nolan, Michael. If nicknames tended to have the first letter or two matching the full name it would be easier to use fuzzy compares, but then there’s Richard/Dick or Robert/Bob.
With over 835,000 records in the USCF database (and growing every day), I’m not sure matching on name alone is going to be much help. I do tend to agree that last-name matches should be tighter than first-name matches, though we now have nearly 6000 hyphenated last names in the USCF database.
I think we also have over 3000 cases where interchanging the first name and last name would produce identical matches. Some of those are probably duplicates that haven’t been flagged, but a lot of them are either cultural issues or just names that are common as surnames and given names.
I actually allow for the first and last names being flipped. fstrcmp returns a value in the range [0,1] (1 = exact match, 0 = nothing in common).
[code]
//
// Do fuzzy compares of lasts with each other and with the first name
//
if (last1 && last2)
    lastDiff = fstrcmp(last1, last2);
else
    lastDiff = 0.0;
if (first1 && first2)
    firstDiff = fstrcmp(first1, first2);
else
    firstDiff = 0.0;
if (last1 && first2)
    lfDiff = fstrcmp(last1, first2);
else
    lfDiff = 0.0;
if (first1 && last2)
    flDiff = fstrcmp(first1, last2);
else
    flDiff = 0.0;
//
// We expect last names to be similar, if not identical, and allow for first
// names to be somewhat different in scoring. We also allow for the possibility
// of the first and last names being reversed.
//
mapreg     = .7*lastDiff*lastDiff + .3*firstDiff*firstDiff;
mapreverse = .5*lfDiff*lfDiff + .5*flDiff*flDiff;
if (checkReverse)
    return EMAX(mapreg, mapreverse);
//
// If the last names are well off, don't give credit for a close
// match on the first name
//
if (lastDiff < .5)
    return .7*lastDiff*lastDiff;
[/code]
FWIW, typing ‘George John’ into the player search page, uschess.org/datapage/player-search.php
will find both “George John” and “John George” as well as “George John Price”. (If there were a “Robert George-John”, it would find him as well.)
However, “John, George” will find “George John” but not the other two. That’s because the comma forces “John” to be limited to searches of the last name field and “George” to be limited to searches of the first name field.
Without the comma, it searches separately for “George” and “John” as substrings of the full name field. (Which means it would find “Georgei Johnson”, too, if there were one.)
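Roughly, the two search modes behave like the following C sketch. This is only an illustration of the behavior described above, not the actual player-search.php logic: the struct and function names are hypothetical, the comparison here is case-sensitive, and treating the comma form as a substring match on each field is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical record with separate first and last name fields. */
struct player {
    const char *first;
    const char *last;
};

/* No comma: each term only has to appear somewhere in the full name. */
int match_free_terms(const struct player *p,
                     const char *term1, const char *term2)
{
    char full[256];
    snprintf(full, sizeof full, "%s %s", p->first, p->last);
    return strstr(full, term1) != NULL && strstr(full, term2) != NULL;
}

/* "Last, First": the first term is checked only against the last name
   field, the second only against the first name field. */
int match_last_first(const struct player *p,
                     const char *last_term, const char *first_term)
{
    return strstr(p->last, last_term) != NULL &&
           strstr(p->first, first_term) != NULL;
}
```

If “George John Price” is stored with “Price” as the last name, match_free_terms accepts all three example names for the terms George and John, while match_last_first with “John” and “George” accepts only “George John”.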
The player search program already has the ability to limit the search by state and ratings range, as well as to look at current and upcoming regular, quick and blitz ratings.
A search term to check membership dates might be useful, but one to check ages could raise privacy concerns, since that is a web page accessible to the general public.