Optimal pairings. USCF rules, WinTD, and a program I wrote.

In a single-section tournament, the rating ranges of the different scoregroups overlap significantly in the later rounds, and the bottom player of one scoregroup generally has a much lower rating than the top player of the next scoregroup down. You may see something like the 3.5-1.5 players ranging from 1600-2100, the 3-2 players from 1400-2000, the 2.5-2.5 players from 1300-1800, and the 2-3 players from 1200-1700. In such a case, a top player upfloated against a downfloated bottom player will often face a lower-rated opponent than if he had instead been paired within his own scoregroup against a middle player. A top player upfloated against a downfloated middle player will often still face a lower-rated opponent, albeit one similar in rating to the middle-of-the-scoregroup opponent he would have gotten by not floating at all.

When using Harkness, I found that after the first couple of rounds the upfloated player USUALLY had an opponent similar in rating to the opponent of the next, non-floated player (albeit with a half-point score difference between those two opponents).

Of course, if the higher-rated player ALWAYS won, then the upfloated player would usually be paired up, and Harkness just makes the rating difference in such pairings larger.

I do it every year at OleChess, several times a year.

And 50 players for four rounds may cause the organizer to also accelerate the pairings (sixths instead of fourths - split 8v8/9v9/8v8 - and reasonable results would get it down to one perfect score).

[/quote]
Yes, by accident.

I’ve done a bit more work on my program, enough to let me do more manual entries and make some fair comparisons. I have a busy weekend ahead, but I might be able to find time to analyze the “double Harkness” variation and try some other variations.

I found a bug in my color pairings, and after I fixed it, the situation improved. The rulebook assumes you start with top half vs. bottom half and diverge from that in order to solve problems with odd players, due colors, etc. The rules are phrased in terms of the rating variation allowed for such divergence. My method starts with meaningless pairings and converges toward optimal, which inevitably produces something close to top half vs. bottom half. It took some mental gymnastics to rewrite the 200/80 swap rules in terms of converging toward the best result rather than diverging away from the natural one. I’ll see what the effects are as I try it out.
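To make the idea concrete, here’s a toy hill-climbing sketch of a converge-toward-optimal search (not my actual code). The cost function here is a hypothetical stand-in; the real objective would have to encode the Swiss top-half/bottom-half ideal, which this one doesn’t attempt.

[code]
import random

def pairing_cost(pairs, rating, score, played):
    # Hypothetical stand-in objective: keep scoregroups together,
    # penalize big rating gaps, and make rematches prohibitive.
    cost = 0.0
    for a, b in pairs:
        cost += abs(rating[a] - rating[b])
        cost += 5_000 * abs(score[a] - score[b])
        if (a, b) in played or (b, a) in played:
            cost += 100_000
    return cost

def converge(players, rating, score, played, iters=20_000):
    # Start from a "meaningless" (random) pairing, then keep any
    # random swap that lowers the cost until things settle down.
    order = players[:]
    random.shuffle(order)
    best = list(zip(order[::2], order[1::2]))
    best_cost = pairing_cost(best, rating, score, played)
    for _ in range(iters):
        i, j = random.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        trial = list(zip(order[::2], order[1::2]))
        c = pairing_cost(trial, rating, score, played)
        if c < best_cost:
            best, best_cost = trial, c
        else:
            order[i], order[j] = order[j], order[i]  # undo; keep the best so far
    return best
[/code]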

(Edit: I wrote the following post in Microsoft Word, and pasted it here. The nice formatting of course didn’t transfer when I pasted the text below. I should have seen that coming, but I don’t have time to fix it. I have to go to a Chess tournament that starts at noon!)

(Edit again: Tried reformatting. Didn’t work. The forum software seems to think white space is a waste of bits. Also, I don’t know if anyone else has this problem, but when I try to edit after about 15 lines, it’s darned near impossible. It seems to scroll all over the place, but not where I’m typing. Is there some setting somewhere?)

I’ve been working with my program. I think I got the color allocation right now.

I’m very pleased with the results so far. I continued working with an 18-player, 5-round tournament, using randomly selected players from the USCF database, and assuming no unrated players, no byes, no upsets, and no draws. That’s pretty unrealistic, but it’s just a baseline analysis. It took a long time to do stats on this tournament. As time goes on, I’ll be building in automatic statistical calculations and a “simulation” mode that will generate random pairings and random game outcomes based on Elo ratings, but that’s a lot of work. For now, I just ran through the pairings on my program and on WinTD set to Standard Swiss, wrote things down, exported to Excel, and did some stats.

So, how did it do? Pretty darned good. In some ways better than the standard/WinTD method. Mostly, it conformed to my expectations. The rest of this message is a lengthy explanation of findings, so if you aren’t interested in that sort of thing, you can stop reading now. The exact pairings are shown at the end of this message. First I’ll give the analysis.

Comparing overall pairings, as expected, my method produced gradually closer games as the day went on, with the closest games coming in the final round. WinTD compressed the pairings more rapidly, so that the closest overall games were in round 4, with round 5 being less competitive. Overall, WinTD had less ratings variation. The round-by-round ratings variations are as follows. Surprisingly, the standard deviations of ratings variation were generally higher with my program than with WinTD. I’m going to have to see whether that’s a bug or a statistical fluke due to the ratings distribution. I would expect my program to have less variation between games in a given round.

[code]
         Rd 1   Rd 2   Rd 3   Rd 4   Rd 5   Overall
WinTD    1204    546    602    335    568       604
Lame     1204    746    369    498    347       679
[/code]

Color allocation:

How often did someone not get due color? Define an “alternating color fault” as one in which a player had the right number of color assignments but did not alternate (e.g. BWW), and an “equalizing color fault” as one in which the wrong number of each color was played (e.g. WBWW). My program produced two alternating color faults and two equalizing color faults. WinTD produced eight alternating color faults and one equalizing color fault. At the end of the tournament, under both programs, every player had played three of one color and two of the other.
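In code terms, the two fault types amount to this quick check (a Python sketch of the definitions above):

[code]
def color_fault(history):
    # history is one letter per round, e.g. 'BWW' or 'WBWW'.
    w, b = history.count('W'), history.count('B')
    if abs(w - b) > 1:                  # wrong number of each color
        return 'equalizing'
    if any(x == y for x, y in zip(history, history[1:])):
        return 'alternating'            # right count, failed to alternate
    return None

assert color_fault('BWW') == 'alternating'
assert color_fault('WBWW') == 'equalizing'
assert color_fault('WBWB') is None
[/code]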

Intermediate standings:

In a tournament with no upsets, you would expect players to be ranked consistently throughout. With my method, there was never a point at which a higher-rated player had a lower score. With WinTD, such situations began in round 3, when the top player in the losers’ bracket from round 1 also lost in round 2. I haven’t done a full analysis including tiebreakers.

Overall conclusions: I like my method, at least for tournaments with enough rounds to do a complete “sort” of the players. I’ll have to do a lot more analysis, not to mention just making sure I didn’t have a data entry error, but it’s looking good. The one thing I didn’t like was that overall, the games weren’t as competitive, and I’ll have to take a look at why. I’ll definitely have to try all sorts of variations, with byes and draws and unrated players and what happens when players’ “true ability” is higher or lower than the rating. Nevertheless, I think it’s promising.

The pairings are below, round by round, with players identified by their ratings. In each row, the left pair is my method’s pairing and the right pair is WinTD’s.

[code]
Lame Method     WinTD

Round 1:
 673 2225        673 2225
2114  586       2114  586
 457 2062        457 2062
1901  397       1901  397
 285 1305        285 1305
1229  212       1229  212
 114 1179        114 1179
 900  112        900  112
 105  863        105  863

Round 2:
2225 1229       2225 1229
1179 2114       1305 2114
2062  900       2062  900
 863 1901       1179 1901
1305  114        863  673
 212  673        586  285
 586  285        112  457
 112  457        397  114
 397  105        212  105

Round 3:
1901 2225       2062 2225
2114 1305       2114 1901
 673 2062       1305  863
1229  586       1229  586
 457 1179        457 1179
 900  397        900  397
 114  863        673  212
 285  112        285  112
 105  212        114  105

Round 4:
2225 2062       2225 2114
 863 2114       1179 2062
1179 1901       1901 1305
1305  900        863 1229
 397 1229        673  900
 285  673        586  114
 586  105        212  457
 212  457        397  285
 112  114        105  112

Round 5:
2114 2225       1901 2225
2062 1305       2114  900
1901 1229       1229 2062
 673 1179       1305  586
 900  586        397 1179
 457  863        457  863
 212  397        114  673
 114  285        285  105
 105  112        112  212
[/code]

Editing tips:

Try the following, in conjunction with each other. (Either by itself won’t work.)

  1. In Word, use Courier New font for your tabular information. Use spaces (not tabs) to line everything up.

  2. After you paste from Word into your forum reply, ignore the fact that the tabular information doesn’t line up anymore. Enclose the entire table in {code} and {/code} tags (but use square brackets instead of curly braces).

It still won’t look right in the reply window, but once you Submit (or Preview), bingo.

I’m a firm believer in using the keyboard instead of the mouse whenever possible. Various combinations of shift-down-arrow, shift-right-arrow, etc are extremely useful when highlighting for deletion or copying. Then control-C to copy and control-V to paste. With all this, I guess I’m showing my age.

Bill Smythe

18 is pretty close to the “theoretical” number (16) of players for a four-round event. So it’s not surprising that, in your simulation, the climactic games occurred in round 4 rather than round 5.

I think you should speed up the implementation of that simulation feature. Your no-upset, no-draw assumptions could be distorting your entire analysis. For example, if there are just two first-round upsets, both involving white defeating black, the colors will be significantly worse in round 2 than in your “pure” situation. For another example, when players with different scores are paired, the comparison between bottom-vs-top and double-Harkness may look very different in theory than in practice.

Both programs probably would have done much worse in a realistic situation, with upsets and draws.

Also, you may be overlooking the small-tournament effect. With 18 players it may not be so noticeable, but try pairing a theoretical 4-round event with 8 players, making all colors work perfectly in the first three rounds. You won’t like what you’re faced with in round 4. In a small tournament, it’s often better to have a few bad colors in the odd-numbered rounds (alternation) so that things will work better in the even-numbered rounds (equalization).

Bill Smythe

I tried it. I had two equalizing faults in the fourth round. So did WinTD, but it also had two alternating faults in round 3.

I’ve become interested enough that I’m going to have to try to go through with the simulation mode. Ironically, though, I started this in part because I couldn’t figure out how to program the standard method. Now, I want to compare this to the standard method, but in order to do so fairly, I’ll have to program the standard method.

It’ll be fun.

So I’ve got my simulation mode up and running, but I need a bit of advice.

I have a simulation model, and I want to see if anyone has any better ideas.

A real tournament has upsets, and draws, and byes. Here’s how I’m creating them. After a round is paired, each game will choose a random winner: the Elo formula will be used to determine the probability of winning, and a random number will be generated to pick the winner.

First complication: People’s ratings are only an approximation of their true skill, and that will affect pairings. I’m going to generate a “true rating”: the published rating, plus or minus up to 100 (random) points. The “true rating” will be used to determine winners, while the published rating will be used to make pairings.

Second complication: In Chess, white has an advantage. But how much? I’m planning on adding 50 rating points to the white player when determining winners.

Third complication: A draw model. My first cut, and what’s programmed at the moment, was simply to declare a certain number of randomly chosen games drawn. I don’t like the results. So, instead, I’m going to pick a winner using a random number, then add 100 points to the loser’s “true rating” and, using the same random number, see if the winner changes. If it stays the same, the result holds. If it changes, the game is scored as a draw.

Fourth complication: Unrated players. Their “true rating” will be set to the rating of a randomly selected other player in the tournament, plus or minus up to 200 points.
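Putting the four complications together, the per-game logic I have in mind looks roughly like this (a sketch, not the final code; the 50/100/200-point numbers are the guesses described above):

[code]
import random

def win_prob(white, black, white_bonus=50):
    # Standard Elo expectation, with white getting the 50-point bump.
    return 1 / (1 + 10 ** ((black - (white + white_bonus)) / 400))

def true_rating(published):
    # Complication 1: "true" skill is published rating +/- up to 100.
    return published + random.uniform(-100, 100)

def unrated_true_rating(field_ratings):
    # Complication 4: borrow a random entrant's rating, +/- up to 200.
    return random.choice(field_ratings) + random.uniform(-200, 200)

def play_game(true_white, true_black):
    # Complications 2 and 3: pick a winner, then re-test with the SAME
    # random number after giving the loser 100 extra points; a flipped
    # result becomes a draw.
    u = random.random()
    if u < win_prob(true_white, true_black):            # white "wins"...
        if u < win_prob(true_white, true_black + 100):  # ...and it holds
            return 'white'
        return 'draw'
    else:                                               # black "wins"...
        if u >= win_prob(true_white + 100, true_black): # ...and it holds
            return 'black'
        return 'draw'
[/code]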

So that’s how I’m going to simulate a tournament. I’ll use real data from actual events as my base, and run the tournament 1000 times or so.

So now, what do I do? Up until now, I’ve been looking at one tournament at a time and saying, “I like these pairings better than those.” That won’t work for 1000 runs. I need criteria that can be evaluated mathematically. What criteria can be used for “good” pairings?

Here’s what I’ve come up with. First, a player shouldn’t get an advantage from a lucky pairing. With no artificial advantages generated by pairings, the final result will be, on average, that the players finish in the order of their “true rating”. So, at the end of every tournament simulation, a distance function between expected and actual results will be calculated (sum of squares of differences between expected and actual finish order; there’s a sketch of this after the list of criteria). After 1000 runs, we’ll see if the average distance is smaller for different pairing methods.

Second, we’d like games to be competitive. We can judge how close the games are. Closer is better.

Third: Drama. I’m not sure how to measure this one. I don’t think it’s as important as those above. The only thing I’ve thought of is to compare in what round the first- and second-place finishers actually played each other. Later is better. Possibly a closely related metric could be used: in what round is the last “meaningful” game played, i.e. a game that can change the outcome of the tourney, especially the top spot? Ideally, the winner of the game played in the last round on the top board will determine the winner. I haven’t yet come up with the best mathematical calculation to measure “drama”.

Fourth: Color evenness. Do players get due colors? However, maybe this isn’t as important as it seems. I’m already checking for artificial advantages. Maybe this is only a problem if it gives one player an advantage, and that’s already being measured.

Any other criteria for a “good” set of pairings?
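For what it’s worth, the first two criteria are easy to pin down in code (a Python sketch; the function names are mine):

[code]
def finish_distance(by_true_rating, by_final_standing):
    # Criterion 1: sum of squared differences between where each player
    # "should" finish (ranked by true rating) and where he actually did.
    actual = {p: i for i, p in enumerate(by_final_standing)}
    return sum((i - actual[p]) ** 2 for i, p in enumerate(by_true_rating))

def competitiveness(pairings, rating):
    # Criterion 2: average rating gap over all games; smaller is closer.
    gaps = [abs(rating[a] - rating[b]) for a, b in pairings]
    return sum(gaps) / len(gaps)
[/code]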

(Oh, and of course I will do this for the Lame method/double Harkness, and for traditional USCF Swiss. Unfortunately, to do that, I have to get those pairings correct, which is what I was trying to avoid in the first place.)

This isn’t so much a function of the pairing details as it is of the players-to-rounds ratio.

Theoretically, an n-round Swiss is ideal for 2^n players; e.g., 5 rounds should handle 32 players. Due to draws, etc., in practice the number is a bit higher, say around 50 for 5 rounds.

So, when you plan your tournaments, if last-round drama is important to you, you should guesstimate your attendance and then take its base-2 logarithm as your number of rounds.
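In code form, the rule of thumb might look like this (a sketch; the fudge factor is a guess at how much draws stretch the theoretical capacity):

[code]
import math

def rounds_for(players, fudge=1.6):
    # ceil(log2(players)), discounted because draws let a bracket absorb
    # somewhat more players than the theoretical 2**rounds.
    return math.ceil(math.log2(players / fudge))

# rounds_for(50) -> 5 and rounds_for(18) -> 4, matching the figures above.
[/code]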

Bill Smythe

[quote]I believe that Elo has this uncertainty built into the model, so if you add in extra random points to shake things up more, you may actually be adding more randomness than exists in reality.[/quote]

I don’t think so. What happens in real life is that a player plays in a tournament and gets a rating. He then goes home, plays a lot of online chess, reads a book, memorizes an opening or two, whatever. When he shows up at the next tournament, he has a rating of 1017, but based on his work since then, he’s really capable of playing at 1114. Or, maybe he does nothing, stops reading up, and goes to a late night at the bar the night before the next tournament. Under optimal conditions, he will be playing at 1017 level, but on that day, he’s going to be playing at 932.

Even without such factors, there would still be a difference between “true” and published. Consider a computer program that has a rating. The computer won’t get any better or any worse from one tournament to the next. However, its rating will go up and down significantly. And then of course there are sandbaggers that deliberately lower their rating in order to be eligible for a cash prize.

Does any of this really matter? My guess is no, but I want to see. I want to see whether different forms of pairings and color assignments might give slight advantages to one player over another. To do that, I need to be fairly accurate in what I do.

Meanwhile, this little exercise has forced me to take a close look at pairings, instead of just entering data into WinTD and putting it on the wall. I’ve implemented a USCF Swiss Style pairing algorithm, and it isn’t giving the same results as WinTD. (I’ve read that SwissSys is more common in other areas, but around here I don’t think I have ever been to a tournament that didn’t use WinTD, including my own tournaments.)

Working with some sample data, I’m running a sample tournament. In round two, the “zero point scoregroup”, i.e. the players who didn’t win or draw in round one, looks like this:

586 B (Showing the player’s rating, and the color played in the previous round.)
457 W
397 B
285 W
212 B
114 W
112 B
105 W

WinTD produces these pairings:

586 285
112 457
397 114
212 105

I would prefer

586 114
212 457
397 105
112 285
WinTD appears to have swapped 285 and 212, then swapped 114 and 112. I would swap 285 and 114, and then 105 and 112. Both schemes result in perfect color equalization in round 2, but my scheme avoids the top half/bottom half adjustment. WinTD’s scheme requires slightly smaller rating differences in the “swapped” players, but mine results in less player-rank mixup. If all goes according to ratings, my preferred pairings leave the players ranked in rating order, whereas WinTD’s causes player 285 to end up with 0 points while player 212 has one.

Reading the rulebook, it looks like this falls under TD discretion. Both schemes have swaps of under 200 points, so both look legal.

Is one set of pairings clearly inferior? Is one set a violation of the rules?

Moderator Mode: Off

Just for fun, I ran the above example through SwissSys.

The pairings and colors for that same 0-point scoregroup were identical to what WinTD produced.

Reading 29E5a, 29E5c and 29E5e explains why an interchange of less than 80 points is preferable to a transposition of more than 80 points.
WinTD (and probably SwissSys) has an option to set the interchange limit to some number other than 80 (zero, if you want to avoid interchanges if at all possible).

[quote]WinTD appears to have swapped 285 and 212[/quote]
– a 73-point swap –

[quote]then swapped 114 and 112[/quote]
– a 2-point swap.

[quote]I would swap 285 and 114[/quote]
– I assume you mean 212 and 114, a 98-point swap –

[quote]and then 105 and 112[/quote]
– a 7-point swap.

It’s true, a lot of players think they have a God-given right to see top-half-vs-bottom-half. They have no such right, of course, but you can avoid some (meritless) acrimony in the tournament hall by sticking with top-vs-bottom. It has that “cop-out” feel, though: doing what’s smoothest rather than what’s best.

This argument is weak. The odds of all four results (in your eight-player group) coming out “according to ratings” are less than 50%. For example, even if each favorite wins 80% of the time, all four games go by the book only about 41% of the time (0.8^4).

Bill Smythe

The rulebook seems to contradict itself. In 29E5d it says, “While interchanges are sometimes necessary, they should not be used if adequate transpositions are possible.” The next section gives an example where adequate transpositions are possible, but says you should use an interchange anyway.

Taking the rules together, and trying to make sense of them, it seems they are saying that you should first look for 80-point transpositions, then an 80-point interchange. If there are still problems, then 200-point transpositions are allowed, and, if all else fails, a 200-point interchange is OK. That also matches the behavior of WinTD (and apparently SwissSys), so it looks like another night of coding ahead. And, based on 29E5g, an unrated player can be transposed or interchanged at will. Unless, of course, you go with variation 29E5h, in which case you can interchange or transpose anyone you feel like in order to get the colors right.
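So the coding job, as I read 29E5, boils down to a priority cascade. Something like this (a sketch; generating the candidate swaps that actually fix the colors is the real work and is omitted here):

[code]
# Each candidate fix is (kind, point_diff, resulting_pairs), where kind is
# 'transposition' (within a half) or 'interchange' (across the halves).
PRIORITY = [('transposition', 80),
            ('interchange', 80),
            ('transposition', 200),
            ('interchange', 200)]

def choose_fix(candidates):
    # 80-point transpositions first, then 80-point interchanges, then
    # 200-point transpositions, then 200-point interchanges.
    for kind, limit in PRIORITY:
        eligible = [c for c in candidates if c[0] == kind and c[1] <= limit]
        if eligible:
            return min(eligible, key=lambda c: c[1])  # smallest swap wins
    return None  # no legal fix; live with the bad colors
[/code]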

But it’s still more likely than any other outcome.

This is why I want to get the baseline, “standard” implementation correct, and then run simulation modes. After 1000 runs, using reasonable models for draws and upsets, will some pairing systems end up with objectively better outcomes? If I lower the threshold for allowing interchanges, will it result in penalizing some players unfairly?

Speaking of which, I thought of another metric for evaluating pairings, and that is decisiveness. People would prefer prizes handed out based on points earned, rather than tiebreakers, which gives rise to accelerated pairings, and perhaps other methods. The question will be whether the benefit of having a single undisputed winner is purchased by having a bit more randomness in the outcomes. Time will tell.

Tiebreakers are not used for cash prizes, only for indivisible prizes, such as merchandise or trophies.

Accelerated pairings can either increase or decrease the chances of a single winner. If you have 5 rounds and 32 players, you’re spot on with regular pairings. With accelerated, in theory you’ll have a single leader after 4 rounds, and if that player now loses in round 5, you may end up with as many as three or four players tied at 4-1.

Don’t forget, also, that a single winner is not the same as a single perfect score. If the top two or three boards all draw in the last round, you’ll likely end up with a gazillion-way tie for first.

It’s virtually impossible to predict the likelihood of a single winner, based on number of rounds, number of players, pairing details, or anything else.

Bill Smythe

These pairings are going to drive me crazy.

I changed my “USCF style” pairings to first check for 80-point transpositions, then 80-point interchanges, then 200-point transpositions for equalizing, then 200-point interchanges for equalizing. Great. Now, using my sample data, my round 2 matches WinTD’s round 2. On to round 3.

Looking at the top scoregroup, there are four players. None have played each other. Ratings and color history below.

2225 BW
2114 WB
2062 BW
1901 WB

Top half/bottom half and alternating colors gives

2062 2225
2114 1901

That’s where WinTD leaves it, but then 2062 and 1901 don’t get due colors. Looking at the bottom of the top half and the top of the bottom half, we have 2114 and 2062. That’s a 52-point difference, and if we make that interchange, we get

2114 2225
1901 2062

Everyone has due color, and it was done with an interchange of less than 80 points. It seems like the rules say to make the swap. Any reason WinTD shouldn’t do it?

(ETA: Fixed mistyped ratings above, and gave people proper colors.)