There's been a lot of talk over the last decade about how ratings calculations and estimated win percentages are off and deter highly ranked players from entering tournaments where they would face lower-rated players. In 1999, the NSA Ratings Committee made this website describing research into alternatives, but no changes have been implemented. Note that much of the research follows up on the dissertation work of the Committee's chair Robert Lewis Parker, but links to the dissertation or other materials at unm.edu are all dead (though some web archive sites have preserved some of the relevant materials).
Chris Lipe supplied this abstract for Parker's dissertation:
SMALL SAMPLE AND COMPUTATIONAL IMPROVEMENTS FOR PAIRED COMPARISONS (SAMPLE SIZE) (PARKER, ROBERT LEWIS) Abstract: This dissertation is primarily concerned with the analysis of tournaments of two-player competitions: paired competitions. As such, it is a study of the method of paired comparisons. The method of paired comparisons is appropriate in many situations, including (i) when pairwise comparison is the only available method of judging treatments, (ii) when the cost or complexity of obtaining and analyzing absolute measurements is prohibitive, and (iii) when the outcomes of pairwise competitions are not easily modeled as a function of competitor scores or judged merits. This dissertation offers advances in each of these three cases. For the first case, we improve on the current computational method of parameter estimation for the Bradley-Terry model, and provide a general computational algorithm for the general method. We investigate an alternative to ML parameter estimation (the "Performance Rating" method) for paired comparisons models, complementing the result of [Ste92] that the ordering of the treatments is affected very little by the choice of paired comparisons model--we show that there is also very little difference resulting from the alternative parameter estimation method. We go on to show that the large-sample methods for obtaining precision estimates provided by Bradley [Bra76], perform well for small sample sizes but for only one of his two proposed parameterizations. We present the Win Switching Reestimation (WSR) method, an alternative method for finding confidence intervals for parameters. The WSR method competes closely with the accuracy of appropriate large sample methods over the complete range of sample sizes and tournament configurations considered, and in some cases it is better. For the second case, we investigate the cost of paired comparisons vs. absolute or relative numerical measurements. 
We show that established large sample results are appropriate for statement of the inefficiency of paired comparisons for small samples, but only when the paired measurements are uncorrelated. We also give inefficiencies for the case of absolute vs. pairwise relative numerical measurements. For the third case, we present analyses of two large tournaments (1997 World Scrabble Championships and the 1998 National Basketball Association season) in which scores for the two players (teams) in each game are available. For a bivariate normal model, we develop a parameter estimation algorithm that is less complex than the standard linear models approach.
Some of this research sounds similar to work done by Mark Glickman, especially "Parameter estimation in large dynamic paired comparison experiments" (Applied Statistics, 1999, 48, pp. 377-394). (There is a less technical description as well.) "Parameter estimation" is an interesting paper in which Glickman applies a series of approximations to the full maximum likelihood model and winds up with a formula almost identical to the Elo formula (on which the current flawed NSA system is based). The simulations in the paper show the formula performs reasonably well. I don't say tolerably well, because it is not clear to me that the approximations produce results that are within an acceptable tolerance.
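For concreteness, the Elo-style update that Glickman's approximations converge toward looks roughly like the sketch below. This is the classical logistic-curve version of the update, not the NSA's exact implementation; the K-factor of 20 is purely illustrative.

```python
def elo_update(r, r_opp, score, k=20):
    """One Elo-style rating update: new rating = r + K * (actual - expected).

    The expected score uses the logistic curve with scale 400, so an
    equally rated opponent gives an expectation of 0.5. The K-factor of 20
    is an illustrative choice, not the NSA's multiplier.
    """
    expected = 1 / (1 + 10 ** ((r_opp - r) / 400))
    return r + k * (score - expected)

# A win against an equal-rated opponent moves the rating up by K/2.
print(elo_update(1500, 1500, 1))  # 1510.0
```

Note the zero-sum flavor: a win and a loss against an equal opponent move the rating by equal and opposite amounts.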

Another alternative to Elo is promoted by Jeff Sonas. The graphic below (taken from "The Sonas Rating Formula Better than Elo?") shows a large number of chess games played by rated players. It seems that Sonas is (intentionally or not) conflating the first-mover advantage with a purported underlying linearity in the data. It is clear that White has a 50-point advantage (measured in ratings points), which is not captured by the Elo curve, making the Elo curve a straw man. This is easy to fix by mentally moving the Elo curve on that graph to the left by 50 points. That is, the 50% point on the Elo curve should pass through the 50% mark in the real data, or equivalently: the curve should reflect the empirical advantage to moving first.

Once you have made that correction, the difference between the logistic curve and a straight line is minimal over the range from -350 to 350. Nowhere over this range would the projected win percentages of the two models differ by more than 2%. I would guess, based on the graphs to follow, that it would be hard to prefer the linear model if Sonas extended his graph 100 points in either direction (though a linear system is used by the Association of British Scrabble Players, capped at a 40-point differential, or about a 400 NSA-rating-point difference).

Looking at the Elo-based NSA ratings curve (and a slight modification implemented in simulations by Robert Parker) together with a linear approximation reveals a similar pattern. If we define a piecewise linear function p=max(min(.5+D/800,1),0), where D is the ratings difference and p the estimated probability of winning, it differs by less than 2% from the current NSA ratings curve over the range from -300 to 300.
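This comparison is easy to check numerically. The sketch below assumes the NSA curve is the classical Elo normal ogive with a per-player standard deviation of 200 (so the difference in performances has standard deviation 200*sqrt(2)); that parameterization is an assumption on my part, chosen because it reproduces the horizontal-line probabilities quoted later in the text.

```python
import math

def nsa_curve(d):
    """Assumed NSA ogive: normal CDF of the ratings difference d with
    sd 200*sqrt(2), i.e. P(win) = 0.5 * (1 + erf(d / 400)).
    This parameterization is an assumption, not an official specification."""
    return 0.5 * (1 + math.erf(d / 400))

def piecewise_linear(d):
    """The piecewise linear function from the text: p = max(min(.5 + D/800, 1), 0)."""
    return max(min(0.5 + d / 800, 1.0), 0.0)

# Largest gap between the two curves over the range -300..300.
max_diff = max(abs(nsa_curve(d) - piecewise_linear(d)) for d in range(-300, 301))
print(round(max_diff, 4))
```

Under this parameterization the largest gap occurs at the endpoints of the range and stays just under the 2% figure claimed above.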

This piecewise linear function would hit 1 at 400 and stay there for any higher ratings difference. Symmetrically, it would hit zero at -400 and stay there for any larger ratings deficit. It could easily be modified to turn horizontal at a lower probability, say 95%, but the abrupt shift from a steep slope to a zero slope seems a bit odd, and the deviation of any such function from observed win proportions would likely be huge for large ratings differences (more than 350, say).
But the fact that it has these nondifferentiable kinks (i.e. points where the curve shifts abruptly from a positive slope to a zero slope, or vice versa) should not necessarily rule it out, since the existing NSA ratings curve becomes a horizontal line at a probability of .9933358 (for ratings advantages of 700 or greater) and .0066642 (for ratings deficits of 700 or greater). To be perfectly clear: it is not the kink in the curve that makes the linear or NSA schedule unattractive; it is their poor fit to real data. The same would be true of an NSA curve modified to cap projected win probabilities at 85% (as proposed on cgp).
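Those cap values are consistent with a normal ogive using a per-player standard deviation of 200 (the classical Elo assumption), evaluated at a 700-point difference. This is an inference from the quoted numbers, not an official NSA specification, but it can be checked in two lines:

```python
import math

# Assuming the NSA curve is the classical Elo normal ogive with per-player
# sd 200, the difference in performances has sd 200*sqrt(2), and
# P(win) = Phi(D / (200*sqrt(2))) = 0.5 * (1 + erf(D / 400)).
cap_high = 0.5 * (1 + math.erf(700 / 400))  # probability at D = +700
cap_low = 1 - cap_high                      # probability at D = -700
print(round(cap_high, 7), round(cap_low, 7))  # 0.9933358 0.0066642
```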
There is certainly no clear reason to prefer a linear function, since software would do all the calculations anyway. The logistic curve is easy to do calculations with, too, and has the dubious advantage of looking like a known probability cumulative distribution function (CDF)--I say dubious because there is no real reason the expectations curve has to look like a CDF, though it is appealing to think it should be zero on the left, one on the right, and be nondecreasing in between, and perhaps have some limited continuity properties (all four properties of a CDF). Unfortunately, none of these curves can really provide good estimates of the win probability for two players with ratings X and Y.

The central problem arises from using only the difference in ratings. Assume you have parametrized the logistic so it fits the empirical win probability of a 1300 rated player playing players rated from 600 to 2000. You now have a win probability defined over the range of all ratings differentials from -700 to 700. Will this fit the empirical win probability of an 1100 rated player playing players rated from 600 to 1800? There is no reason to think it will--the rating numbers are a rank ordering, not an interval-ratio-level measurement (i.e. they are not like temperature in degrees Kelvin, where the distance from 273 to 274 has the same physical interpretation as the distance from 300 to 301).

There is a solution--define the projected win probability as a function of two ratings, Player 1's and Player 2's (it makes sense to keep track of which player moved first, since even though the game is not solvable through backwards induction the way chess is, there does seem to be an advantage to playing first). Once you define the projected probability of winning for every pair of ratings, the ratings scheme should be nearly self-enforcing, in the sense that the empirical win percentages should not move around too much, and the projected win probabilities will by definition be close to observed proportions for all possible matchups.

Note that estimating the projected win probability as a function of two ratings implies fitting a surface to the data, rather than a curve, so a 200-point advantage would imply a different probability of winning for two players rated 1300 and 1100 than two players rated 1600 and 1400. One way of doing this would be to estimate kernel regressions of win on opponent's rating, for first and second movers, for each ratings level, using tournament data over a medium-length timespan, say over a two-year window. Kernel regression, or local polynomial regression, is a non-parametric (this should be in quotes, because of course there are parameters, but that's what the techniques are called) way of fitting a curve to data. The main parameter is the bandwidth of the kernel, which controls how smooth the fitted curve looks, but the point is mainly that you do not constrain the curve to be in the linear, logistic, or other family of functions--it is free to match the data very closely.
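As a concrete sketch, a Nadaraya-Watson kernel regression of game outcomes on opponent rating takes only a few lines. The data here are synthetic, generated from an assumed ogive purely for illustration, and the bandwidth of 60 rating points is an arbitrary choice, not a recommendation:

```python
import math
import random

def kernel_regression(x0, xs, ys, h):
    """Nadaraya-Watson estimate of E[y | x = x0] with a Gaussian kernel.

    xs: opponent ratings; ys: game outcomes (1 = win, 0 = loss);
    h: bandwidth controlling how smooth the fitted curve looks.
    """
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)

# Synthetic illustration: a 1500-rated player's true win probability against
# an opponent rated x is taken (as an assumption) to be a normal ogive.
random.seed(1)
true_p = lambda x: 0.5 * (1 + math.erf((1500 - x) / 400))
xs = [random.uniform(1100, 1900) for _ in range(4000)]
ys = [1 if random.random() < true_p(x) else 0 for x in xs]

p12 = kernel_regression(1200, xs, ys, h=60)
p15 = kernel_regression(1500, xs, ys, h=60)
p18 = kernel_regression(1800, xs, ys, h=60)
print(round(p12, 3), round(p15, 3), round(p18, 3))
```

Nothing constrains the fitted values to lie on a logistic or linear curve; with enough games at each ratings level, the estimates simply track the observed win proportions.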

Some preliminary investigations using 7 years of tournament data from cross-tables.com demonstrate that the ogive ratings curve used by the NSA seems to systematically underestimate the winning chances of the lower-rated player and systematically overestimate the winning chances of the higher-rated player. This may be due to the total absence of any model of luck in the theory underlying the Elo system (developed for chess, where there is essentially no luck component). A model that includes luck would likely predict that the mean influence of chance on the expected outcome of the game differs across pairs of ratings (i.e. stronger players rely less on lucky draws), which would mean that the standard deviation used in an Elo-style model should vary across ratings levels. That would produce either curves with different standard deviations for each ratings level (for an example of what a ratings curve might look like using a single fixed standard deviation parameter that exceeds the NSA parameter by a factor of two, click here) or, if the standard deviation should vary with both players' ratings, curves that look nothing like a cumulative normal.
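To see how a larger spread parameter flattens the curve, compare an assumed NSA-style ogive against one whose standard deviation is doubled. Both parameterizations are illustrative assumptions (scale 400 corresponds to the classical per-player sd of 200):

```python
import math

def ogive(d, scale=400):
    """Normal-ogive win probability for a ratings advantage d.

    scale=400 corresponds to a per-player sd of 200 (difference sd
    200*sqrt(2)); doubling the sd doubles the scale, flattening the curve
    and giving the lower-rated player better projected chances.
    """
    return 0.5 * (1 + math.erf(d / scale))

for d in (100, 300, 500):
    print(d, round(ogive(d), 3), round(ogive(d, scale=800), 3))
```

At every positive ratings difference, the doubled-sd curve projects a smaller win probability for the favorite, which is the direction the tournament data above suggest.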

Note, however, that any alternative ratings curve (estimated from past data, say) would have to be applied to past events in some way (though the pairings would have been different in past tournaments under the counterfactual hypothesis of using a different ratings curve) to see if a proposed alternative continued to match past performance once "revised" ratings were calculated. Tricky stuff.

To paraphrase an email from Steven Alexander, it is possible that the problem is not the rating curve. John Chew's tests showed a corrected curve would need to be repeatedly corrected, and a discussion in The Mathematics of Games, by John D. Beasley (1989 Oxford), Ch. 5 (which was copied to cgp on 12 Dec 2003) argued that the families of ratings systems under consideration are not particularly sensitive to the choice of a ratings curve. The system's problems can be overstated compared to what else is possible because the desired qualities are inconsistent.

In a sense, I agree with both of these points. If the NSA committed to using a new ratings curve, whether a cumulative normal, or a logistic, or a piecewise linear curve, it would make little difference in the long run (say, after a year or so). Individual ratings might be different, but percentile ranks should be largely unchanged (at least, in the thought experiment where ability is largely unchanged), because the interpretation of a rating X, or a ratings advantage N, would change just enough to drive everyone back to the same situation. If you want to match the empirical distribution of win percentages by ratings differentials, and you change the ratings curve, then you change what a given ratings advantage means, and ratings will change, perhaps until you are right back in a similar position as you were before. I think this is part of what Steven Alexander meant by "the system's problems can be overstated compared to what else is possible because the desired qualities are inconsistent." The other part might concern the occasional proposals to modify the multiplier, or bonus points, or what have you.

Let me call "hypothesis A" the notion that "a change to the ratings curve would be followed by a short period where actual win percentages match predicted ones, followed by a gradual return to the previous situation, as described above." Now let me state a conjecture: hypothesis A is true when the same ratings curve applies to all ratings levels, but not otherwise. I think this is true because a ratings system that assigns numbers and treats them as interval-level data (when they are in fact ordinal) is going to run into this problem. But if you allow the ratings curve to differ by ratings level, you are treating the numbers as a kind of hybrid between interval-level and ordinal-level data. If you didn't understand this discussion, stay tuned. Once I work out the details, I'll write up a version that someone who hasn't taken a statistics course will find easy to understand.