Chris Lipe supplied this abstract for Parker's dissertation:

SMALL SAMPLE AND COMPUTATIONAL IMPROVEMENTS FOR PAIRED COMPARISONS (SAMPLE SIZE) (PARKER, ROBERT LEWIS) Abstract: This dissertation is primarily concerned with the analysis of tournaments of two-player competitions: paired competitions. As such, it is a study of the method of paired comparisons. The method of paired comparisons is appropriate in many situations, including (i) when pairwise comparison is the only available method of judging treatments, (ii) when the cost or complexity of obtaining and analyzing absolute measurements is prohibitive, and (iii) when the outcomes of pairwise competitions are not easily modeled as a function of competitor scores or judged merits. This dissertation offers advances in each of these three cases. For the first case, we improve on the current computational method of parameter estimation for the Bradley-Terry model, and provide a computational algorithm for the general method. We investigate an alternative to ML parameter estimation (the "Performance Rating" method) for paired comparisons models, complementing the result of [Ste92] that the ordering of the treatments is affected very little by the choice of paired comparisons model--we show that there is also very little difference resulting from the alternative parameter estimation method. We go on to show that the large-sample methods for obtaining precision estimates provided by Bradley [Bra76] perform well for small sample sizes, but for only one of his two proposed parameterizations. We present the Win Switching Reestimation (WSR) method, an alternative method for finding confidence intervals for parameters. The WSR method competes closely with the accuracy of appropriate large sample methods over the complete range of sample sizes and tournament configurations considered, and in some cases it is better. For the second case, we investigate the cost of paired comparisons vs. absolute or relative numerical measurements.
We show that established large sample results are appropriate for stating the inefficiency of paired comparisons for small samples, but only when the paired measurements are uncorrelated. We also give inefficiencies for the case of absolute vs. pairwise relative numerical measurements. For the third case, we present analyses of two large tournaments (the 1997 World Scrabble Championships and the 1998 National Basketball Association season) in which scores for the two players (teams) in each game are available. For a bivariate normal model, we develop a parameter estimation algorithm that is less complex than the standard linear models approach.
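The Bradley-Terry estimation the abstract refers to can be illustrated with the classical minorization-maximization (Zermelo) iteration for the maximum-likelihood strengths. This is a generic sketch of the standard method, not Parker's improved algorithm, and the small win matrix is invented for illustration:

```python
# Minimal sketch of maximum-likelihood estimation for the Bradley-Terry
# model via the classical minorization-maximization (Zermelo) iteration.
# The win matrix below is invented illustration data.

def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times player i beat player j.
    Returns strength parameters p, normalized to sum to 1."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for player i
            # Sum over opponents of (games played) / (p_i + p_j)
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new_p.append(w_i / denom if denom else p[i])
        s = sum(new_p)          # normalize so strengths sum to 1
        p = [x / s for x in new_p]
    return p

wins = [[0, 3, 4],   # player 0 beat player 1 three times, player 2 four times
        [1, 0, 3],
        [0, 1, 0]]
strengths = bradley_terry(wins)
```

Under the Bradley-Terry model, player i then beats player j with probability p_i / (p_i + p_j); the iteration converges whenever the win-loss graph is strongly connected, as it is here.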

Some of this research sounds similar to work done by Mark Glickman, especially "Parameter estimation in large dynamic paired comparison experiments" (Applied Statistics, 1999, 48, pp. 377-394). (There is a less technical description as well.) "Parameter estimation" is an interesting paper in which Glickman applies a series of approximations to the full maximum likelihood model and winds up with a formula almost identical to the Elo formula (on which the current flawed NSA system is based). The simulations in the paper show that the formula performs reasonably well. I don't say tolerably well, because it is not clear to me that the approximations produce results within an acceptable tolerance.
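For reference, the Elo-style update that Glickman's approximations arrive at has the following shape. The logistic expectation and K-factor below use the conventional chess values, which are assumptions here; the NSA curve is a normal ogive rather than a logistic, so this is illustrative only:

```python
# Classical Elo-style rating update: new rating moves in proportion to
# (actual score - expected score). Scale 400 and K = 32 are the
# conventional chess values, used here purely for illustration.

def elo_expected(r_a, r_b):
    """Expected score for player A under a logistic curve with scale 400."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def elo_update(r_a, r_b, score_a, k=32.0):
    """New rating for A after scoring score_a (1 = win, 0.5 = draw, 0 = loss)."""
    return r_a + k * (score_a - elo_expected(r_a, r_b))
```

Note that the update depends on the opponent only through the ratings difference, a point that matters later in this discussion.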

Another alternative to Elo is promoted by Jeff Sonas. The graphic below (taken from "The Sonas Rating Formula – Better than Elo?") shows a large number of chess games played by rated players. It seems that Sonas is (intentionally or not) conflating the first-mover advantage with a purported underlying linearity in the data. It is clear that White has roughly a 50-point advantage (measured in ratings points), which is not captured by the Elo curve as plotted, making that curve a straw man. This is easy to fix by mentally shifting the Elo curve on the graph 50 points to the left. That is, the 50% point on the Elo curve should pass through the point where the real data cross 50%, or equivalently: the curve should reflect the empirical advantage of moving first.

Looking at the Elo-based NSA ratings curve (and a slight modification implemented in simulations by Robert Parker) alongside a linear approximation shows a similar pattern. If we define a piecewise linear function p = max(min(0.5 + D/800, 1), 0), where D is the ratings difference and p the estimated probability of winning, it differs by less than 2% from the current NSA ratings curve over the range from -300 to 300.
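The piecewise linear approximation is easy to state in code, and comparing it against a cumulative-normal ogive shows how similar the two shapes are over the middle of the range. The spread parameter `sigma` below is a made-up stand-in, not the NSA's actual value:

```python
import math

def win_prob_linear(d):
    """Piecewise linear approximation from the text: p = max(min(0.5 + D/800, 1), 0)."""
    return max(min(0.5 + d / 800.0, 1.0), 0.0)

def win_prob_ogive(d, sigma=290.0):
    """Cumulative-normal ('ogive') win probability for ratings difference d.
    sigma = 290 is a hypothetical spread, chosen only for illustration."""
    return 0.5 * (1.0 + math.erf(d / (sigma * math.sqrt(2.0))))

# Largest discrepancy between the two curves over the -300..300 range
# discussed in the text (for this assumed sigma, not the NSA curve itself):
max_gap = max(abs(win_prob_linear(d) - win_prob_ogive(d))
              for d in range(-300, 301))
```

With this assumed sigma the two curves stay within a few percentage points of each other across the range, which is the kind of closeness the 2% figure above describes for the actual NSA curve.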

The central problem arises from using only the *difference* in ratings. Assume you have parametrized the logistic so that it fits the empirical win probability of a 1300-rated player playing players rated from 600 to 2000. You now have a win probability defined over the range of all ratings differentials from -700 to 700. Will this fit the empirical win probability of an 1100-rated player playing players rated from 600 to 1800? There is no reason to think it will--the rating numbers are a rank ordering, not an interval- or ratio-level measurement (i.e. they are not like temperature in degrees Kelvin, where the distance from 273 to 274 has the same physical interpretation as the distance from 300 to 301).
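The point about differences can be made concrete: any curve that is a function of D = r_A - r_B alone must assign identical win probabilities to every pairing with the same gap, regardless of absolute level. The logistic below is just one such curve, with a conventional scale of 400 chosen for illustration:

```python
def win_prob_diff_only(r_a, r_b, scale=400.0):
    """Any difference-only curve; here a logistic with an assumed scale of 400."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))

# A 1300 playing a 1200 and an 1800 playing a 1700 get the same prediction,
# whether or not the empirical data support that.
p_low = win_prob_diff_only(1300, 1200)
p_high = win_prob_diff_only(1800, 1700)
```

If the empirical win percentages for those two pairings differ, no difference-only curve can fit both.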

Some preliminary investigation using 7 years of tournament data from cross-tables.com suggests that the ogive ratings curve used by the NSA systematically underestimates the winning chances of the lower-rated player and systematically overestimates those of the higher-rated player. This may be due to the total absence of any model of luck in the theory underlying the Elo system (which was developed for chess, where there is essentially no luck component). A model that includes luck would likely predict that the mean influence of chance on the expected outcome of a game differs across pairs of ratings (i.e. stronger players rely less on lucky draws). That would mean the standard deviation used in an Elo-style model should vary across ratings levels, producing either curves with a different standard deviation for each ratings level (for an example of what a ratings curve might look like using a single fixed standard deviation parameter that exceeds the NSA parameter by a factor of two, click here) or, if the standard deviation should vary with both players' ratings, curves that look nothing like a cumulative normal.
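To see how a luck term would reshape the curve, here is an ogive evaluated at two spread parameters, the second twice the first. Both values are hypothetical, standing in for "the NSA parameter" and its doubling mentioned above; the wider spread pulls every prediction toward 50%, raising the lower-rated player's chances at every ratings difference:

```python
import math

def ogive(d, sigma):
    """Cumulative-normal win probability for ratings difference d, spread sigma."""
    return 0.5 * (1.0 + math.erf(d / (sigma * math.sqrt(2.0))))

# Hypothetical spreads: 200 as a stand-in for the current parameter,
# 400 for the doubled one. Doubling sigma flattens the curve toward 50%.
flattening = [(d, ogive(d, 200.0), ogive(d, 400.0)) for d in (100, 200, 300)]
```

A model where sigma depends on both players' ratings would amount to a different one of these curves for every pairing, which is why it need not resemble any single cumulative normal.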

Note, however, that any alternative ratings curve (estimated from past data, say) would have to be applied to past events in some way, to see whether it continued to match past performance once "revised" ratings were calculated (though under the counterfactual hypothesis of a different ratings curve, the pairings in past tournaments would have been different). Tricky stuff.

To paraphrase an email from Steven Alexander, it is possible that the problem is **not** the rating curve. John Chew's tests showed that a corrected curve would need to be repeatedly corrected, and a discussion in *The Mathematics of Games* by John D. Beasley (Oxford, 1989), Ch. 5 (which was copied to cgp on 12 Dec 2003), argued that the families of ratings systems under consideration are not particularly sensitive to the choice of a ratings curve. The system's problems can be overstated **compared to what else is possible** because the desired qualities are inconsistent.

In a sense, I agree with both of these points. If the NSA committed to using a new ratings curve, whether a cumulative normal, a logistic, or a piecewise linear curve, it would make little difference in the long run (say, after a year or so). Individual ratings might be different, but percentile ranks should be largely unchanged (at least in the thought experiment where ability is largely unchanged), because the interpretation of a rating X, or a ratings advantage N, would change just enough to drive everyone back to the same situation. If you want to match the empirical distribution of win percentages by ratings differentials, and you change the ratings curve, then you change what a given ratings advantage means, and ratings will change, perhaps until you are right back in much the same position as before. I think this is part of what Steven Alexander meant by "the system's problems can be overstated **compared to what else is possible** because the desired qualities are inconsistent." The other part might concern the occasional proposals to modify the multiplier, or bonus points, or what have you.

Let me call "hypothesis A" the notion that a change to the ratings curve would be followed by a short period in which actual win percentages match predicted ones, followed by a gradual return to the previous situation, as described above. Now let me state a conjecture: hypothesis A is true when the same ratings curve applies at all ratings levels, but not otherwise. I think this is true because a ratings system that assigns numbers and treats them as interval-level data (when they are in fact ordinal) is going to run into this problem. But if you allow the ratings curve to differ by ratings level, you are treating the numbers as a kind of hybrid between interval-level and ordinal-level data. If you didn't follow this discussion, stay tuned. Once I work out the details, I'll write up a version that someone who hasn't taken a statistics course will find easy to understand.