Message #28771 at
http://games.groups.yahoo.com/group/crossword-games-pro/messages/28771
From: Steven Alexander
Date: Thu Dec 12, 2002 10:52 am
Subject: Ratings systems
The ratings system's drawbacks will be overcome, whether within the current
framework or by creating a wholly new one, only with serious statistical study.
While what's been posted here recently (which I'll read more thoroughly than
I already have[1]) is valuable, anyone who really wants to think about ratings should
read up on some very good published work.
Before the constructive work noted below, please absorb the attached excerpt from
Chapter 5 of John D. Beasley's 1989 book "The Mathematics of Games" (Oxford Univ
Pr, ISBN 0-19-286107-7), entitled "If A Beats B, and B Beats C ..." (Actually,
the section of Ch. 5 about non-transitivity is the only one not reproduced.)
Though I am open to other possibilities, I currently consider the strongest ones to be
(1) a system pairing (rating; uncertainty), where the rating component is similar
to current ratings (perhaps enough to use the current scale) and the uncertainty
measures how unsure the first component is to be close to the "true" rating; and
(2) a score-based system, in the form (score, defense) [comma-separated because
these two numbers are in the same units: points].
The first kind has been developed for the US Chess Federation, the "Glicko"
system, and for table tennis, the Marcus system. I am collecting references
to essential reading for Rating Committee members and other arguers; for now,
starting from www.glicko.com and www.davidmarcus.com should lead to extensive
details on these two.
The second kind is, of course, examined in Robert Parker's writings. (I'll
assemble these with other available links and some copies of papers for all
concerned.)
While I very much like the Parker-type system, its adoption or other changes will
depend on evidence -- of how good the new system would be, not how bad the old
one was, if I have anything to say about it. This will involve running
historical data (win-loss for modifications of the current system, but as Joe
Edley noted, score data not yet collected will be necessary for the Parker) to
test at least how predictive a system would have been had it been in effect
before the games it is asked to predict. Also to be evaluated are the deserved stability of
players' ratings and the degree of any undesired incentives; and with a Parker-
type system, how much is gained by adding factors other than offense and defense
(that arise not from the inherent meaning of the measures, but from imperfect
match of the otherwise elegant system with reality).
Enjoy reading.
Steven Alexander
NSA Ratings Committee member
[1] Those who just criticize, many assuming that their desires
are both consistent among themselves and consistent with
others' priorities, might benefit most by reading and
learning. Those publishing data and experiments here
already are thinking concretely about the problem.
----------
The Mathematics of Games
John D. Beasley
Oxford Univ Pr 1989
ISBN 0-19-286107-7
Chapter 5 If A Beats B, and B Beats C ...
[all but the last section of Ch. 5; pages 47-61]
In the previous chapter, we looked at some of the pseudo-random effects which
appear to affect the results of games. We now attempt to measure the actual
skill of performers. There is no difficulty in finding apparently suitable
mathematical formulae; textbooks are full of them. Our primary aim here is to
examine the circumstances in which a particular formula may be valid, and to note
any difficulties which may attend its use.
The assessment of a single player in isolation
----------------------------------------------
We start by considering games such as golf, in which each player records an
independent score. In practice, of course, few competitive games are completely
free from interactions between the players; a golfer believing himself to be two
strokes behind the tournament leader may take risks that he would not take if he
believed himself to be two strokes ahead of the field. But for present purposes,
we assume that any such interactions can be ignored. We also ignore any effects
that external circumstances may have on our data. In Chapter 4, we were able to
adjust our scores to allow for the general conditions pertaining to each round,
because the pooling of the scores of all the players allowed the effect of these
conditions to be assessed with reasonable confidence. A sequence of scores from
one player alone does not allow such assessments to be made, and we have little
alternative but to accept the scores at face value.
To fix our ideas, let us suppose that a player has returned four separate scores,
say 73, 71, 70, and 68 (Figure 5.1). If these scores were recorded at
approximately the same time, we might conclude that a reasonable estimate of his
skill is given by the unweighted mean 70.5 (U in Figure 5.1). This is
effectively the basis on which tournament results are calculated. On the other
hand, if the scores were returned over a long period, we might prefer to give
greater weight to the more recent of them. For example, if we assign weights
1:2:3:4 in order, we obtain a weighted mean of 69.7 (W in Figure 5.1). More
sophisticated weighting, taking account of the actual dates of the scores, is
also possible.
73 *---------------
| | | |
----------------
| | | |
71 -----*----------
| | | | <-- U (70.5)
70 ----------*-----
| | | | <-- W (69.7)
----------------
| | | |
68 ---------------*
Figure 5.1 Weighted and unweighted means
So, we see, right from the start, that our primary need is not a knowledge of
abstruse formulae, but a commonsense understanding of the circumstances in which
the data have been generated.
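As a quick check of the two means quoted for Figure 5.1, here is a minimal
Python sketch. The function name weighted_mean is illustrative only and does
not come from Beasley's text.

```python
def weighted_mean(scores, weights=None):
    """Mean of scores; with no weights given, every score counts equally."""
    if weights is None:
        weights = [1] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("need one weight per score")
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

scores = [73, 71, 70, 68]                  # oldest first, as in Figure 5.1
u = weighted_mean(scores)                  # unweighted mean: 70.5
w = weighted_mean(scores, [1, 2, 3, 4])    # recent scores count more: 69.7
print(u, w)
```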
Now let us assume that we already have an estimate, and that the player returns
an additional score. Specifically, let us suppose that our estimate has been
based on n scores s_1, ..., s_n, and that the player has now returned an
additional score s_{n+1}. If we are using an unweighted mean based on the n most
recent scores, we must now replace our previous estimate
(s_1+...+s_n)/n
by a new estimate
(s_2+...+s_{n+1})/n;
the contribution of s_1 vanishes, the contributions from s_2,...,s_n remain
unchanged, and a new contribution appears from s_{n+1}. In other words, the
contribution of a particular score to an unweighted mean remains constant until n
more scores have been recorded, and then suddenly vanishes. On the other hand,
if we use a weighted mean with weights 1:2:...:n, the effect of a new score
s_{n+1} is to replace the previous estimate
2(s_1+2s_2+...+n s_n)/(n(n+1))
by a new estimate
2(s_2+2s_3+...+n s_{n+1})/(n(n+1));
not only does the contribution from s_1 vanish, but the contributions from
s_2,...,s_n are all decreased. This seems rather more satisfactory.
Nevertheless, anomalies may still arise. Let us go back to the scores in Figure
5.1, which yielded a mean of 69.7 using weights 1:2:3:4, and let us suppose that
an additional score of 70 is recorded. If we form a new estimate by discarding
the earliest score and applying the same weights 1:2:3:4 to the remainder, we
obtain 69.5, which is less than either the previous estimate or the additional
score. So we check our arithmetic, suspecting a mistake, but we find the value
indeed to be correct. Such an anomaly is always possible when the mean of the
previous scores differs from the mean of the contributions discarded. It is
rarely large, but it may be disconcerting to the inexperienced.
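The anomaly is easy to reproduce. The sketch below assumes a sliding window of
four scores with weights 1:2:3:4, as in the example; the name
windowed_estimate is an illustrative choice, not the book's.

```python
def windowed_estimate(scores, weights=(1, 2, 3, 4)):
    """Weighted mean of the most recent len(weights) scores (oldest first)."""
    recent = scores[-len(weights):]
    if len(recent) < len(weights):
        raise ValueError("not enough scores for this window")
    return sum(w * s for w, s in zip(weights, recent)) / sum(weights)

history = [73, 71, 70, 68]
before = windowed_estimate(history)           # 69.7
after = windowed_estimate(history + [70])     # 69.5, from 71, 70, 68, 70
print(before, after)
# True: the new estimate is below both the old estimate and the new score.
print(after < min(before, 70))
```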
If we are to avoid anomalies of this kind, we must ensure that the updated
estimate always lies between the previous estimate and the additional score. This
is easily done; if E_n is the estimate after n scores s_1,...,s_n all we need is
to ensure that
E_{n+1} = w_n E_n + (1-w_n)s_{n+1}
where w_n is some number satisfying 0 < w_n < 1. [...] improved as a result.
This contravenes common sense, and suggests that we should confine our attention
to estimates which respond conformably to all constituent scores: a decrease in
any score should decrease the estimate, and an increase in any score should
increase it. But it turns out that such an estimate cannot lie outside the
bounds of the constituent scores, and this greatly reduces the scope for
estimation of trends. The proof is simple and elegant. Let S be the largest of
the constituent scores. If each score actually equals S, the estimate must equal
S also. If any score s does not equal S and the estimating procedure is
conformable, the replacement of S by s must cause a reduction in the estimate.
So a conformable estimate cannot exceed
the largest of the constituent scores; and similarly, it cannot be less than the
smallest of them.\fn{1}
\fn{1} It follows that economic estimates which attempt to project current trends
are in general not conformable; and while this is unlikely to be the
whole reason for their apparent unreliability, it is not an encouraging thought.
In practice, therefore, we have little choice. Given that common sense demands
conformable behaviour, we cannot use an estimating procedure which predicts a
future score outside the bounds of previous scores; we can merely give the
greatest weight to the most recent of them. If this is unwelcome news to
improving youngsters, it is likely to gratify old stagers who do not like being
reminded too forcibly of their declining prowess. In fact, the case which most
commonly causes difficulty is that of the player who has recently entered
top-class competition and whose first season's performance is appreciably below
the standard which he subsequently establishes; and the best way to handle this
case is not to use a clever formula to estimate the improvement, but to ignore
the first year's results when calculating subsequent estimates.
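The bounded update rule given earlier in this section,
E_{n+1} = w_n E_n + (1-w_n)s_{n+1} with 0 < w_n < 1, can be sketched in Python
as follows. The fixed weight 0.8 and the later scores are hypothetical values
chosen only for illustration.

```python
def update_estimate(estimate, new_score, w=0.8):
    """Return w*estimate + (1-w)*new_score, which requires 0 < w < 1."""
    if not 0 < w < 1:
        raise ValueError("w must lie strictly between 0 and 1")
    return w * estimate + (1 - w) * new_score

estimate = 69.7                  # previous estimate from Figure 5.1
for score in [70, 72, 66]:       # hypothetical later scores
    new = update_estimate(estimate, score)
    # The result always lies between the old estimate and the new score.
    assert min(estimate, score) <= new <= max(estimate, score)
    estimate = new
print(round(estimate, 3))
```

Because each new estimate is a weighted mean of the old estimate and one
score, it can never fall outside the range of the scores seen so far, which is
the conformable behaviour described above.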
Interactive games
-----------------
We now turn to games in which the result is recorded only as a win for a
particular player, or perhaps as a draw. These games present a much more
difficult problem. The procedure usually adopted is to assume that the
performance of a player can be represented by a single number, called his
grade or rating, and to calculate this grade so as to reflect
his actual results. For anything other than a trivial game, the assumption is a
gross over-simplification, so anomalies are almost inevitable and controversy
must be expected. In the case of chess, which is the game for which grading has
been most widely adopted, a certain amount of controversy has indeed arisen; some
players and commentators appear to regard grades with excessive reverence, most
assume them to be tolerable approximations to the truth, a few question the
detailed basis of the calculations, and a few regard them as a complete waste of
ink. The resolution of such controversy is beyond the scope of this book, but at
least we can illuminate the issues.
The basic computational procedure is to assume that the mean expected result of a
game between two players is given by an 'expectation function' which depends only
on their grades a and b, and then to calculate these grades so as to reflect the
actual results. It might seem that the accuracy of the expectation function is
crucial, but we shall see in due course that it is actually among the least of
our worries; provided that the function is reasonably sensible, the errors
introduced by its inaccuracy are likely to be small compared with those resulting
from other sources. In particular, if the game offers no advantage to either
player, it may be sufficient to calculate the grading difference d=a-b and to use
a simple smooth function f(d) such as that shown in Figure 5.3. For a game such
as chess, the function should be offset to allow for the first player's
advantage, but this is a detail easily accommodated.\fn{2}
\fn{2} Figure 5.3 adopts the chess player's scaling of results: 1 for a win, 0
for a loss, and 0.5 for a draw. The scaling of the d-axis is arbitrary.
Figure 5.3 A typical expectation function
[an S-shaped curve of f(d) against d, from (-100, near 0) through (0, 0.5) to
(100, near 1.0); the d-axis is marked at -100, -50, 0, 50, and 100]
Once the function f(d) has been chosen, the calculation of grades is
straightforward. Suppose for a moment that two players already have grades which
differ by d, and that they now play another game, the player with the higher
grade winning. Before the game, we assessed his expectation as f(d); after the
game, we might reasonably assess it as a weighted mean of the previous
expectation and the new result. Since a win has value 1, this suggests that his
new expectation should be given by a formula such as
w + (1-w)f(d)
where w is a weighting factor, and this is equivalent to
f(d) + w(1-f(d)).
More generally, if the stronger player achieves a result of value r, the same
argument suggests that his new expectation should be given by the formula
f(d) + w(r-f(d)).
Now if the expectation function is scaled as in Figure 5.3 and the grading
difference is small, we see that a change of \delta in d produces a change of
approximately \delta/100 in f(d). It follows that approximately the required
change in expectation can be obtained by increasing the grading difference by
100w(r-f(d)). As the grading difference becomes larger, the curve flattens, and
a given change in the grading difference produces a smaller change in the
expectation. In principle, this can be accomplished by increasing the scaling
factor 100, but it is probably better to keep this factor constant, since always
to make the same change in the expectation may demand excessive changes in the
grades. The worst case occurs when a player unexpectedly fails to beat a much
weaker opponent; the change in grading difference needed to reduce an expectation
of 0.99 to 0.985 may be great indeed. To look at the matter another way, keeping
the scaling factor constant amounts to giving reduced weight to games between
opponents of widely differing ability, which is plainly reasonable since the ease
with which a player beats a much weaker opponent does not necessarily say a great
deal about his ability against his approximate peers.
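A minimal Python sketch of this update follows. Two things in it are
assumptions rather than anything the text prescribes: the S-shaped function is
taken to be a normal distribution function scaled to have slope 1/100 at d = 0,
and the change in the grading difference is split equally between the two
players. The names expectation and update_grades are illustrative.

```python
import math

def expectation(d, scale=100.0):
    """Assumed S-shaped f(d): normal CDF with slope 1/scale at d = 0."""
    z = d * math.sqrt(2 * math.pi) / scale
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def update_grades(a, b, result, w=0.1, scale=100.0):
    """Adjust grades a and b after a game in which A scored `result`
    (1 win, 0.5 draw, 0 loss).  The grading difference changes by
    scale * w * (result - f(a - b)); splitting that change equally
    between the two players is an assumption made here."""
    shift = scale * w * (result - expectation(a - b, scale))
    return a + shift / 2, b - shift / 2

print(update_grades(60.0, 40.0, 1))    # expected win: small change
print(update_grades(60.0, 40.0, 0))    # upset loss: larger change
```

Keeping scale constant, as the text recommends, means that a surprise result
between badly mismatched players moves the grades by at most scale * w in
total.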
A simple modification of this procedure can be used to assign a grade to a
previously ungraded player. Once he has played a reasonable number of games, he
can be assigned that grade which would be left unchanged if adjusted according to
his actual results. The same technique can also be used if it is desired to ignore
ancient history and grade a player only on the basis of recent games.
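One way to find "that grade which would be left unchanged" is to search for the
grade at which a player's total result equals his total expectation. The sketch
below does this by bisection. The expectation function (a scaled normal
distribution function), the search bounds, and the sample games are all
assumptions made for illustration.

```python
import math

def expectation(d, scale=100.0):
    """Assumed expectation function: normal CDF with slope 1/scale at 0."""
    z = d * math.sqrt(2 * math.pi) / scale
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def stable_grade(games, lo=-1000.0, hi=1000.0, tol=1e-6):
    """Find the grade g at which total result equals total expectation.

    `games` is a list of (opponent_grade, result) pairs, result in
    {0, 0.5, 1}.  Bisection works because the surplus decreases as g
    grows.  A player with only wins or only losses has no finite
    solution, so the answer then just runs to a search bound."""
    def surplus(g):
        return sum(r - expectation(g - opp) for opp, r in games)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if surplus(mid) > 0:
            lo = mid      # results beat expectation: grade is too low
        else:
            hi = mid
    return (lo + hi) / 2

games = [(40.0, 1), (55.0, 0.5), (70.0, 0), (50.0, 1)]   # hypothetical
print(round(stable_grade(games), 2))
```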
Grades calculated on this basis can be expected to provide at least a rough
overall measure of each regular player's performance. However, certain practical
matters must be decided by the grading administrator, and these may have a
perceptible effect on the figures. Examples are the interval at which grades are
updated, the value of the weighting parameter w, the relative division of an
update between grades of the players (in particular, when one player is well
established whereas the other is a relative newcomer), the criteria by which less
than fully competitive games are excluded, and the circumstances in which a
player's grade is recalculated to take account only of his most recent games.
Grades are therefore not quite the objective measures that their more uncritical
admirers like to maintain.
Grades as measures of ability
-----------------------------
Although grading practitioners usually stress that their grades are merely
measures of performance, players are interested in them primarily as
measures of ability. A grading system defines an expectation between
every pair of graded players, and the grades are of interest only in so far as
these expectations correspond to reality.
A little thought suggests that this correspondence is unlikely to be exact. If
two players A and B have the same grade, their expectations against any third
player C are asserted to be exactly equal. Alternatively, suppose that A, B, Y,
and Z have grades such that A's expectation against B is asserted to equal Y's
against Z, and that expectations are calculated using a function which depends
only on the grading difference. If these grades are a, b, y, and z, then they
must satisfy a-b = y-z, from which it follows that a-y = b-z, and hence A's
expectation against Y is asserted to equal B's against Z. Assertions as precise
as this are unlikely to be true for other than very simple games, and it follows
that grades cannot be expected to yield exact expectations; the most for which we
can hope is that they form a reasonable average measure whose deficiencies are
small compared with the effects of chance fluctuation.
These chance effects can easily be estimated. If A's expectation against B is p
and there is a probability h that they draw, the standard deviation s of a single
result is \sqrt{p(1-p) - h/4}. If they now play a sufficiently long series of
n games, the distribution of the discrepancy between mean result and expectation
can be taken as a normal distribution with standard deviation s/\sqrt n, and a
simple rule of thumb gives the approximate probability that any particular
discrepancy would have arisen by chance: a discrepancy exceeding the standard
deviation can be expected on about one trial in three, and a discrepancy
exceeding twice the standard deviation on about one trial in twenty. What
constitutes a sufficiently large value of n depends on the expectation p. If p
lies between 0.4 and 0.6, n should be at least 10; if p is smaller than 0.4 or
greater than 0.6, n should be at least 4/p or 4/(1-p) respectively. More
detailed calculations, taking into account the incidence of each specific
combination of results, are obviously possible, but they are unlikely to be
worthwhile.
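These formulas are simple enough to sketch in Python. The function names and
the sample values of p, h, and n are hypothetical; the rounding inside
minimum_games only guards against floating-point noise in 4/p.

```python
import math

def single_result_sd(p, h):
    """Standard deviation of one result: expectation p, draw probability h."""
    return math.sqrt(p * (1 - p) - h / 4)

def minimum_games(p):
    """Smallest series length for which the text's rule of thumb applies."""
    if 0.4 <= p <= 0.6:
        return 10
    q = p if p < 0.4 else 1 - p
    return math.ceil(round(4 / q, 9))

p, h, n = 0.6, 0.3, 25                  # hypothetical pairing and series
s = single_result_sd(p, h)
# Prints s, the standard deviation of the mean of n games, and the minimum n.
print(round(s, 3), round(s / math.sqrt(n), 3), minimum_games(p))
```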
A practicable testing procedure now suggests itself. Every time a new set of
grades is calculated, the results used to calculate the new grades can be used
also to test the old ones. If two particular opponents play each other
sufficiently often, their results provide a particularly convenient test;
otherwise, results must be grouped, though this must be done with care since the
grouping of inhomogeneous results may lead to wrong conclusions. The mean of the
new results can be compared with the expectation predicted by the previous
grades, and large discrepancies can be highlighted: one star if the discrepancy
exceeds the standard deviation, and two if it exceeds twice the standard
deviation. The rule of thumb above gives the approximate frequency with which
stars are to be expected if chance fluctuations are the sole source of error.
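The starring procedure can be sketched as below. The function name stars is
illustrative, and the draw probability is assumed to be supplied from
elsewhere (for instance from the observed frequency of draws), since the text
does not say how it should be estimated.

```python
import math

def stars(expected, draw_prob, results):
    """Return 0, 1 or 2 stars for a series of results (each 1, 0.5 or 0)
    against an expectation predicted by the *previous* grades."""
    n = len(results)
    sd_single = math.sqrt(expected * (1 - expected) - draw_prob / 4)
    sd_mean = sd_single / math.sqrt(n)
    discrepancy = abs(sum(results) / n - expected)
    if discrepancy > 2 * sd_mean:
        return 2
    return 1 if discrepancy > sd_mean else 0

# Hypothetical series: 16 games, old grades predicted 0.5, draws 25% likely.
results = [1] * 11 + [0.5] * 2 + [0] * 3
print(stars(0.5, 0.25, results))
```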
In practice, of course, chance fluctuations are not the only source of error.
Players improve when they are young, they decline as they approach old age, and
they sometimes suffer temporary loss of form due to illness or domestic
disturbance. The interpretation of stars therefore demands common sense.
Nevertheless, if the proportions of stars and double stars greatly exceed those
attributable to chance fluctuation, the usefulness of the grades is clearly
limited.
If grades do indeed constitute acceptable measures of ability, regular testing
such as this should satisfy all but the most extreme and blinkered of critics.
However, grading administrator and critic alike must always remember that
around one discrepancy in three should be starred, and around one in twenty
doubly starred, on account of chance fluctuations, even if there is no other
source of error. If a grading administrator performs a hundred tests without
finding any doubly starred discrepancies, he should not congratulate himself on
the success of his grading system; he should check the correctness of his
testing.
The self-fulfilling nature of grading systems
---------------------------------------------
We now come to one of the most interesting mathematical aspects of grading
systems: their self-fulfilling nature. It might seem that a satisfactory
expectation function must closely reflect the true nature of the game, but in
fact this is not so. Regarded as measures of ability, grades are subject to
errors from two sources: (i) discrepancies between ability and actual
performance, and (ii) errors in the calculated expectations due to the use of an
incorrect expectation function. In practice, the latter are likely to be much
smaller than the former.
Table 5.1 illustrates this. It relates to a very simple game in which each
player throws a single object at a target, scoring a win if he hits and his
opponent misses, and the game being drawn if both hit or if both miss. If the
probability that player j hits is p_j, the expectation of player j against player
k can be shown to be (1+p_j-p_k)/2, so we can calculate expectations exactly by
setting the grade of player j to 50p_j and using the expectation function f(d) =
0.5 + d/100. Now let us suppose that we have nine players whose probabilities
p_1,...,p_9 range linearly from 0.1 to 0.9, that they play each other with equal
frequency, and that we deliberately use the incorrect expectation function f(d) =
N(d\sqrt (2\pi)/100) where N(x) is the normal distribution function. The first
column of Table 5.1 shows the grades that are produced if the results of the
games agree strictly with expectation, and the entries for each pair of players
show (i) the discrepancy between the true and the calculated expectations, and
(ii) the standard deviation of a single result between the players. The latter
is always large compared with the former, which means that a large number of
games are needed before the discrepancy can be detected against the background of
chance fluctuation. The standard deviation of a mean result decreases only with
the inverse square root of the number of games played, so we can expect to
require well over a hundred sets of all-play-all results before even the worst
discrepancy (player 1 against player 9) can be diagnosed with confidence.
Table 5.1 Throwing one object: the effect of an incorrect expectation function
----------------------------------------------------------------------------
                                        Opponent
             ---------------------------------------------------------------
Player  Grade      1      2      3      4      5      6      7      8      9
----------------------------------------------------------------------------
  1       5.5      -  -.009  -.013  -.014  -.011  -.005   .004   .017   .032
                   -   .250   .274   .287   .292   .287   .274   .250   .212
  2      17.3   .009      -  -.006  -.009  -.009  -.007  -.002   .006   .017
                .250      -   .304   .316   .320   .316   .304   .283   .250
  3      28.5   .013   .006      -  -.004  -.006  -.007  -.005  -.002   .004
                .274   .304      -   .335   .339   .335   .324   .304   .274
  4      39.3   .014   .009   .004      -  -.003  -.006  -.007  -.007  -.005
                .287   .316   .335      -   .350   .346   .335   .316   .287
  5      50.0   .011   .009   .006   .003      -  -.003  -.006  -.009  -.011
                .292   .320   .339   .350      -   .350   .339   .320   .292
  6      60.7   .005   .007   .007   .006   .003      -  -.004  -.009  -.014
                .287   .316   .335   .346   .350      -   .335   .316   .287
  7      71.5  -.004   .002   .005   .007   .006   .004      -  -.006  -.013
                .274   .304   .324   .335   .339   .335      -   .304   .274
  8      82.7  -.017  -.006   .002   .007   .009   .009   .006      -  -.009
                .250   .283   .304   .316   .320   .316   .304      -   .250
  9      94.5  -.032  -.017  -.004   .005   .011   .014   .013   .009      -
                .212   .250   .274   .287   .292   .287   .274   .250      -
============================================================================
The grades are calculated using an incorrect expectation function as described in
the text. The tabular values show (i) the discrepancy between the calculated and
true expectations, and (ii) the standard deviation of a single result.
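The exact expectation and the standard deviation of a single result for this
throwing game can be checked with a short Python sketch. The function names
are illustrative; the figure 0.032 is the Table 5.1 discrepancy for player 1
against player 9, and the two-standard-deviation threshold in the last step is
an assumed criterion for "diagnosed with confidence".

```python
import math

def exact_expectation(pj, pk):
    """Expectation of player j against k: win 1, draw 0.5, loss 0."""
    win = pj * (1 - pk)                       # j hits, k misses
    draw = pj * pk + (1 - pj) * (1 - pk)      # both hit or both miss
    return win + 0.5 * draw                   # equals (1 + pj - pk) / 2

def single_result_sd(pj, pk):
    """Standard deviation of one game, sqrt(p(1-p) - h/4)."""
    p = exact_expectation(pj, pk)
    h = pj * pk + (1 - pj) * (1 - pk)
    return math.sqrt(p * (1 - p) - h / 4)

hit = [i / 10 for i in range(1, 10)]          # p_1, ..., p_9 = 0.1, ..., 0.9
sd_1_9 = single_result_sd(hit[0], hit[8])
print(round(exact_expectation(hit[0], hit[8]), 3), round(sd_1_9, 3))

# Rough number of games before a discrepancy of 0.032 reaches two
# standard deviations of the mean result.
games_needed = math.ceil((2 * sd_1_9 / 0.032) ** 2)
print(games_needed)
```

Under that assumed criterion the count comes out at somewhat over 170 games
between this one pair, in line with the text's "well over a hundred sets".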
Experiment bears this out. Table 5.2 records a computer simulation of a hundred
sets of all-play-all results, the four rows for each player showing (i) his true
expectation against each opponent, (ii) the mean of his actual results against
each opponent, (iii) his grade as calculated from these results using the correct
expectation function 0.5 + d/100, together with his expectation against each
opponent as calculated from their respective grades, and (iv) the same as
calculated using the incorrect expectation function N(d\sqrt(2\pi)/100). The
differences between rows (i) and (iii) are caused by the differences between the
theoretical expectations and the actual results, and the differences between rows
(iii) and (iv) are caused by the difference between the expectation functions.
In over half the cases, the former difference is greater than the latter, so on
this occasion even a hundred sets of all-play-all results have not sufficed to
betray the incorrect expectation function with reasonable certainty. Nor are the
differences between actual results and theoretical expectations in Table 5.2 in
any way abnormal. If the experiment were to be performed again, it is slightly
more likely than not that the results in row (ii) would differ from expectation
more widely than those which appear here.\fn{3}
\fn{3} In practice, of course, we do not know the true expectation function, so
rows (i) and (iii) are hidden from us, and all we can do is assess whether the
discrepancies between rows (ii) and (iv) might reasonably be attributable to
chance. Such a test is far from sensitive; for example, the discrepancies in
Table 5.2 are so close to the median value which can be expected from chance
fluctuations alone that nothing untoward can be discerned in them. We omit the
proof of this, because the analysis is not straightforward; the simple rules of
thumb which we used in the previous section cannot be applied, because we are now
looking at the spread of results around expectations to whose calculation
they themselves have contributed (whereas the rules apply to the spread of
results about independently calculated expectation) and we must take the
dependence into account. Techniques exist for doing this, but the details are
beyond the scope of this book.
Table 5.2 Throwing one object: grading systems compared
----------------------------------------------------------------------------
                                        Opponent
             ---------------------------------------------------------------
Player  Grade      1      2      3      4      5      6      7      8      9
----------------------------------------------------------------------------
  1                -   .450   .400   .350   .300   .250   .200   .150   .100
                   -   .455   .435   .350   .335   .230   .200   .150   .125
         11.8      -   .471   .400   .355   .314   .250   .197   .182   .110
          7.8      -   .466   .388   .342   .304   .247   .203   .191   .139
  2             .550      -   .450   .400   .350   .300   .250   .200   .150
                .545      -   .395   .395   .330   .290   .245   .210   .130
         17.6   .529      -   .429   .384   .344   .280   .226   .211   .139
         14.6   .534      -   .422   .374   .334   .275   .228   .215   .159
  3             .600   .550      -   .450   .400   .350   .300   .250   .200
                .565   .605      -   .450   .390   .380   .315   .285   .185
         31.7   .600   .570      -   .455   .414   .350   .297   .281   .209
         30.4   .612   .578      -   .451   .409   .344   .292   .277   .212
  4             .650   .600   .550      -   .450   .400   .350   .300   .250
                .650   .605   .550      -   .435   .430   .365   .310   .240
         40.8   .645   .616   .546      -   .459   .395   .343   .327   .254
         40.2   .658   .626   .549      -   .457   .390   .336   .320   .249
  5             .700   .650   .600   .550      -   .450   .400   .350   .300
                .665   .670   .610   .565      -   .370   .395   .370   .305
         48.9   .685   .657   .586   .540      -   .436   .383   .368   .295
         48.8   .696   .666   .591   .543      -   .432   .376   .360   .284
  6             .750   .700   .650   .600   .550      -   .450   .400   .350
                .770   .710   .620   .570   .630      -   .395   .435   .395
         61.7   .750   .721   .650   .604   .564      -   .447   .432   .359
         62.4   .753   .725   .656   .610   .568      -   .442   .425   .345
  7             .800   .750   .700   .650   .600   .550      -   .450   .400
                .800   .755   .685   .635   .605   .605      -   .520   .400
         72.3   .803   .773   .703   .657   .617   .553      -   .484   .412
         74.0   .797   .772   .708   .664   .624   .558      -   .483   .400
  8             .850   .800   .750   .700   .650   .600   .550      -   .450
                .850   .790   .715   .690   .630   .565   .480      -   .425
         75.4   .818   .789   .718   .673   .633   .569   .516      -   .427
         77.5   .809   .785   .723   .680   .640   .575   .517      -   .417
  9             .900   .850   .800   .750   .700   .650   .600   .550      -
                .875   .870   .815   .760   .695   .605   .600   .575      -
         89.9   .891   .861   .791   .745   .705   .641   .588   .573      -
         94.3   .861   .841   .788   .751   .716   .655   .600   .583      -
============================================================================
For each player, the four rows show (i) the true expectation against each
opponent; (ii) the average result of a hundred games against each opponent,
simulated by computer; (iii) the grade calculated from the simulated games, using
the correct expectation function, and the resulting expectations against each
opponent; and (iv) the same using an incorrect expectation function as described
in the text.
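A simulation in the spirit of row (ii) can be sketched in a few lines of
Python. This is not the program behind Table 5.2: the seed, the pair of
players, and the series lengths are arbitrary choices, and no grades are
fitted here.

```python
import random

def play(pj, pk, rng):
    """One game of the throwing contest, scored from player j's side."""
    j_hits = rng.random() < pj
    k_hits = rng.random() < pk
    if j_hits == k_hits:
        return 0.5                 # both hit or both miss: drawn
    return 1.0 if j_hits else 0.0

def simulate_mean(pj, pk, games, rng):
    """Mean result for player j over a series of independent games."""
    return sum(play(pj, pk, rng) for _ in range(games)) / games

rng = random.Random(1989)          # arbitrary seed, for repeatability only
for games in (100, 10000):
    mean = simulate_mean(0.5, 0.6, games, rng)
    print(games, round(mean, 3), "true expectation 0.45")
```

With only a hundred games the simulated mean can easily miss the true
expectation by a few hundredths, which is the scale of the differences between
rows (i) and (ii) of the table.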
This is excellent news for grading secretaries, since it suggests that any
reasonable expectation function can be used; the spacing of grades may differ
from that which a correct expectation function would have generated, but the
expectations will be adjusted in approximate compensation, and any residual
errors will be small compared with the effects of chance fluctuation on the
actual results. But there is an obvious corollary: the apparently
successful calculation of expectations by a grading system throws no real light
on the underlying nature of the game. Chess grades are currently calculated
using a system, due to A. E. Elo, in which expectations are calculated by the
normal distribution function, and the general acceptance of this system by chess
players has fostered the belief that the normal distribution provides the most
appropriate expectation for chess. In fact it is by no means obvious that this
is so. The normal distribution function is not a magic formula of universal
applicability; its validity as an estimator of unknown chance effects depends on
the Central Limit Theorem, which states that the sum of a large
number of independent samples from the same distribution can be
regarded as a sample from a normal distribution, and it can reasonably be adopted
as a model for the behavior only if the chance factors affecting the result are
equivalent to a large number of independent events which combine additively.
Chess may well not satisfy this condition, since many a game appears to be
decided not by an accumulation of small blunders but by a few large ones. But
while the question is of some theoretical interest, it hardly matters from the
viewpoint of practical grading. Chess gradings are of greatest interest at
master level, and the great majority of games at this level are played within an
expectation range of 0.3 to 0.7. Over this range, the normal distribution is
almost linear, but so is any simple alternative candidate, and so in all
probability is the unknown 'true' function which most closely approximates to the
actual behaviour of the game. In such circumstances, the errors resulting from
an incorrect choice of expectation function are likely to be even smaller than
those which appear in Table 5.1.
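The near-linearity claim can be checked numerically. The sketch below compares
the normal expectation function, scaled as N(d\sqrt(2\pi)/100), with the
straight line 0.5 + d/100 that shares its slope at d = 0. The step size and
function names are illustrative choices.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function N(x)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def f_normal(d):
    """Normal expectation function with slope 1/100 at d = 0."""
    return normal_cdf(d * math.sqrt(2 * math.pi) / 100)

def f_linear(d):
    """Straight line through (0, 0.5) with the same slope at d = 0."""
    return 0.5 + d / 100

# Largest gap between the two wherever the normal curve gives an
# expectation between 0.3 and 0.7 (d is stepped in tenths of a point).
gaps = [abs(f_normal(k / 10) - f_linear(k / 10))
        for k in range(-1000, 1001)
        if 0.3 <= f_normal(k / 10) <= 0.7]
print(round(max(gaps), 4))
```

Over that range the two functions differ by less than 0.01, which is small
next to the single-result standard deviations discussed earlier.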
The limitations of grading
--------------------------
Grades help tournament organizers to group players of approximately equal
strength, and they provide the appropriate authorities with a convenient basis
for the awarding of honorific titles such as 'master' and 'grandmaster'. However,
it is very easy to become drunk with figures, and it is appropriate that this
discussion should end with some cautionary remarks.
(a) Grades calculated from only a few results are unlikely to be reliable.
(b) The assumption underlying all grades is that a player's performance against
one opponent casts light on his expectation against another. If this assumption
is unjustified, no amount of mathematical sophistication will provide a remedy.
In particular, a grade calculated only from results against much weaker opponents
is unlikely to place a player accurately among his peers.
(c) There are circumstances in which grades are virtually meaningless. For an
artificial but instructive example, suppose that we have a set of players in
London and another in Moscow. If we try to calculate grades embracing both sets,
the placing of players within each set may well be determined, but the placing of
the sets as a whole will depend on the results of the few games between players
in different cities. Furthermore, these games are likely to have been between
the leading players in each city, and little can be inferred from them about the
relative abilities of more modest performers. Grading administrators are well
aware of these problems and refrain from publishing composite lists in such
circumstances, but players sometimes try to make inferences by combining lists
which administrators have been careful to keep separate.
(d) A grade is merely a general measure of a player's performance relative to
that of certain other players over a particular period. It is not an
absolute measure of anything at all. The average ability of a pool of
players is always changing, through study, practice, and ageing, but grading
provides no mechanism by which the average grade can be made to reflect these
changes; indeed, if the pool of players remains constant and every game causes
equal and opposite changes to the grades of the affected players, the average
grade never changes at all. What does change the average grade of a pool is the
arrival and departure of players, and if a player has a different grade when he
leaves than he received when he arrived then his sojourn will have disturbed the
average grade of the other players; but this change is merely an artificial
consequence of the grading calculations, and it does not represent any change in
average ability. It is of course open to a grading administrator to adjust the
average grade of his pool to conform to any overall change in ability which he
believes to have occurred, but the absence of an external standard of comparison
means that any such adjustment is conjectural.
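The point in (d) about a closed pool can be illustrated with a toy Python
sketch. The random changes below merely stand in for real grading updates;
the only property that matters is that each game moves two grades by equal and
opposite amounts. All names and numbers are hypothetical.

```python
import random

def play_round(grades, rng, step=2.0):
    """Pick two players at random and move their grades by equal and
    opposite amounts, as a zero-sum grading update would."""
    i, j = rng.sample(range(len(grades)), 2)
    change = rng.uniform(-step, step)    # stand-in for a real update
    grades[i] += change
    grades[j] -= change

rng = random.Random(0)
grades = [40.0, 50.0, 60.0, 70.0]        # hypothetical closed pool
start = sum(grades) / len(grades)
for _ in range(1000):
    play_round(grades, rng)
# The two averages agree apart from floating-point rounding.
print(start, round(sum(grades) / len(grades), 6))
```

However much the individual grades drift, the average stays where it started,
so it cannot register any real change in the pool's overall ability.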
It is this last limitation that is most frequently overlooked. Students of all
games like to imagine how players of different periods would have compared with
each other, and long-term grading has been hailed as providing an answer. This
is wishful thinking. Grades may appear to be pure numbers, but they are actually
measures relative to ill-defined and changing reference levels, and they cannot
answer questions about the relative abilities of players when the reference levels
are not the same. The absolute level represented by a particular [...] whether a
player's grade ten years before his peak can properly be compared with that ten
years after, and quite certain that his peak cannot be compared with somebody
else's peak in a different era altogether. Morphy in 1857-8 and Fischer in
1970-2 were outstanding among their chess contemporaries, and it is natural to
speculate how they would have fared against each other; but such speculations are
not answered by calculating grades through chains of intermediaries spanning over
a hundred years.\fn{4}
\fn{4} Chess enthusiasts may be surprised that the name of Elo has not figured
more prominently in this discussion, since the Elo rating system has been in use
internationally since 1970. However, Elo's work as described in his book
The rating of chessplayers, past and present (Batsford 1978) is
open to serious criticism. His statistical testing is unsatisfactory to the
point of being meaningless; he calculates standard deviations without allowing
for draws, he does not always appear to allow for the extent to which his tests
have contributed to the ratings which they purport to be testing, and he fails to
make the important distinction between proving a proposition true and merely
failing to prove it false. In particular, an analysis of 4795 games from
Milwaukee Open tournaments, which he represents as demonstrating the normal
distribution function to be the appropriate expectation function for chess, is
actually no more than an incorrect analysis of the variation within his data. He
also appears not to realize that changes in the overall strength of a pool cannot
be detected, and that his 'deflation control', which claims to stabilize the
implied reference level, is a delusion. Administrators of other sports (for
example tennis) currently publish only rankings. The limitations of those are
obvious, but at least they do not encourage illusory comparisons between today's
champions and those of the past.