A New View of Statistics 

Typical Error
The values of the change score or difference score for each subject yield the typical error. Simply divide the standard deviation of the difference score by √2. For example, if the difference scores are 5, -2, 6, 0, and -3, the standard deviation of these scores is 4.1, so the typical error is 4.1/√2 = 2.9. This method for calculating the typical error follows from the fact that the variance of the difference score (s^{2}_{diff}) is equal to the sum of the variances representing the typical error (s) in each trial: s^{2}_{diff} = s^{2} + s^{2}, so s = s_{diff}/√2.
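The calculation can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the difference scores are 5, -2, 6, 0 and -3, which reproduce the standard deviation of 4.1 quoted above.

```python
import math
import statistics

def typical_error_from_differences(diffs):
    """Typical error = SD of the difference scores divided by sqrt(2)."""
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 in the denominator)
    return sd_diff / math.sqrt(2)

# Difference scores (trial 2 minus trial 1), one per subject.
diffs = [5, -2, 6, 0, -3]
sd_diff = statistics.stdev(diffs)
te = typical_error_from_differences(diffs)
print(f"SD of differences = {sd_diff:.1f}, typical error = {te:.1f}")
```

The sample standard deviation (n - 1 denominator) is used here because that is the value that matches the worked example.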
To derive this within-subject variation as a coefficient of variation (CV), log-transform your variable, then do the same calculations as above. The CV is derived from the typical error (s) of the log-transformed variable via the following formula:
CV = 100(e^{s} - 1),
which simplifies to 100s for s < 0.05 (that is, CVs of less than 5%). You will also meet this formula on the page about log transformation, where I describe how to represent the standard deviation of a variable that needs log transformation to make it normally distributed. As I describe on that page, I find it easier to interpret the standard deviation and shifts in the mean if I make the log transformation 100× the (natural) log of the variable. That way the typical error and shifts in the mean are already approximately percents. To convert them to exact percents, the formula becomes 100(e^{s/100} - 1).
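A minimal Python sketch of the CV calculation follows. The function names and the body-mass values are hypothetical, and natural logs are assumed because the back-transformation uses e.

```python
import math
import statistics

def cv_percent(trial1, trial2):
    """Within-subject CV (%) from two trials, via the log-transformed variable."""
    # Difference of natural logs for each subject.
    log_diffs = [math.log(b) - math.log(a) for a, b in zip(trial1, trial2)]
    s = statistics.stdev(log_diffs) / math.sqrt(2)  # typical error on the log scale
    return 100 * (math.exp(s) - 1)

def exact_percent(s100):
    """Convert a value on the 100*ln scale (approximate percent) to an exact percent."""
    return 100 * (math.exp(s100 / 100) - 1)

# Hypothetical body masses (kg) for five subjects measured twice.
trial1 = [70.0, 82.5, 65.2, 90.1, 75.4]
trial2 = [71.2, 81.0, 66.0, 88.9, 76.5]
print(f"CV = {cv_percent(trial1, trial2):.2f}%")

# The approximation CV = 100s is close for small s and drifts for larger s.
for s100 in (2, 5, 30):
    print(f"100s = {s100}  ->  exact {exact_percent(s100):.2f}%")
```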
We sometimes show the typical error with a ± sign in front of it, to indicate that a subject's observed value varies by typically ± the typical error whenever we measure it. For example, the typical error in a monthly measurement of body mass might be ±1.5 kg. When we express the typical error as a CV, we can also think of it as ±2.1% (if the subject weighed 70 kg), but strictly speaking it's more appropriate to show the variation as ×/÷1.021. In other words, from month to month the body mass is typically high by a factor of 1.021 or low by a factor of 1/1.021. These factors come from the assumption that the log-transformed weight rather than the weight itself is normally distributed. Now, ×1.021 is the same as 1 + 0.021, and 1/1.021 is almost exactly 1 - 0.021, so it's OK to show the CV as ±2.1%. But when the CV is bigger than 5% or so, the use of the minus sign gets more inaccurate. For example, if the CV is 35%, the value of the variable varies typically by a factor of 1.35 to 1/1.35, or 1.35 to 0.74, or 1 + 0.35 to 1 - 0.26, which is certainly not the same as 1 + 35% to 1 - 35%. You can still write ±35%, but be aware that the implied typical variation in the observed value is ×/÷1.35.
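The asymmetry of the ×/÷ factor is easy to check numerically. The sketch below (the function name is hypothetical) converts a CV into its factor and the implied upward and downward percent variation.

```python
def cv_to_factor(cv_percent):
    """Return the x/÷ factor and the implied upward and downward percent variation."""
    factor = 1 + cv_percent / 100       # e.g. 35% -> 1.35
    upper = 100 * (factor - 1)          # typical variation upward, percent
    lower = 100 * (1 / factor - 1)      # typical variation downward, percent (negative)
    return factor, upper, lower

for cv in (2.1, 35):
    factor, upper, lower = cv_to_factor(cv)
    print(f"CV {cv}%: x/÷{factor:.3f}, i.e. +{upper:.1f}% to {lower:.1f}%")
```

For a CV of 2.1% the two percents are nearly equal in size; for 35% the downward variation is only about 26%.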
Changes in the Mean
A simple way to get these is to do paired t tests between the pairs of trials. Do it on the log-transformed variable and you'll get approximate percent changes in the mean between trials. Use the same formulae as for the CV to turn these into exact percent changes.
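As a rough sketch of the percent change in the mean, assuming the 100× natural-log transformation, the mean of the log differences is the approximate percent change and the back-transformation gives the exact one. The function name and data are hypothetical, and confidence limits (which need a t value) are left out.

```python
import math
import statistics

def percent_change_in_mean(trial1, trial2):
    """Approximate and exact percent change in the mean between two trials."""
    # Change score for each subject on the 100*ln scale.
    log_diffs = [100 * math.log(b / a) for a, b in zip(trial1, trial2)]
    approx = statistics.mean(log_diffs)          # approximate percent change
    exact = 100 * (math.exp(approx / 100) - 1)   # exact percent change
    return approx, exact

# Hypothetical data: every subject improves by exactly 10% in trial 2.
trial1 = [50.0, 62.0, 71.5, 80.0]
trial2 = [v * 1.10 for v in trial1]
approx, exact = percent_change_in_mean(trial1, trial2)
print(f"approximate {approx:.2f}%, exact {exact:.2f}%")
```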
Retest Correlation
A simple Pearson correlation is near enough. If the variable is closer to normally distributed after log transformation, you should use the correlation derived from the log-transformed variable. Alternatively, calculate the intraclass correlation coefficient from the formula ICC = (SD^{2} - sd^{2})/SD^{2}, where SD is the between-subject standard deviation and sd is the within-subject standard deviation (the typical or standard error of measurement). These standard deviations can come from different subjects, if you want to estimate the retest correlation by combining the error from one study with the between-subject standard deviation of a different group. The spreadsheet for the ICC has this formula and confidence limits for the ICC.
Note that the above relationship allows you to calculate the typical error from a retest correlation, when you also know the between-subject standard deviation: sd = SD·√(1 - r). Strictly speaking the r should be the intraclass correlation, but there is so little difference between the Pearson and the ICC, even for as few as 10 subjects, that it doesn't matter.
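The two directions of this relationship can be sketched as a pair of small Python functions. The names and the example values (SD = 10, sd = 3) are hypothetical.

```python
import math

def icc_from_sds(between_sd, within_sd):
    """ICC = (SD^2 - sd^2) / SD^2."""
    return (between_sd ** 2 - within_sd ** 2) / between_sd ** 2

def typical_error_from_r(between_sd, r):
    """sd = SD * sqrt(1 - r): typical error from a retest correlation."""
    return between_sd * math.sqrt(1 - r)

r = icc_from_sds(10.0, 3.0)
print(f"ICC = {r:.2f}")
print(f"typical error recovered = {typical_error_from_r(10.0, r):.1f}")
```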
There are more complicated procedures for getting the average reliability, using ANOVA or repeated-measures analyses. There is no spreadsheet for these procedures. I'll describe the usual approach, which is based on the assumption that there is a single random error of measurement that is the same for every subject for every trial. That is, whenever you take a measurement, a random number comes out of a hat and gets added to the true value. The numbers in the hat have a mean of zero, and their standard deviation is the error of measurement that you want to estimate. Or to put it another way, no matter which pairs of trials you select for analysis, either consecutive (e.g., 2+3) or otherwise (e.g., 1+4), you would expect to get the same error of measurement. This assumption may not be particularly realistic, if, for example, you did 5 trials each one week apart: the error of measurement between the first and last trial is likely to be greater than between trials closer together. If you estimate the error assuming it is the same, you will get something that is too large for trials close together and too small for trials further apart.
To understand this section properly, read the pages on statistical modeling. In a reliability study or analysis, you are asking this question: how well does the identity of a subject predict the value of the dependent variable, when you take into account any shift in the mean between tests? (If the variable is reliable, the value of the variable is predicted well from subject to subject. If the variable is unreliable, it isn't much help to know who the subject is.) So the model is simply:
dependent variable <= subject test
In other words, it's a two-way analysis of variance (ANOVA) of your variable with subject and test as the two effects. Do NOT include the interaction term in the model! The analysis is not done as a repeated-measures ANOVA, because the subject term is included in the model explicitly. Experts with the Statistical Analysis System can use a repeated-measures approach with mixed modeling, as described below in modeling variances.
Typical Error
The root mean-square error (RMSE) in the ANOVA is a standard deviation that represents the within-subject variation from test to test, averaged over all subjects. If your stats package doesn't provide confidence limits for it, use the spreadsheet for confidence limits.
If you use a one-way ANOVA in which the only effect is subject, the RMSE will be contaminated by any change in the mean between trials. (In a two-way ANOVA, the test effect takes out any change in the mean.) The resulting RMSE represents the total error of measurement. You can also derive the total error by calculating each subject's standard deviation, squaring them, averaging them over all subjects, then taking the square root. This procedure works for two trials, too. I don't recommend total error as a measure of reliability, because you don't know how much of the total error is due to change in the mean and how much is due to typical error.
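The difference between the two errors can be illustrated with a small Python sketch. It assumes complete data (no missing values) arranged as one row per subject and one column per trial; the function name and the data are hypothetical. The typical error is the RMSE of the two-way model without interaction, and the total error is the square root of the average within-subject variance.

```python
import math
import statistics

def reliability_errors(data):
    """Return (typical error, total error) for complete subject-by-trial data."""
    n, k = len(data), len(data[0])
    grand = statistics.mean(v for row in data for v in row)
    subj_means = [statistics.mean(row) for row in data]
    trial_means = [statistics.mean(row[j] for row in data) for j in range(k)]
    # Residuals from the two-way model: subject + trial, no interaction.
    ss_resid = sum(
        (data[i][j] - subj_means[i] - trial_means[j] + grand) ** 2
        for i in range(n) for j in range(k)
    )
    typical = math.sqrt(ss_resid / ((n - 1) * (k - 1)))  # two-way RMSE
    # Square each subject's SD, average over subjects, take the square root.
    total = math.sqrt(statistics.mean(statistics.variance(row) for row in data))
    return typical, total

# Hypothetical data: 5 subjects, 2 trials.
data = [[10, 15], [12, 10], [14, 20], [16, 16], [18, 15]]
# Same data with a 10-unit shift in the mean added to trial 2.
data_shifted = [[a, b + 10] for a, b in data]

for label, d in (("original", data), ("shifted", data_shifted)):
    typical, total = reliability_errors(d)
    print(f"{label}: typical error {typical:.2f}, total error {total:.2f}")
```

Shifting every subject's second trial by the same amount leaves the typical error unchanged but inflates the total error, which is the contamination described above.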
Changes in the Mean
Your stats program should be able
to give you confidence limits or p values for each consecutive
pairwise comparison of means. If it gives you only the p values,
convert these to confidence limits using the spreadsheet
for confidence limits.
Shifts in the mean and typical error as percents are derived from analysis of the log-transformed variable. See the previous section for the formula.
Retest Correlation
Scrutinize the output from the ANOVA and find
something called the F value for the subject term. The retest correlation, calculated
as an intraclass correlation coefficient (ICC), is derived from this F value:
ICC = (F - 1)/(F + k - 1),
where k = (number of observations - number of tests)/(number of subjects - 1). In the case of no missing values, number of observations = (number of tests)·(number of subjects), so k is simply the number of tests. For example, a reliability study of gymnastic skill consisted of 3 tests on 10 subjects. There were 28 observations instead of 30, because two athletes missed a test each, so k = (28 - 3)/(10 - 1) = 2.78. The F ratio for subjects was 56. Reliability was therefore (56 - 1)/(56 + 2.78 - 1) = 0.95.
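The gymnastics example can be reproduced with a short Python sketch; the function name is hypothetical and the numbers are those in the example (F = 56, 28 observations, 3 tests, 10 subjects).

```python
def icc_from_f(f_subject, n_obs, n_tests, n_subjects):
    """ICC = (F - 1)/(F + k - 1), with k = (observations - tests)/(subjects - 1)."""
    k = (n_obs - n_tests) / (n_subjects - 1)
    icc = (f_subject - 1) / (f_subject + k - 1)
    return icc, k

icc, k = icc_from_f(56, n_obs=28, n_tests=3, n_subjects=10)
print(f"k = {k:.2f}, ICC = {icc:.2f}")
```

With no missing values (30 observations) the same function returns k = 3, the number of tests.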
I used to have this formula in the spreadsheet for confidence limits, then I removed it for many years, thinking that people don't need it. Recently (2009) I've started expressing predictability of competitive athletic performance as an ICC, and I found I do need it and related formulae. So they're back, in their own spreadsheet for the ICC.
The ICC formula came from Bartko (1966), although he used sums of squares rather than F values. His formula for k when there are missing values is complex and appears not to be the same as the one I have given above. The random statement in Proc Glm of the Statistical Analysis System generates k, and I have found by trial and error that my formula gives the exact value.
Your stats program will give you p values for the subject term and the test term. The p value for subject is not much use. It tells you whether the ICC is statistically significantly different from zero, but that's usually irrelevant. The ICC is usually 0.7 to 0.9 or more, so there's no way it could be zero. More important are the confidence limits for the ICC and for the typical error. The p value for test addresses the issue of overall differences between the means of the tests, but with more than two tests you should pay more attention to the significance of consecutive pairwise differences (to see where any learning effects fade out). I'd prefer you to show the confidence intervals for the differences, rather than the p values. If your stats program doesn't give confidence intervals, use the spreadsheet for confidence limits for the typical error, and the spreadsheet for the ICC for confidence limits for the ICC. By the way, stats programs don't provide a p value for the typical error, because there's no way it can be zero.
The typical error or root mean square error (RMSE) from one group of subjects can be combined with the between-subject standard deviation (SD) of a second group to give the reliability correlation for the second group. This approach is handy if you do repeated testing on only a few subjects to get the within-subject variation, but you want to see how that translates into a reliability correlation when you combine it with the SD from single tests on a lot more subjects. You simply assume that the within-subject variation is the same for both groups, then apply the formula that defines the reliability correlation:
ICC = (SD^{2} - typical error^{2})/SD^{2}.
(This formula can be derived simply enough from the definition of correlation as the covariance of two variables divided by the product of their standard deviations.) The spreadsheet for the ICC deals with this scenario, too.
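One way to sketch that derivation is shown below. It assumes each observed value x_j in trial j is a true value T plus an independent random error e_j; the symbols T, e_j and SD_T are introduced here only for illustration. SD is the observed between-subject standard deviation in a single trial and sd is the typical error.

```latex
\begin{align*}
x_j &= T + e_j, \qquad \operatorname{var}(T) = SD_T^{2}, \quad \operatorname{var}(e_j) = sd^{2} \\
SD^{2} &= \operatorname{var}(x_j) = SD_T^{2} + sd^{2} \\
\operatorname{cov}(x_1, x_2) &= \operatorname{var}(T) = SD^{2} - sd^{2} \\
r &= \frac{\operatorname{cov}(x_1, x_2)}{SD \cdot SD} = \frac{SD^{2} - sd^{2}}{SD^{2}}
\end{align*}
```

The covariance reduces to the variance of the true values because the errors in the two trials are assumed independent of each other and of T.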
For non-normal variables, your analyses in the main study are likely to be non-parametric. So it makes sense to derive a non-parametric reliability. Just do the ANOVA on the rank-transformed variable. The within-subject variation is hard to interpret, though.
Attention sport psychologists: if the repeated "tests" are simply the items of an inventory, the alpha reliability of the items (i.e., the consistency of the mean of the items) is (F - 1)/F.
For nominal variables (variables with categories as values rather than numbers), the equivalent of the ICC is the kappa coefficient. Your stats program should offer this option in the output for the procedure that does chi-squared tests or contingency tables.
Depending on the sample and the variable, you should also analyze the typical errors for subgroups differing in sex, athletic status, age group, and so on. You sometimes find that any differences in reliability between such groups arise mainly from differences in the magnitude of the variable; for example, if log transformation removes any non-uniformity of error related to the magnitude of the variable, you will probably find that the subgroups for sex, age or whatever now have the same percent typical errors.
A more statistical approach to checking for differences in the typical error between subjects is to look at the scatter of points in the plot of the two trials. The scatter at right angles to the line of identity should be the same wherever you are on the line (and for whatever subgroups). If there is more scatter at one end, the subjects at that end have a bigger typical error. It's often difficult to tell whether the scatter is uniform on the plot, especially when reliability is high, because the points are all too close to the line. An easier way is to plot the change score against the average of the two trials for each subject. I have provided such a plot on the spreadsheet. (It's not obvious even on this plot that the subjects with bigger skinfolds have more variability. Again, more subjects are needed.) I've also provided a complete analysis for the log-transformed variable. A uniform scatter of the change scores after log transformation implies that the coefficient of variation (CV, or percent typical error) is the same for all subjects, and the analysis of the log-transformed variable provides the best estimate. Look at the plots of the difference scores and you will see that the scatter is perhaps a little more uniform after log transformation. When I analyzed the bottom and top halves of the log-transformed variable, I got CVs of 1.1% and 2.0%. These CVs are a little closer together than their corresponding raw typical errors, so it would be better to represent the mean typical error for the full sample as 1.7% rather than 0.83 mm. But really, you need more subjects...
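A minimal Python sketch of this check follows. It produces the (average, change score) pairs you would plot, then compares the typical error in the bottom and top halves of the sample. The function names and the data are hypothetical and are not the skinfold data on the spreadsheet.

```python
import math
import statistics

def change_vs_average(trial1, trial2):
    """Pairs of (average of the two trials, change score) for each subject."""
    return [((a + b) / 2, b - a) for a, b in zip(trial1, trial2)]

def typical_error_by_half(trial1, trial2):
    """Typical error for subjects in the bottom and top halves of the averages."""
    points = sorted(change_vs_average(trial1, trial2))  # sort by average
    half = len(points) // 2

    def typical_error(diffs):
        return statistics.stdev(diffs) / math.sqrt(2)

    bottom = typical_error([d for _, d in points[:half]])
    top = typical_error([d for _, d in points[half:]])
    return bottom, top

# Hypothetical data in which the subjects with bigger values vary more.
trial1 = [10, 11, 12, 13, 20, 21, 22, 23]
trial2 = [10.1, 10.9, 12.1, 12.9, 21, 20, 23, 22]
bottom, top = typical_error_by_half(trial1, trial2)
print(f"typical error: bottom half {bottom:.2f}, top half {top:.2f}")
```

Running the same functions on the log-transformed values gives the corresponding comparison on the percent scale.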
When you analyze three or more trials using ANOVA or repeated measures, the equivalent of the difference scores is the residuals in the analysis, and the equivalent of the average of the two trials is the predicted values. The standard deviation of the residuals is the typical error, so if the residuals are bigger for some subjects (some predicteds), the typical error is bigger for those subjects. Try to coax your stats program into producing a plot of the residuals vs the predicteds. Click for more information about residuals and predicteds, and about bad residuals (heteroscedasticity).
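If your stats program won't cooperate, the residuals and predicteds for the simple two-way model (subject plus test, no interaction) can be computed directly when there are no missing values. The sketch below is only an illustration, with a hypothetical function name and hypothetical data; the resulting pairs can be passed to any plotting tool.

```python
import statistics

def residuals_and_predicteds(data):
    """data: one row per subject, one column per trial, no missing values."""
    n, k = len(data), len(data[0])
    grand = statistics.mean(v for row in data for v in row)
    subj_means = [statistics.mean(row) for row in data]
    trial_means = [statistics.mean(row[j] for row in data) for j in range(k)]
    pairs = []
    for i in range(n):
        for j in range(k):
            # Predicted value from the subject and trial effects.
            predicted = subj_means[i] + trial_means[j] - grand
            pairs.append((predicted, data[i][j] - predicted))
    return pairs

# Hypothetical data: 4 subjects, 3 trials.
data = [[10.0, 10.4, 10.2], [12.1, 12.0, 12.5],
        [15.3, 16.0, 15.1], [20.2, 18.9, 21.0]]
pairs = residuals_and_predicteds(data)
for predicted, residual in sorted(pairs):
    print(f"predicted {predicted:6.2f}  residual {residual:+.2f}")
```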
The Pearson correlation coefficient is also a biased estimate of retest correlation: it is biased high for small sample sizes. For example, with only two subjects you always get a correlation of 1! For samples of 15 or more subjects, the ICC and the Pearson do not usually differ in the first two decimal places.
I used to think that limits of agreement were biased high for small samples, because I thought they were defined as the 95% confidence limits for a subject's change between trials. (The formula for confidence limits includes the t statistic, which is affected by sample size in such a way that the limits defined in this way would be biased high for small samples.) But apparently Bland and Altman, the progenitors of limits of agreement, did not define limits of agreement as 95% confidence limits; instead they defined them as a "reference range", generated by multiplying the typical error by 2.77, regardless of the size of the sample that is used to estimate the typical error. In other words, the limits of agreement represent 95% confidence limits for a subject's true change only if the typical error is derived from a large sample. With this definition, the limits of agreement are only as biased as the typical error.
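The factor of 2.77 is 1.96 × √2, which a short Python sketch makes explicit. The function name is hypothetical, and the limits are centered on zero here on the assumption that there is no change in the mean between trials.

```python
import math

def limits_of_agreement(typical_error):
    """Reference range for a subject's change score: +/- 1.96 * sqrt(2) * typical error."""
    half_width = 1.96 * math.sqrt(2) * typical_error
    return -half_width, half_width

print(f"multiplier = {1.96 * math.sqrt(2):.2f}")
low, high = limits_of_agreement(1.5)  # e.g. a typical error of 1.5 kg
print(f"limits of agreement: {low:.1f} to {high:.1f} kg")
```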
Surprisingly, even the typical error is biased! Yes, the square of the typical error (a variance) is unbiased, so the square root of the variance must be biased low for small samples. In practical terms, typical errors derived from samples of, say, 10 subjects tested twice will look a bit smaller on average than typical errors derived from hundreds of subjects or many retests. This bias in the typical error does not affect any statistical computations involving the typical error.
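The size of this bias can be sketched with the standard result for the expected value of a sample standard deviation, assuming normally distributed errors. The function name is hypothetical. With 10 subjects tested twice, the typical error has 9 degrees of freedom.

```python
import math

def sd_bias_factor(df):
    """Expected sample SD divided by the true SD, for normal errors with df degrees of freedom."""
    # lgamma avoids overflow of the gamma function for large df.
    log_ratio = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.sqrt(2 / df) * math.exp(log_ratio)

for df in (9, 29, 999):
    print(f"df = {df}: typical error is on average {100 * (1 - sd_bias_factor(df)):.1f}% low")
```

For 9 degrees of freedom the factor is about 0.97, so the typical error is on average about 3% too small.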
The spreadsheet now includes averages for the consecutive pairwise estimates of error, with confidence limits. This approach to combining more than two trials is probably more appropriate than the usual analysis of variance or repeated-measures analysis that I describe above (and which, in any case, I can't set up easily on a spreadsheet). I have also included averages of trial means and standard deviations, in case you want to report these as characteristics of your subjects.

See also the spreadsheet for the ICC, when you have between- and within-subject standard deviations and you want the ICC and its confidence limits, or you have the ICC and you want its confidence limits, or you have an F ratio from an ANOVA and you want the ICC and its confidence limits.