A New View of Statistics 

The important point in Point 1 is to make sure the conditions and subjects in the reliability study are similar to those in the intended experiment. In particular, the time between consecutive pairs of trials in the reliability study should be similar to the time between the pre and post tests in the experiment. For example, if you intend to look at the effects of a two-month nutritional intervention on body fat, the reliability of body fat measurements with two months between measurements will give you a more realistic idea of sample size than the higher reliability you are likely to see if only two hours separate the measurements. With two hours between measurements, the typical error is likely to arise only from technological error: error arising from the apparatus or the operator of the apparatus. With two months between measurements, some subjects will get fatter and some will get thinner, so the typical error will also include biological "error": true variation within the subjects. It's against this background of biological variation and technological error that you measure changes in body fat resulting from your intervention.
Researchers don't devote enough attention to Point 2. I go into this point in detail on several pages: a scale of magnitudes, and formulae for sample size. In summary, for most studies of health, injury, and fitness of normal folks, the smallest effect is 0.2 of the between-subject standard deviation. For studies of athletic performance, the smallest effect is 0.3 to 0.5 of the typical variation (standard deviation) that a top athlete displays from competition to competition.
If only all our measures were as good as those in Point 3! The "modest" sample size based on adequate confidence limits is ~8s²/d² for a crossover, or 4x as many when there is a control group, where s is the noise (typical error or within-subject standard deviation) and d is the signal (smallest worthwhile change). So, when the noise in the measure is negligible compared with the smallest effect (s << d), you can in theory do the experiment with one subject in a crossover and two in a controlled trial (one each in the treatment and control groups). But you should still use ~10 subjects, to be confident that the subjects in your study are representative of a wider population.
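As a quick sketch of the rule of thumb above, the arithmetic is trivial; the function names here are mine, not the author's:

```python
# Sketch of the rule of thumb: sample size ~ 8s^2/d^2 for a crossover,
# and 4x as many for a controlled trial, where s is the noise (typical
# error) and d is the signal (smallest worthwhile change).
import math

def crossover_n(s, d):
    """Approximate number of subjects for a crossover."""
    return math.ceil(8 * s**2 / d**2)

def controlled_trial_n(s, d):
    """Approximate total subjects for a controlled trial: 4x the crossover."""
    return 4 * crossover_n(s, d)

# When the noise equals the signal (s = d), crossover_n(1, 1) gives 8
# subjects and controlled_trial_n(1, 1) gives 32, matching the text.
```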
Most often the noise is greater than the smallest signal, as in Point 4. The noise comes either from technological error, or from random real changes in the subjects over the time frame of the study, or from individual responses to the treatment. Whatever the source of the noise, acceptable precision for the smallest effect demands either a large sample size (>>8 in a crossover; >>32 in a controlled trial) or several pre and post tests on each subject. Extra pre tests and post tests effectively reduce the noise in the measure, because you analyze the change between the average of the pre tests and the average of the post tests. It's the only option when your pool of subjects is limited.
It's certainly a good idea to do a reliability study before an experiment, either to estimate sample size or to make sure you've got your techniques right. But if you are reasonably confident about the techniques, I advocate getting stuck straight into the experiment. As I explain in sample size on the fly, if your treatment turns out to have a big effect, you needn't have done all the extra testing to get adequate precision.
When individual responses are present, subjects show a greater variability in the post-pre difference score. Analysis of the experimental group as a reliability study therefore yields an estimate of the typical error inflated by individual responses. Comparison of this inflated typical error with the typical error of the control group or with the typical error from a reliability study allows you to estimate the magnitude of the individual responses as a standard deviation (2.5% in the above example). If the experiment consists of a pretest, an intervention, and a posttest, the estimate is readily derived from basic statistical principles as root(2s_expt² − 2s²), where s_expt is the inflated typical error in the experimental group, and s is the typical error in the control group or in a reliability study. For example, if the typical error in the experimental group is 2%, and the typical error in the control group or in a reliability study is 1%, the standard deviation of the individual responses is 2.5% (= root6).
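A minimal sketch of that calculation, assuming the root(2s_expt² − 2s²) formula above (the function name is mine):

```python
import math

def individual_responses_sd(s_expt, s_control):
    """SD of individual responses: root(2*s_expt^2 - 2*s_control^2),
    where s_expt is the (inflated) typical error of the experimental
    group and s_control that of the control group or a reliability study."""
    return math.sqrt(2 * s_expt**2 - 2 * s_control**2)

# With typical errors of 2% (experimental) and 1% (control), this
# returns root6, about 2.45%.
```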
If you use the typical error from a reliability study to estimate the individual responses in your experiment, make sure the reliability study has a time frame and subjects similar to those in your experiment. And if your experiment is a crossover, there is no control group, so you have to use the typical error from a reliability study. Alternatively, use a complex crossover in which your subjects do several tests for each of the treatments.
You can also use mixed modeling to estimate individual responses. It's awfully complicated, but the extra effort is worth it, because you also get confidence limits for the estimate. When individual responses are present, the obvious next step is to identify the subject characteristics that predict the individual responses. The appropriate analysis is repeated-measures analysis of covariance, with the likely subject characteristics (e.g., age, sex, fitness, genotype) as covariates. Follow this link for more.
Typical Error for Assessing Individuals
When you wanted the sample
size for an experiment, it was important to use an estimate of reliability
from a reliability study with the same time between trials as in the experiment.
But for a single measurement or a change in a measurement on an individual,
you need an estimate of reliability with the minimum of biological variation.
The period between measurements in the reliability study therefore needs to
be brief. By brief I mean a period over which you wouldn't expect any
real change in the variable you are assessing. For retests of skinfolds, brief
could be an hour; anything longer and changes in the subject's posture or state
of hydration might affect the measurements. For retests of physical performance,
leave just enough time for all your subjects to recover from the fatigue of
the previous test.
If there is a systematic change in the mean in the reliability study, do you take that into account in your subsequent assessments? In general, no, because changes in the mean in the reliability study will usually be due to changes within the subjects.
In what follows, I often refer to the true value of a subject's measurement. By true I mean the value free of typical error, which is the value you would get if you took hundreds of measurements on the subject and averaged them. There might still be a systematic error in this "true" measurement, but you would need to do a validity study to sort that out. That kind of systematic error is less likely to be a problem when you are interested in a change or difference between measurements, because the error will tend to disappear when you subtract one measurement from another.
A Single Measurement
You measure a gymnast and find a sum of seven
skinfolds of 45.2 mm. The true value won't be exactly 45.2 mm, so one
way to take measurement error into account is to specify a likely range or
limits for the true value: a range within which the true value is likely
to fall (for example, 42.2 to 48.2 mm). Likely can be anything we like.
In research projects we usually opt for 95% likely, and later on I devote a
whole page to the concept of confidence limits
for generalizing from a sample to a population. The meaning is much the same
here; the only difference is that we're talking about an individual rather than
an average effect in the population. The 95% confidence or likely limits for
an individual's true value have a 95% chance of enclosing that individual's
true value. Or you can say the odds are 19 to 1 that the subject's true value
will be within the range. You get 95% limits by multiplying the typical error
by about ±2.0. Let's say your typical error
is 1.5 mm for the sum of seven skinfolds on a sample of female gymnasts similar
to your subject. The true value of the skinfold sum is therefore 95% likely
to be within 45.2 ± 2.0x1.5, or 42.2 to 48.2
mm.
Do you tell the gymnast the 95% likely range? No, probably not. A certainty of 95% may be OK for research, but it's too much for assessing an individual. The range represented by ±1.0x the typical error (a 68% range, or 2 to 1 odds of enclosing the true value) is probably the best default way to convey your uncertainty about the true value. It's certainly the easiest to use! You just say to the gymnast, "the odds are 2 to 1 that your real skinfold thickness is between 45.2 ± 1.5, or 43.7 to 46.7 mm". If you are feeling more cautious, say instead "the odds are 9 to 1 that your real skinfold thickness is between 45.2 ± 2.5, or 42.7 to 47.7 mm." The table below summarizes the likely ranges, the odds, and the factors to multiply by your typical error. You can also use a spreadsheet for precision of a subject's true value.
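If you'd rather generate these factors than look them up, here is a sketch using the normal distribution (this assumes the typical error is known exactly; the t-based correction for small reliability studies comes later):

```python
from scipy.stats import norm

def factor_single(prob):
    """Multiplier of the typical error giving the likely limits
    for a single measurement (two-tailed normal quantile)."""
    return norm.ppf(0.5 + prob / 2)

def factor_change(prob):
    """Multiplier for a change or difference score: root2 times
    the factor for a single measurement."""
    return 2**0.5 * factor_single(prob)

# factor_single(0.95) is about 2.0; factor_change(0.95) is about 2.8,
# the multiplier behind the 95% limits of agreement.
```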
Factors for generating likely (confidence) limits for the true value of a single measurement or of a difference or change in a measurement. "Likely" is defined by several values of probability or odds. 

 Likelihood   Odds   Factor for a          Factor for a change
                     single measurement    or difference
    50%        1:1        0.67                  0.95
    68%        2:1        1.00                  1.41
    80%        4:1        1.28                  1.81
    90%        9:1        1.65                  2.33
    95%       19:1        1.96                  2.77^{a}
    99%       99:1        2.58                  3.64

^{a}This factor generates the 95% limits of agreement. 
When the typical error is given as a percent, an approach similar to the above is usually accurate enough. For example, if the typical percent error is 3.0%, the 68% likely range of the true value of a single measurement is ±1.0x3.0 = ±3.0% of the observed value. If you get percent limits of 10% or more, this method becomes less accurate, so you have to use log transformation. But don't worry, it's all taken care of in the spreadsheet.
The factors shown in the table are values of the t statistic for the given probability. The factors get a bit larger for typical errors based on smaller sample sizes, reflecting more uncertainty about the magnitude of the typical error from smaller samples. For 20 subjects measured twice, the factors are accurate enough. If you assess subjects frequently, you should estimate the typical error of your measurement from a larger amount of retesting; otherwise you're likely to mislead all your subjects about the accuracy of their assessments through using an estimate of typical error that is much higher or much lower than the true typical error. See below for more on this issue.
The other way to take error into account when you assess a subject is to specify the likelihood (probability or odds) that the subject's true value is greater than (or less than) a reference value. This method is better for changes in a measurement between tests, but I'll illustrate it here with a simple example. If a skinfold thickness of 42 mm or more had some special significance, you could say to the gymnast "there's a 98% chance that your skinfolds are thicker than 42 mm", or "odds are 50 to 1 that your skinfolds are thicker than 42 mm". The probability and odds come straight from the first example shown on the spreadsheet.
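A sketch of that likelihood calculation, assuming normally distributed measurement error (the function name is mine, not the spreadsheet's):

```python
from scipy.stats import norm

def chance_true_above(observed, reference, typical_error):
    """Chance that a subject's true value exceeds a reference value,
    given one observed value and the typical error."""
    return norm.cdf((observed - reference) / typical_error)

# For the gymnast: chance_true_above(45.2, 42.0, 1.5) is about 0.98,
# i.e. a 98% chance her true skinfold sum is above 42 mm.
```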
Monitoring for
a Change between Measurements
The uncertainty in a change between measurements
is more than in a single measurement, because a change involves two measurements,
each of which has error. But you double the variance, not the typical error,
so the typical error in a change score is root2 times the typical error. The
likely limits for a change in a measurement are therefore root2 times the limits
for a single measurement. See the table above for the factors corresponding
to the different likelihoods. I have incorporated these factors into the spreadsheet.
For an example, let's measure our gymnast again, one month later. Her skinfolds
were 45.2, but now they're 48.5 mm. The coach wants to know if she is really
putting on fat. What do you tell the coach?
First, let's try likely limits. As before, let's assume the typical error is 1.5 mm. The easiest likely limits to calculate for a change score are the 50% limits: simply plus or minus the typical error. The observed change is 3.3 mm, so you'd say there's a 50% chance, or odds of 1:1, that the true change is between 3.3 - 1.5 and 3.3 + 1.5, or 1.8 and 4.8 mm. If we opt for a range that has odds of 4:1 of including the true change (an 80% likely range), the limits are 3.3 ± 1.81x1.5, or 0.6 and 6.0 mm. And so on. Fine, but what percent limits should you use in these practical situations, and how do you use them to decide whether a real change has occurred? Rather than try to answer these hard questions, I will take you through a better method of assessing change.
The better method is based on calculating the likelihood that the true change is bigger than a reference value. For the reference value, you choose the smallest clinically important or worthwhile change. In the above example (observed increase of 3.3 mm, typical error of 1.5 mm), let's say that an increase in skinfolds of 2.0 mm is the smallest change worth worrying about. Obviously, the gymnast's observed change of 3.3 mm is already more than 2 mm, but how likely is it that the true change is more than 2 mm? From the spreadsheet, the likelihood is 73%, or odds of 3 to 1. We should also work out the likelihood that the gymnast's skinfolds have actually decreased (even though we observed an increase). The smallest worthwhile decrease would be -2.0 mm. From the spreadsheet, the chance that the true change is less than -2.0 (a real decrease of more than 2.0 mm) is only 1%, or odds of 1:136. Your advice to the coach? "Odds are 3 to 1 there's been a substantial increase in skinfold thickness, and there's a negligible chance that her skinfolds have decreased. You can assume she's fatter."
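Here is a sketch of those chances; the root2 comes from the error of a change score, and the function name is mine:

```python
from scipy.stats import norm

def chances_of_change(observed, smallest, typical_error):
    """Chances that the true change is a substantial increase, trivial,
    or a substantial decrease. The error of a change score is
    root2 x the typical error."""
    sd_change = 2**0.5 * typical_error
    p_up = norm.sf((smallest - observed) / sd_change)
    p_down = norm.cdf((-smallest - observed) / sd_change)
    return p_up, 1 - p_up - p_down, p_down

# Gymnast example: chances_of_change(3.3, 2.0, 1.5) gives roughly
# (0.73, 0.26, 0.01): a 73% chance of a real increase and about a 1%
# chance of a real decrease, as in the text.
```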
This example is reasonably clear-cut, mainly because the typical error or noise (1.5 mm) is somewhat less than the smallest important change (2 mm). Basically, our measure is precise relative to any changes that matter, so any changes we observe with such a measure are trustworthy. But what if the noise is about equal to the smallest signal? The PowerPoint presentation has a couple of examples for an arbitrary variable with a typical error of 1.0 and a smallest important effect of 0.9. If the observed effect is 1.5, chances are 66% the true effect is clinically positive, 29% the true effect is clinically trivial, and 5% the true effect is clinically negative. It's reasonable to conclude the true effect is (probably) clinically positive. If the observed effect is a clinically trivial 0.5, the likelihood that the effect really is trivial is only 45%, whereas there's a 55% chance something really worthwhile has happened (39% positive, 16% negative). You can conclude that maybe nothing has happened, but acting on it would depend on the relative costs and benefits of taking action or doing nothing.
When the typical error is much greater than the smallest worthwhile change, we will often observe clinically worthwhile changes that are due to error of measurement rather than to any real change. The measure is therefore too noisy to be useful. The chances that real positive or negative changes have occurred (using the spreadsheet) confirm this state of affairs. For example, if the typical error is three times the smallest clinically worthwhile change, and we observe the smallest worthwhile change, the chance of a real positive change having occurred is 50%, or odds of 1:1, but the chance of a real negative change having occurred is 32%, or odds of 1:2.
Noisy measures can still be useful for characterizing worthwhile changes smaller than the noise, but we have to reduce the noise by performing multiple pre and post tests; we then either compare means of the pre and post tests or look for a trend across all the tests. On the other hand, observed changes greater than the typical error may still be trustworthy, if you expected them. In the present example, even a change equal to the typical error (three times the smallest worthwhile change) has likelihoods of a true positive value (68% or 2:1) or a true negative value (17% or 1:5) that would satisfy a practitioner who was expecting such a large change in the subject. But if true changes of such large magnitude are unlikely, we should be prepared to discount large observed changes as measurement error.
By basing our assessment partly on the change we think we're likely to see, we are assessing the individual in a Bayesian fashion. Bayesian analysis is a quantitative method for taking into account our prior belief about something, in this case the subject's true value or change in the true value. Experienced clinicians and practitioners adopt this approach qualitatively when they reject unlikely test results. Bayesian analysis ostensibly allows this kind of decisionmaking to be quantitative. But how can we quantify strength of a belief? For example, if we believe a change couldn't be outside ±3, where does the ±3 come from, and what likely limits define couldn't? 80%, 90%, 95%, 99%... ? At the moment I can't see a satisfactory answer to these questions, but whatever, I have included Bayesian adjustment for the likelihoods and likely limits in the spreadsheet. It took me so long to do, I'd hate to think the time was wasted!
Putting all these examples together with lots of deep thought, I came up with the bullet points at the start of this section on assessing an individual. Go back there now, read them again, and make sure you understand and learn them.
Some researchers have tried to use limits of agreement to make decisions about change in an individual. According to these researchers, you can trust an observed change only if it's greater than the limits of agreement. But limits of agreement are so big (2.8 typical errors) that clinically important trustworthy changes often fall within them. You end up having to ignore changes in your subjects that in some settings might be life-threatening! No, we must abandon limits of agreement as a clinical tool.
Comparing Individuals
All the above calculations for the change
in a single subject's measurements also apply to making decisions about the
difference between two subjects. In the above example, the second measurement
of skinfold thickness (48.5 mm) could have been a measurement of skinfold thickness
of another subject. Your conclusion would be that the second subject has skinfolds
3.3 mm thicker than the first, with odds of 4 to 1 that the real difference
in skinfold thickness is between 0.6 and 6.0 mm. Better still, you could say
that the odds of a real difference in skinfold thickness (more than 2 mm) are
3 to 1.
Spreadsheet
for Assessing an Individual
In this spreadsheet I use the typical error
of measurement and a subject's observed value to estimate likely limits for
the subject's true value and to estimate the likelihood that the subject's true
value is greater than a reference value. I do the same for the change between
two observed values. I also include likelihoods and likely limits for the estimate
of a true criterion value derived from a validity study.
Finally, I've gone to a lot of probably pointless trouble to add Bayesian adjustments
in a second spreadsheet (part of the same file).
Precision of the estimate of a subject's value: Excel latest 

Systematic changes in the mean can be an issue when comparing measures: in general, the bigger the changes between trials, the less desirable the measure. But comparing the typical errors is much more important, because the equipment, protocol, or anthropometrist that produces a measure with less typical error is providing a more accurate measure. Retest correlation contains the typical error, but the fact that it also contains the between-subject standard deviation makes the comparison of correlations either noisy (when there are different subjects in the two reliability studies) or computationally difficult (when the same subjects are in both studies). Besides, there is no point in comparing retest correlations if you have already compared typical errors. I therefore will not deal with comparison of retest correlations.
When setting up a study to compare typical errors, keep in mind that the typical error always consists of biological variation arising from the subjects and technological variation arising from the items. The aim is usually to compare the technological variation, so try to make the biological variation as small as possible. For example, when comparing the reliability of two anthropometrists, you would get them to measure the same subjects within an hour, to avoid any substantial biological variation. Similarly, when comparing the reliability of power provided by two ergometers, use athletes as subjects, because they are usually more reliable than nonathletes.
Comparing the reliability of two items (protocols, equipment, or operators) is straightforward when different subjects are used to get the reliability for each item. Confidence limits for the ratio of the typical errors between corresponding trials in the two groups can be derived from an F ratio. Use Item 4 in the spreadsheet for confidence limits for this purpose. To compare changes in the mean between corresponding pairs of trials for the two measures, you will need to use an unpaired t test of the change scores. Using the same subjects has more power but requires analysis by an expert. (The analysis needs a mixed model, in which the equipment is a fixed effect, trial number is a fixed effect, subjects is a random effect, and a dummy random variable is introduced to account for the extra withinsubject variance associated with measures on one of the items. Confidence limits for the extra variance tell you how different the typical errors could be. The model also provides an estimate of the difference in changes in the mean between the items, or you can use a paired t test.)
In the previous section I said that the 95% likely range is too conservative for assessing individuals, and I also said that it's difficult to decide on what percent range to use. The same argument and difficulty applies for comparison of typical errors in a clinical or field setting. It won't hurt to calculate, say, the 80% likely range, but I think clinicians and practitioners (and you!) will have a better chance of understanding what I'm getting at if you use likelihood that one typical error is substantially smaller or larger than the other. You compare typical errors by dividing one by the other, to get a ratio. A ratio of 1.1 or maybe 1.2 is my best guess at the minimum worthwhile difference in reliabilities, so you calculate the likelihood (as a probability or odds ratio) that one measure has a typical error at least 1.1x (or 1.2x) bigger than the other. It's all on the spreadsheet for confidence limits.
Phew! Let's see what sample size we'll need to estimate reliability for each of its applications. The applications are: estimating sample size for an experiment, comparing reliability of different measures, estimating individual responses in an experiment, and assessing an individual. We'll assume modest reliability: a typical error of the same order of magnitude as the smallest change that matters to subjects. We'll find that a sample of 50 subjects tested three times gives reasonable precision for the estimate of the typical error. That's assuming you can combine the data for all three trials to estimate the typical error. If there is a substantial learning or practice effect on the typical error between the first and second trials, you will need another trial (four in total) so you can combine the last three.
Sample Size for Reliability
Studies...
...for Estimation of Sample Size for an Experiment
When you use a value of the typical error
to estimate the sample size for an experiment, uncertainty
in the typical error translates into uncertainty in the sample size you will
need for the experiment. Sample size for an experiment is inversely proportional
to the square of the typical error, so uncertainty in the typical error balloons
into much bigger uncertainty in sample size for an experiment. You can check
the effect of number of subjects and retests on precision of the typical error
by plugging numbers into the appropriate cells of Item 3 on the spreadsheet
for confidence limits. Give the typical error a value of 1.0, then pretend
you got this value from a reliability study of 10 subjects tested twice
(= 9 degrees of freedom). You will find that the 95% confidence limits for the
true typical error are 0.69 to 1.83; square these and you get the uncertainty
in sample size as factors of 0.47 to 3.33. In other words, if you predicted
a sample size of, say, 40 subjects in the experiment on the basis of a typical
error of 1.0, what you might really need is anything from 19 to 133. Well, that's
far too wide a range! Let's try a reliability study with 50 subjects tested
three times. The range in sample size becomes 31 to 54, which is still quite
a lot of uncertainty, but I guess it's OK.
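To reproduce those limits, here is a sketch using the chi-squared distribution for the sampling variation of a standard deviation (the function name is mine; degrees of freedom are roughly (subjects - 1) x (trials - 1), as in the examples above):

```python
from scipy.stats import chi2

def typical_error_limits(s, dof, conf=0.95):
    """Confidence limits for the true typical error, given the observed
    typical error s and its degrees of freedom."""
    lower = s * (dof / chi2.ppf(0.5 + conf / 2, dof)) ** 0.5
    upper = s * (dof / chi2.ppf(0.5 - conf / 2, dof)) ** 0.5
    return lower, upper

# 10 subjects tested twice (9 dof): an observed typical error of 1.0 has
# 95% limits of about 0.69 to 1.83. Square these (0.47 to 3.3) to get the
# uncertainty in the sample size for the experiment.
```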
This calculation is based on 95% limits of uncertainty for the typical error, which may be a bit too conservative for the likely limits of the sample size in the experiment. If instead we use 67% likely limits, we end up with something more like the typical variation in the estimate of sample size based on the reliability study. For a reliability study of 10 subjects tested twice, the typical variation in our estimate of sample size would be, for example, 28 to 72. Still too wide. Test them three times and you get 30 to 59. That's better, but the required sample size could easily be outside these 67% limits.
So what's my advice? If you have the time, money, and subjects for a large reliability study, go for it. Otherwise you're better off devoting your resources to the experiment by using sample size on the fly: stop testing subjects when you have adequate precision for the effect.
Sample Size for Reliability
Studies...
...for Comparing Reliability of Measures
When you want to compare the reliability of
two measures, the worstcase scenario is that you observe similar reliabilities
for the two measures. (You might see why this is worstcase in a minute.) In
this scenario, you want to conclude that there are no substantial differences
between the measures. The easiest way to compare typical errors is to compute
their ratio and its confidence limits. You will then be able to conclude there
is no substantial difference if the upper limit of the ratio is only a little
greater than 1.00 and the lower limit is only a little less than 1.00. Let's
generate some confidence limits for the ratio using Item 4 (ratio of standard
deviations) in the spreadsheet for confidence
limits. Make the two typical errors the same (e.g. 3.0), and pretend each
has come from a study with 100 degrees of freedom (51 subjects, 3 trials). You'll
see that the 95% confidence limits for the ratio of the typical errors are 0.82
to 1.22. In other words, the true values of the typical errors could be about
20% different from each other. That amount of uncertainty is marginal, in my
view, but again, 95% confidence limits are probably too stringent in a reallife
situation where you are choosing between two items of equipment. The 80% confidence
limits for the ratio are 0.88 to 1.14, which make me feel more comfortable about
concluding that there is no real difference in the reliability of the two measures.
I feel even more comfortable looking at the likelihood that the true value of
the ratio is greater than 1.2: it's only 3%, or odds of 1 in 28. There is no
substantial difference in the reliability of these two measures, if by "substantial"
we mean different by a factor of 1.2 or more.
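A sketch of that likelihood, using the F distribution for a ratio of variances (the function name is mine, not the spreadsheet's):

```python
from scipy.stats import f

def chance_ratio_above(observed_ratio, reference_ratio, dof1, dof2):
    """Chance that the true ratio of two typical errors exceeds a
    reference ratio, given the observed ratio and the degrees of
    freedom of each typical error."""
    return f.sf((reference_ratio / observed_ratio) ** 2, dof1, dof2)

# Equal observed typical errors with 100 dof each: the chance the true
# ratio exceeds 1.2 is only about 3%, as in the text.
```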
Things aren't so bad when you observe a big difference between the typical errors of the measures, because you will need fewer subjects to conclude that one really is substantially worse (larger) than the other. Try it for yourself with the spreadsheet: make the observed typical errors 2.0 and 3.0, give them both only 20 degrees of freedom, make the reference ratio 1.15, say, then look at the likelihood that one typical error is substantially greater than the other: 88%, or odds of 7:1. Not much doubt about it: they're different!
Finally, if you can use the same subjects for both reliability studies, you're bound to get better precision for the ratio and therefore a reduction in sample size required to make firm conclusions about the relative magnitudes of the typical errors. Sorry, I haven't worked out how big the reduction is yet. You can't do it with the spreadsheet; you have to use mixed modeling or bootstrapping.
Sample Size for Reliability
Studies...
...for Estimating Individual Responses
Estimation of individual responses to a treatment
boils down to a comparison of the typical errors of two groups (the treatment
and control groups), so the sample size must be the same as for a comparison
of the reliability of two measures.
Sample Size for Reliability
Studies...
...for Assessing an Individual
At first glance it appears you can use as
few as 20 subjects and two trials to estimate a typical error without substantially
degrading the precision of an individual assessment. Check the spreadsheet
for precision of a subject's true value to see what I mean. In Item 1, put
in an observed value of 50, a typical error of 2.0 from two trials, and compare
the likely limits for the subject's true value when the typical error is based
on 20 subjects vs 2000 subjects. With 20 subjects the 80% likely limits for
the subject's true value are 47.3 to 52.7, or 50 ± 2.7; for 2000 subjects
the limits are 47.4 to 52.6, or 50 ± 2.6. In other words, there's a negligible
increase in the likely limits (= loss of precision) for the smaller sample size.
But wait a minute... the typical error based on a sample of 20 subjects and
two trials is really noisy. Check the spreadsheet
for confidence limits and you'll see, for example, that a typical error
of 2.0 has 95% likely limits of 1.5 to 2.9. That's a big range in precision.
What gives?
Well, 20 subjects and two tests definitely give you almost as much accuracy as a zillion subjects and tests, and that's fair enough if you are assessing only one individual. If another clinician tested 20 subjects twice, then assessed another individual, it would be the same story. But there's likely to be a big difference between your typical errors; for example, yours might be 2.5, and the other clinician's might be 1.7. Your assessments of, say, 80% likely limits based on a typical error of 2.5 would really be ~90% likely limits, while the other person's 80% likely limits based on a typical error of 1.7 would be ~70% likely limits. You're both giving misleading assessments, and so would many other clinicians who tested only 20 subjects twice. Yet averaged over all clinicians and all subjects, the true values of 80% of subjects would be within the likely limits that each clinician tells each subject. The trouble is that your assessments will be consistently misleading, if you are unlucky enough to get a typical error of 2.5 or 1.7 with your batch of 20 subjects. A typical error based on 50 subjects and three tests would usually be in the range of 1.8 to 2.2, and if you used 2.2 in your assessments, your 80% likely limits would be less than 85% limits in reality, which seems OK to me. But I'm still thinking about it...
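A sketch of that comparison, using the t distribution to allow for uncertainty in the typical error itself (degrees of freedom as before; the function name is mine):

```python
from scipy.stats import t

def likely_limits(observed, typical_error, dof, prob=0.80):
    """Likely limits for a subject's true value when the typical error
    comes from a reliability study with finite degrees of freedom."""
    factor = t.ppf(0.5 + prob / 2, dof)
    return observed - factor * typical_error, observed + factor * typical_error

# 20 subjects, two trials (19 dof): likely_limits(50, 2.0, 19) gives about
# (47.3, 52.7); with 2000 subjects the limits barely narrow, to about
# (47.4, 52.6), as in the text.
```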