A New View of Statistics
The group you select doesn't have to be the bottom or top half, and the test doesn't have to be the first one. Any group or even any subject you choose with an average score below or above the mean of all the subjects in a given test will probably move (regress) noticeably closer to the mean in another test. In general the scores don't move completely to the mean; they just get closer to it. It is therefore more accurate to call the phenomenon regression towards the mean.
OK, so low scorers tend to get better on retest, and high scorers tend to get worse? Well, no, actually. Depending on the nature of your data, the change in the scores towards the mean may be partly or even entirely a statistical artifact. If it's entirely an artifact, the true scores of the subjects don't really change on retest; it just looks that way. When that happens in, for example, a training study, your analysis might lead you to conclude that the least fit subjects got a big benefit from the training, whereas the fittest subjects got a smaller benefit or may even have got worse. In reality, all subjects may have increased in fitness by a similar amount, regardless of initial fitness. Your conclusion about the effect of initial fitness could be artifactual garbage.
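The training-study scenario above can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the function name, the population mean of 50, the standard deviations of 5, and the uniform true gain of 2 units are all hypothetical values chosen for illustration, not figures from any real study.

```python
import random
import statistics

def simulate_training_study(n=2000, true_sd=5.0, noise_sd=5.0, gain=2.0, seed=1):
    """Simulate a pre-test and post-test in which EVERY subject truly gains `gain`.

    All parameter values are illustrative assumptions.
    Returns the mean observed change for the low and high pre-test halves.
    """
    rng = random.Random(seed)
    subjects = []
    for _ in range(n):
        true_score = rng.gauss(50.0, true_sd)                 # stable true fitness
        pre = true_score + rng.gauss(0.0, noise_sd)           # noisy pre-test
        post = true_score + gain + rng.gauss(0.0, noise_sd)   # noisy post-test
        subjects.append((pre, post))

    # Split subjects into low and high scorers on the observed pre-test.
    median_pre = statistics.median(pre for pre, _ in subjects)
    low = [post - pre for pre, post in subjects if pre < median_pre]
    high = [post - pre for pre, post in subjects if pre >= median_pre]
    return statistics.mean(low), statistics.mean(high)

low_change, high_change = simulate_training_study()
print(f"mean change, low pre-test half:  {low_change:.2f}")
print(f"mean change, high pre-test half: {high_change:.2f}")
```

Although the true gain is identical for everyone, the low half shows an observed change well above 2 and the high half shows one well below 2. The difference between the halves comes entirely from the noise in the pre-test used to select them.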
Regression to the mean can lead to similar mistakes with repeated observation or testing of the health or performance of an individual. Consider a patient with a chronic health problem. Depending on the problem, symptoms can fluctuate in severity over a period of weeks or months, for no apparent reason. When the symptoms get really bad, the patient may try a new alternative therapy. The symptoms then improve, because they were bound to improve from their atypical severe level. The patient can be forgiven for thinking that the new therapy worked. Later on, the patient stops taking the new therapy, the symptoms get bad again, the patient takes the therapy again, the symptoms improve... Get the picture? You can imagine a similar scenario with an athlete who turns in a particularly bad performance, then does something about it. Whatever the athlete does, it's likely to work, artifactually. Now you can understand why there is so much snake oil on the shelves of drug stores.
I'll now deal with the nature of the artifact when you analyze data from a group of individuals. The subsections are: the cause of the artifact, the magnitude of the artifact, and how to avoid the artifact.
The noise responsible for regression to the mean can come from two sources: the measuring instrument (technical or technological noise) and the subjects themselves (within-subject variation from test to test). I use the word instrument in its most generic sense: it could be a questionnaire, a device for measuring oxygen consumption, or whatever. If the noise comes solely from the instrument, regression to the mean is unquestionably an artifact. But if the noise is due to within-subject variation, there is a sense in which the regression to the mean is real. I'll explain with an example.
Suppose you administer two fitness tests several months apart. Several months is long enough for many subjects to change their fitness substantially: some will be fitter, some less fit. The "noise" in the test could be due almost entirely to these random but real within-subject changes in fitness. So when you select a subgroup with low fitness scores in the first test, the increase in their fitness in the second test is a real increase. If the increase is real, is there still a problem? Yes, because you could easily attribute the increase in fitness to something you had done between the tests, such as a training or nutritional intervention. The increase in fitness is real, but some of it was going to happen anyway, regardless of whatever you did. There are many papers in the literature in which the authors did not take account of regression to the mean when they claimed that their treatment produced a bigger increase in fitness on subjects with lower initial fitness.
The retest correlation is involved in regression to the mean, because the correlation is a measure of the magnitude of the noise in the measurement. The formula for r is (SD² - sd²)/SD², where sd is the within-subject standard deviation (the typical or standard error of measurement, or the noise) and SD is the usual between-subject standard deviation in either test. Rearranging, 1 - r = the fractional shift towards the mean = sd²/SD². If sd is small relative to SD, there is little regression to the mean. At the other extreme, when SD = sd, subjects are effectively identical (the only difference between subjects is noise), so all pre-selected scores that differ from the mean will, on average, regress completely to the mean on retest.
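A worked example of the formula may help. The sketch below uses hypothetical function names and made-up numbers (sd = 3, SD = 6, a mean of 50), and it assumes there is no real change in the mean between tests.

```python
def retest_correlation(sd, SD):
    """r = (SD^2 - sd^2) / SD^2, where sd is the within-subject SD (noise)
    and SD is the between-subject SD in either test."""
    return (SD**2 - sd**2) / SD**2

def expected_retest_score(score, mean, sd, SD):
    """Expected retest score for a subject or group selected at `score`.

    The score is expected to move the fraction (1 - r) = sd^2 / SD^2 of the
    way towards the mean, assuming no real change in the mean between tests.
    """
    shift_fraction = sd**2 / SD**2  # equals 1 - r
    return score - shift_fraction * (score - mean)

# Illustrative values only: noise of 3 units, between-subject SD of 6 units.
r = retest_correlation(sd=3.0, SD=6.0)
expected = expected_retest_score(score=40.0, mean=50.0, sd=3.0, SD=6.0)
print(r)         # 0.75
print(expected)  # 42.5
```

Here a score 10 units below the mean is expected to recover a quarter of that distance on retest, purely because of noise. Setting sd equal to SD gives r = 0 and an expected retest score equal to the mean, which is the complete regression described above.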
The above formulae will allow you to estimate how much of a change in the mean is artifactual, but you should also be concerned about precision of the estimate, that is, the confidence limits for the true value. Bill Trochim does not have a formula for the confidence limits for the adjusted change in the mean. In the next section I will explain how to use the formula and get confidence limits.
Correcting the Change Scores
To use this approach, you will need to know either the retest correlation coefficient (r) or the within-subject variation (standard deviation, sd) for the dependent variable. Both must come from a reliability study with subjects and time between tests similar to those in your study. In my experience, an appropriate reliability study is often not available in the literature, so you will have to guesstimate the reliability from less applicable reliability studies. Guesstimate an sd rather than an r, because r is sensitive to the between-subject standard deviation of the subjects in the reliability study.
Armed with the reliability sd or r, proceed as follows. Subtract the pre-test mean of all subjects from each subject's pre-test score. Multiply that difference either by sd²/SD² or by (1 - r), where SD is the usual between-subject standard deviation of your subjects in the pre-test. Now add the result (or subtract it when it is negative) to the post-pre change score for that subject. This corrected change score is free of the artifact. Use it in your analyses just as you would any change score. For example, do an unpaired t test to compare subjects with low vs high pre-test scores. Better still, plot the corrected change scores on the Y axis against the pre-test scores on the X axis. If the points form something like a line, derive the slope of the line as an estimate of the effect of pre-test score on the effect of the treatment.
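The correction steps above can be sketched in a few lines. The function name and the example scores are hypothetical, and the value r = 0.75 is an assumed reliability standing in for one you would take from a suitable reliability study.

```python
import statistics

def corrected_change_scores(pre, post, sd=None, r=None):
    """Correct post-pre change scores for regression to the mean.

    Supply exactly one of:
      sd - within-subject standard deviation from a reliability study
      r  - retest correlation from a reliability study
    """
    if (sd is None) == (r is None):
        raise ValueError("supply exactly one of sd or r")

    pre_mean = statistics.mean(pre)
    if r is not None:
        factor = 1.0 - r
    else:
        SD = statistics.stdev(pre)   # between-subject SD in the pre-test
        factor = sd**2 / SD**2

    # Add factor * (pre - pre_mean) to each change score; the term is
    # negative for subjects who scored below the pre-test mean.
    return [(y - x) + factor * (x - pre_mean) for x, y in zip(pre, post)]

# Made-up scores in which low scorers appear to improve the most.
pre = [40.0, 50.0, 60.0]
post = [44.5, 52.0, 59.5]
raw = [y - x for x, y in zip(pre, post)]
corrected = corrected_change_scores(pre, post, r=0.75)
print(raw)        # [4.5, 2.0, -0.5]
print(corrected)  # [2.0, 2.0, 2.0]
```

In this contrived example the apparent effect of pre-test score on the change disappears after correction, leaving the same change of 2 units for every subject. Real data will not be this tidy, and the sketch does not address the confidence limits.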
Be aware that the confidence interval (or p value) for any effects involving the adjusted change score will be too small if the reliability study had a small sample size, owing to uncertainty in the estimate of sd or r. The effects, such as the difference between high and low scorers or the slope of the line in the examples above, will also be biased if the r or sd from the reliability study are substantially different from what your subjects would show in a reliability study with the same time between tests as in your study.
Using a Control Group
Using a control group is a better approach than correcting the change score. Actually, the approaches are fundamentally the same, because the control group is effectively the most appropriate reliability study for correcting the change scores. But don't use the control group to correct each subject's change score. Instead, analyze the effect of the pre-test score on the change score in both groups in the same manner, then compare the effect in the treatment group with that in the control group. The analysis will require a two-way analysis of variance (ANOVA) or covariance (ANCOVA). For example, suppose Ychng is the dependent variable representing each subject's post-pre change score, suppose Group has levels control and intervention, and suppose Prescore represents the pre-test score. The model is:
Ychng <= Group Prescore Group*Prescore.
If Prescore has the numeric values of the pre-test score, the model represents an ANCOVA. If instead you have coded the pre-test scores into two levels, such as low and high, the model is a 2-way ANOVA. Not that it matters what you call it: either way, you are interested only in the interaction term Group*Prescore, which yields the difference between the groups in the effect of the pre-test score on the change score (that is, on the effect of the treatment).
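For the ANCOVA form of the model, the point estimate of the Group*Prescore interaction is the difference between the two groups in the slope of change score against pre-test score. The sketch below computes just that point estimate by fitting a separate least-squares line in each group. The function names and the data are hypothetical, and the sketch gives no confidence limits or p value; for those you would fit the full model in a statistics package.

```python
import statistics

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

def prescore_interaction(data):
    """Point estimate of the Group*Prescore interaction.

    `data` is a list of (group, prescore, ychng) tuples, where group is
    "control" or "intervention". Returns the intervention slope minus the
    control slope, plus the two slopes themselves.
    """
    slopes = {}
    for group in ("control", "intervention"):
        x = [p for g, p, _ in data if g == group]
        y = [c for g, _, c in data if g == group]
        slopes[group] = slope(x, y)
    return slopes["intervention"] - slopes["control"], slopes

# Made-up data: (Group, Prescore, Ychng)
data = [
    ("control", 40.0, 2.5), ("control", 50.0, 0.0), ("control", 60.0, -2.5),
    ("intervention", 40.0, 8.0), ("intervention", 50.0, 4.0),
    ("intervention", 60.0, 0.0),
]
interaction, slopes = prescore_interaction(data)
print(slopes)       # control slope -0.25, intervention slope -0.4
print(interaction)  # about -0.15
```

The control slope of -0.25 is what regression to the mean alone produces in these invented data. Only the extra -0.15 in the intervention group would be attributed to an effect of pre-test score on the effect of the treatment.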
Watch out for non-uniform error! The standard deviation of the change scores in the treatment group may be larger than that in the control group, and there may be differences in the standard deviation for different values of Prescore, when there is a substantial true effect of pre-test score on the change score. The only way to take such non-uniform error into account properly is to use mixed modeling to specify different error terms for the different groups. Sorry, that's the way it is, guys. It's time you upskilled to the mixed
Last updated 26 June 06