A Spreadsheet for Analysis of Straightforward Controlled Trials
Will G Hopkins
sportsci.org/jour/03/wghtrials.htm, 2003 (4447 words)
Amongst the various kinds of research design, a controlled trial is the best for investigating the effect of a treatment on performance, injury or health (Hopkins, 2000a). In a controlled trial, subjects in experimental and control treatment groups are measured at least once pre and at least once during or post their treatment. The essence of the analysis is a comparison between groups of an appropriate measure of change in a dependent variable representing performance, injury, or health. The final outcome statistic is thus a difference in a change: the difference between groups in their mean change due to the experimental and control treatments. Another form of controlled trial is the crossover, in which all subjects receive all control and experimental treatments, with sufficient time following each treatment to allow its effect to wash out. The final outcome statistic in a crossover is simply the mean change between treatments.
Calculating the value of the outcome statistic in controlled trials and crossovers is easy enough. More challenging is making inferences about its true or population value in terms of a p value, statistical significance, confidence limits, and clinical significance. The traditional approach is repeated-measures analysis of variance (ANOVA). In a slideshow I presented at a conference recently, I pointed out how this approach can lead to the wrong p value and therefore wrong conclusions about the effect of the treatment in controlled trials. I also explained how the more recent approach of mixed modeling not only gives the right answers but also permits complex analyses involving additional levels of repeated measurement in addition to covariates for subject characteristics. Mixed modeling is available only in advanced, expensive, and user-unfriendly statistical packages, such as the Statistical Analysis System (SAS), but straightforward analysis of controlled trials either by repeated-measures ANOVA or mixed modeling is equivalent to a t test, which can be performed on an Excel spreadsheet.
In the last few years I have been advising research students to use a spreadsheet for their analyses whenever possible and to consult a statistician only when they need help with more complex analyses. Spreadsheets I have devised for various analyses (reliability, validity, assessment of an individual, confidence limits and clinical significance) seem to have facilitated this process. Although it may be instructive for students to devise a spreadsheet from scratch, they probably reach a more advanced level more rapidly by using a spreadsheet that already works: anytime they want to learn more about the calculations, they need only click on the appropriate cells. Use of a pre-configured spreadsheet surely also reduces errors and saves more time (the student's and the statistician's). I have therefore devised a spreadsheet for analysis of straightforward controlled trials, and I have modified it for analysis of crossovers.
Features of the spreadsheet for controlled trials include the following…
• The usual analysis of the raw values of the dependent variable, based on the unequal-variances unpaired t statistic.
The data in the spreadsheet are for one pre- and two post-treatment trials, and the effects are the differences in the three pairwise changes between the trials (post1-pre, post2-pre, and post2-post1). You can easily add more trials and effects, including parameters for line or curve fits to each subject's trials–what I call within-subject modeling (Hopkins, 2003a). As for the role of the unequal-variances t statistic, see the section on uniformity of residuals at my stats site (Hopkins, 2003b) and also the slideshow I referred to earlier. In short, never use the equal-variances t test, because the variances are never equal. (The variances in question are the squares of the standard deviations of the change scores for each group.)
With three or more groups (for example, a control and two treatment groups), you will have to use a whole new spreadsheet for each pairwise comparison of interest. Enter the data for two groups, save the spreadsheet, resave it with a new name, then replace one group's data with those of the third group. Save, resave, then copy and paste data and so on until you have all the pairwise group comparisons.
The spreadsheet does not provide any adjustment for so-called inflation of Type 1 error with multiple group comparisons or with the multiple comparisons between more than two trials (Hopkins, 2000b). These adjustments (Tukey, Sidak, and so on) probably don't work for repeated measures, and in any case they are nonsense, for various reasons that I detail at my stats site and in that slideshow. The main reason is that the procedure, which involves doing pairwise tests with a conservatively adjusted level of significance only if the interaction term is significant, dilutes the power of the study for the most important comparison (for example, the last pre with the first post for the most important experimental vs control group). I don't believe in testing for significance anyway, but even if I did, I would be entitled to apply a t test to the most important pre-planned comparison without looking at the interaction and without adjustment of significance for multiple comparisons. Some of the best researchers in our field fail to understand this important point.
For example, if the effect of the treatment is to enhance performance by a few percent regardless of a subject's pre-test value, analysis of the raw data will give the wrong answer for most subjects. For these and most other performance and physiological variables, analysis of the logarithm of the raw values gives the right answer. Along with logarithmic transformation, the spreadsheet has square-root transformation for counts of injuries or events, arcsine-root transformation for proportions, and percentile-rank transformation (equivalent to non-parametric analysis) when an appropriate formula for a transformation function is unclear or unspecifiable (Hopkins, 2003c and other pages).
A dependent variable with a grossly non-normal distribution of values and some zeros thrown in for good measure is a good candidate for rank transformation. An example is time spent in vigorous physical activity by city dwellers: the variable would respond well to log transformation, were it not for the zeros.
Dependent variables with only two values (example: injured yes/no) and Likert-scale variables with any number of points (example: disagree/uncertain/agree) can be coded as integers and analyzed directly without transformation (Hopkins, 2003b). I now code two-value variables and 2-point scales as 0 or 100, because differences or changes in the mean then directly represent differences or changes in the percent of subjects who, for example, got injured or who gave one of the two responses. Advanced approaches to such data involve repeated-measures logistic regression, but the outcomes are odds ratios or relative risks, which are hard to interpret.
If you have a variable with a lower and upper bound and values that come close to either bound, consider converting the values so that they range from 0 to 100 ("percent of full-scale deflection"), then applying the arcsine-root transformation. Composite psychometric scores derived from multiple Likert scales should behave well under this transformation, especially when a substantial proportion of subjects respond close to the minimum or maximum values.
• Plots of change scores of raw and transformed data against pre-test values, to check for outliers and to confirm that the chosen transformation results in a similar magnitude of change across the range of pre-test values.
These plots achieve the same purpose as plots of residual vs predicted values in more powerful statistical packages. Statisticians justify examination of such plots by referring to the need to avoid heteroscedasticity (non-uniformity of error) in the analysis, which for a controlled trial means the same thing as aiming for uniformity in the effect of the treatment.
Sometimes it's hard to tell which transformation gives the most uniform effect in the plots. Indeed, when there is little variation in the pre-test values between subjects, all transformations give uniform effects and the same value for the mean effect after back transformation. Regardless, your choice of transformation should be guided by your understanding of how a wide variation in pre-test values would be likely to affect the effects.
After applying the appropriate transformation, you may sometimes still see a tendency for subjects with low pre-test values to have more positive change scores (or less negative change scores) than subjects with high pre-test values. This tendency may be a genuine effect of the pre-test value, but it may also be partly or wholly an artefact of regression to the mean (Hopkins, 2003d). You address this issue by performing an analysis of variance or general linear model with the change score as the dependent variable, group identity as a nominal predictor, and the pre-test value as a numeric predictor or covariate interacted with group. The difference in the slope of the covariate between the two groups is the real effect of pre-test value free of the artefact.
• Various solutions to the problem of back-transformation of treatment effects into meaningful magnitudes.
Log-transformation gives rise to percent effects after back transformation. If the percent effects or errors are large (~100% or more, as occurs with some hormones and assays for gene expression), it is better to back-transform log effects into factors. For example, an increase of 250% is better expressed as a factor of 3.5.
Another approach, which as far as I know is novel, is to estimate the value of the effect at a chosen value of the raw variable. I have included this approach for back-transformation from percentile-rank, square-root, and arcsine-root transformations. See the spreadsheet to better understand what I mean here. Note that I have not included this approach with log transformation, because percent and factor effects are better ways to back transform log effects.
Finally, I have also expressed magnitudes of effects for the raw variable and for all transformations as Cohen effect sizes: the difference in the changes in the mean as a fraction or multiple of the pre-test between-subject standard deviation. You should interpret the magnitude of the Cohen effect sizes using my scale: <0.2 is trivial, 0.2-0.5 is small, 0.6-1.1 is moderate, 1.2-1.9 is large, and 2.0 or more is very large (Hopkins, 2002a). Use Cohen effect sizes for effects that relate to population health, for effects derived from physiological, biomechanical, or other mechanism variables, and for some measures of performance that have no direct association with competitive performance scores (for example, field tests for team-sport athletes). Do NOT use Cohen effect sizes for direct measures of competitive athletic performance or performance tests directly related thereto; instead express the effects as percents or as raw units.
• Estimates of reliability in the control group, expressed as typical error of measurement and change in the mean.
The design and analysis of the control group on its own amount to a reliability study. You can therefore compare measures of reliability derived from the control group with those of published reliability studies or your own pilot reliability study, if you did one. The most important measure of reliability is the typical error: the typical variation (standard deviation) a subject would show in many repeated trials (Hopkins, 2000c and my stats pages on reliability). The typical error shown in the spreadsheet is simply the standard deviation of the change score in the control group divided by Ö2. If your error differs substantially from that of previous studies, you should try and explain why in your write-up. Note that one obvious explanation for a worse error in your study is a longer time between trials compared with that in reliability studies.
The change in the mean from trial to trial is another measure of reliability. A substantial change usually indicates that the subjects (or the researcher) showed a practice, learning, or other familiarization effect between the trials. Again, you will need to explain any such change in your study. A substantial change in the mean can also help explain a worse typical error than in previous studies: there are usually individual responses to the change (some subjects show more familiarization), and individual responses produce more variation in change scores.
• Estimates of individual responses to the treatment, expressed as a standard deviation for the mean effect, in the various forms of the transformed and back-transformed variable.
If subjects differ in their response to the experimental treatment after any appropriate transformation, the standard deviation of the change scores in the experimental group will be greater than that in the control group. It is easy to show from basic statistical theory that the variation in the response between individuals, expressed as a standard deviation, is the square root of the difference in the variances (square of the standard deviations) of the change scores. This estimate of variation is free of error of measurement, although error of measurement obviously contributes to its uncertainty.
The standard deviation representing individual responses is the typical variation in the response to the treatment from individual to individual. For example, if the mean response is 3.0 units and the standard deviation representing individual responses is 2.0 units, most individuals (about two-thirds) will have a true response somewhere in the region of 1 to 5 (3-2 to 3+2). You can state that the effect of the treatment is typically 3.0 ± 2.0 units (mean ± standard deviation).
Sometimes the observed difference in the variances of the change scores is negative, either because the experimental treatment somehow reduces variation between subjects (for example, by bringing them up to a common ceiling or plateau), or more likely because sampling variation results in a negative difference purely by chance. I have devised the convention of showing the square root of a negative variance as a negative standard deviation. You should interpret such negative standard deviations as indicative of no individual responses.
If you find that there are substantial individual responses to a treatment, the next step is to find the subject characteristics that predict them. To do that properly, you have to include subject characteristics in the analysis as covariates, which is not possible with the spreadsheet. However, you can do a publication-worthy job by correlating or plotting the change scores with the subject characteristics that you think might predict the individual responses. See the slideshow on repeated measures for more information on individual responses, including a caution about misinterpreting a poor correlation between change scores.
• A comparison of pre-test values of means and standard deviations in the experimental and control groups for all transformations and back-transformations, to check on the balance of assignment of subjects to the groups.
The purpose of a control group is to provide an estimate of the change that occurs in the absence of the experimental treatment. This change is then subtracted from the change in the experimental group to give the pure effect of the experimental treatment. Fine, but suppose the pre-test score affects the change resulting from the experimental and control treatments (example: subjects with lower pre-test scores respond more to either or both treatments). Suppose also that the mean pre-test score differs substantially between the groups–that is, the groups are not balanced with respect to the pre-test score. Under these circumstances, part of the difference between the change scores will be due to the lack of balance, so the analysis will not provide a correct estimate of the experimental effect. It is therefore important to compare the two groups and comment on the difference in the mean pre-test score. The spreadsheet provides you with such a comparison. You can describe the magnitude of the difference between the groups either by noting whether the difference is greater than the smallest worthwhile change (see the last bullet point below) or by interpreting the Cohen effect size for the difference as trivial, small, and so on, when Cohen is appropriate. It is equally important to compare the group mean values of any other subject characteristic that could interact with the two treatments (examples: age, competitive level, proportion of males…), but you will have to re-jig the spreadsheet for such comparisons yourself.
When there is a substantial difference between the groups in the pre-test mean value of the dependent variable or of another subject characteristic, make sure you examine the plots of the change scores against pre-test value or create plots to examine the effect of the other subject characteristic on the change scores. If you see a substantial effect of pre-test value on the change scores, perform the analysis of variance or general linear model I described in the section on plots of change scores to deal with regression to the mean. That way you will have adjusted statistically for the differences in the groups in the pre-test. Your stats package should give you the experimental effect as the difference in the change for subjects who would be on the overall mean value of the covariate, because this value corresponds to the expected value for the population represented by the full sample. You might need an expert to help you here.
You can also use this part of the spreadsheet for a comparison of means (and standard deviations) of any two independent groups without repeated measurements. Just ignore all the analyses related to changes in the means.
• Estimates of uncertainty expressed as confidence limits at any percent level for all effects, including the standard deviations representing individual responses.
Confidence limits define a range within which the true value (that is, the population value) of something is likely to occur, where likely is traditionally with 95% certainty. I now favor 90% confidence limits, because I believe 95% limits give an impression of too much uncertainty. They are also too easily reinterpreted in terms of statistical significance at the 5% level, which I am also keen to avoid.
The confidence limits for the treatment effects are generated using the same formulae as in my spreadsheet for confidence limits. Confidence limits for the standard deviation representing individual responses are based on one of the methods used in Proc Mixed in SAS. The sampling distribution of the difference of the variances of the change scores is assumed to be a normal distribution. The variance of the sampling distribution of a variance is 2(variance)2/(degrees of freedom). The variance of the difference in the variances is simply the sum of the two sampling variances, because the control and experimental groups are independent.
I have supplied confidence limits for the comparison of the two groups in the pre-test, but I don't think we should use them for controlled trials. After all, any difference between the two groups is real as far as imbalance in the assignment is concerned. What matters is how different the groups were for the study, not how different the groups could be if we drew a huge sample to get the difference between the population values for the control and experimental groups. But if you are using the spreadsheet only for a comparison of means and standard deviations of a single measurement of subjects in two groups, then of course confidence limits are appropriate.
• Chances that the true value of an effect is important, for all comparisons of means.
Important means worthwhile or substantial in some clinical, practical, or mechanistic sense. You provide a value for the effect that you consider is the smallest that would be important for your subjects. The spreadsheet then estimates the chances that the true value is greater than this smallest important value, and it also shows the chances in a qualitative form (unlikely, possible, almost certain, and so on).
Sometimes you have no clear idea of the smallest important value, especially for physiological or other mechanism variables. In such cases, use Cohen's smallest effect size of 0.2, which is included in the spreadsheet as the default. By inserting values of 0.6, 1.2, or 2.0, you can also estimate the chances that the true value is moderate, large, or very large. Try these different values when you want to say something positive about a smallish effect that has considerable uncertainty arising from large error of measurement, individual responses, or a small sample size. You can usually state something like the mean effect could be trivial or small, but it is unlikely to be moderate and is almost certainly not large. Such statements will help get your otherwise apparently inconclusive study into a journal, if the reviewers and editor are enlightened.
The example in the spreadsheet is for a control treatment and two experimental treatments, with equal numbers of subjects on each of the six possible treatment sequences. For a study with a single experimental treatment, ignore or delete the columns relating to the second experimental treatment. Add more columns to add more treatments. You can also use the spreadsheet to analyze a time series, in which all subjects receive the same treatment sequence. This design obviously isn't as good as a crossover, but it can still provide valuable information if you do multiple trials before, during and/or after the treatment.
The analyses are based on the paired t statistic. For convenience, the t statistic is generated by comparing a column of change scores with a column of zeros rather than by comparing directly the columns of scores for each treatment. This approach also allows you to include and analyze a column representing any effect derived from within-subject modeling, such as a slope representing a gradual change in a time series, or parameters describing a dose-response relationship when the treatments differ only in dose.
Plots of change scores in a crossover fulfill the same roles as in a controlled trial: to check for outliers and to check that your chosen transformation makes the effect reasonably uniform across the range of subjects' values for the control treatment (the equivalent of the subjects' values for the pre-test in a controlled trial). Beware that regression to the mean will sometimes be responsible partly or entirely for any trend towards experimental-control changes that get more positive for lower control scores, even in the plots with the most appropriate transformation. To see the effect of the control value on the experimental treatment free of this artifact, one of the treatments in the crossover needs to be a repeat of the control treatment. The plot of the experimental-control change scores for one control against the values for the other control shows the pure effect of the control value on the experimental treatment. Fit a line or a curve to the points to quantify the effect, or do an equivalent analysis with mixed modeling.
Lack of a control group in a crossover precludes estimation of reliability, but I have included estimates of the typical error derived from the changes between treatments. This error is still formally a measure of reliability. If you compare your values with those from reliability studies, take into account the possibility that individual responses to one or more of your treatments can inflate the error.
Another potential source of inflation of error in a crossover is a substantial familiarization effect, particularly between the first and second trials. This effect adds to the change scores of subjects who have the control treatment first, whereas it subtracts from those who have it second. If there are equal numbers of subjects in each treatment sequence, there is no nett effect on the mean change score (the treatment effect), but the random error increases. Analysis with mixed modeling allows you to estimate the magnitude of the familiarization effect and to eliminate this source of error by including the order of the treatments as a within-subject nominal covariate in the fixed-effects model. You end up with a more precise estimate (narrower confidence interval) for the treatment effect. Mixed modeling also provides a check on the balancing of the assignment. Get your subjects to repeat the control treatment, preferably in a balanced manner, and it will also provide you with estimates of reliability and individual responses.
In conclusion, the spreadsheets should help you analyses your data appropriately. Admittedly, you can't include covariates in the analysis, but the spreadsheet should still help you do the right things when you use a more sophisticated statistical package.
Link to reviewer's comment.
Hopkins WG (2000a). Quantitative research design. Sportscience 4(1), sportsci.org/jour/0001/wghdesign.html (4318 words)
Hopkins WG (2000b). Getting it wrong. In: A New View of Statistics. newstats.org/errors.html
Hopkins WG (2000c). Measures of reliability in sports medicine and science. Sports Medicine 30, 1-15 (download reprint)
Hopkins WG (2002a). A scale of magnitudes for effect statistics. In: A New View of Statistics. newstats.org/effectmag.html
Hopkins WG (2002b). Probabilities of clinical or practical significance. Sportscience 6, sportsci.org/jour/0201/wghprob.htm (638 words)
Hopkins WG (2003a). Other repeated-measures models. In: A New View of Statistics. newstats.org/otherrems.html
Hopkins WG (2003b). Models: important details. In: A New View of Statistics. newstats.org/modelsdetail.html
Hopkins WG (2003c). Log transformation for better fits. In: A New View of Statistics. newstats.org/logtrans.html. (See also the pages dealing with other transformations.)
Hopkins WG (2003d). Regression to the mean. In: A New View of Statistics. newstats.org/regmean.html