A New View of Statistics 

I also advocate a new approach to sample-size estimation based on the width of
the confidence interval of your outcome measure. In this new approach, your
concern is with the precision of your estimate of the effect, not with the statistical
significance of the effect. The formulae on these pages still apply, but you
halve the sample sizes.
Let's look at a simple example of the smallest effect worth detecting. Your research project includes the question of differences in height of adults in two regions. This sounds like a trivial project, but hey, the difference might be caused by a nutritional deficit, environmental toxin, level of physical activity, or whatever. OK, what difference in height would you consider to be the smallest difference worth noticing or commenting on? Almost everyone reading this paragraph will automatically start thinking either in inches or centimeters. So what's your choice? An inch, or 2.5 cm? Sounds like a nice round figure! Let's go with it for now.
To use my approach to sample-size estimation, you convert this difference into a value for the effect-size statistic. To do that, you divide it by the standard deviation, expressed in the same units. The standard deviation here is just the usual measure of spread, except that we have two groups. So let's assume we have an average of the standard deviation in both groups. Let's say it is 2 inches, or 5 cm. So, if you want to detect 2.5 cm, and the standard deviation is 5.0 cm, the smallest effect worth detecting is 2.5/5.0, or 0.5.
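The conversion just described is a single division. Here is a minimal sketch; the function name `effect_size` is made up for illustration and is not part of the original page.

```python
def effect_size(difference: float, sd: float) -> float:
    """Standardized effect: raw difference divided by the standard deviation.

    Both arguments must be in the same units (e.g. both in cm).
    """
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return difference / sd


# The height example: 2.5 cm difference, 5.0 cm standard deviation.
print(effect_size(2.5, 5.0))  # 0.5
# The same example in inches gives the same unit-free value.
print(effect_size(1.0, 2.0))  # 0.5
```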
I'll talk about what I mean by detecting in a minute. First, more about
the smallest effect. You'll discover shortly that the required number of subjects
is quite sensitive to the magnitude of the smallest worthwhile effect. In fact,
halving the magnitude quadruples the number of subjects required to detect it.
So the way you decide on the smallest effect is important. How did we arrive
at that minimum difference of 2.5 cm? In my experience, most researchers dream
up a number that sounds plausible, just like we did here. Well, sorry, but you
just can't do it like that. In fact, you don't have the freedom to choose the
minimum effect. In all but a few special cases, it's the threshold for small
effects on the scale of magnitudes: 0.2 for the
Cohen effect-size statistic, 10% for a frequency difference, and 0.1 for a correlation.
You need the same sample size to detect each of these effects, and as we'll
see, it's 800 subjects for a simple cross-sectional study in the old-fashioned
way of doing the figuring. It's even more than 800 when you factor in the validity
of your variables. But don't panic. We'll also see that there are ways of reducing
this number, sometimes drastically.
First, statistical significance. The difference is statistically significant, by definition, if the 95% confidence interval does not overlap zero, or if the p value for the effect is less than 0.05. Values of 95% or 0.05 are also equivalent to a Type I error rate of 5%: in other words, the rate of false alarms in the absence of any population effect will be 5%. We don't have any choice here. It has to be 5% or, preferably, less, but most researchers opt for 5%. If you want a lower rate of false alarms, say 1%, you will need more subjects.
Now, what about being sure that the effect will turn up? In other words, if
the effect really is 2.5 cm in the populations, how sure do we want to be that
the difference observed in our sample will be statistically significant? We
don't have any choice here, either. We have to be at least 80% sure of detecting
the smallest effect. To put it another way, the power of the study to
detect the smallest effect has to be at least 80%. Or to put it yet one more
way, the Type II error rate (the rate of failed
alarms for the smallest effect) is set at 20% or less. That's one chance in
five of missing the thing you're looking for!?! Sounds a bit high, but keep
in mind that it is the rate for the smallest worthwhile effect. The chance
of missing larger effects is smaller. Once again, if you want to make the error
rate lower, say 10%, you will need more subjects.
Cross-sectional designs include correlational, case-control, and any other design with single observations for each subject. Some so-called prospective designs, where subjects are followed up over time, are cross-sectional if there is only one value for each variable for each subject. Cross-sectional studies need heaps of subjects, and the number is affected by the validity of the variables.
Longitudinal designs include time series, experiments, controlled
trials, crossovers, and anything else where the dependent variable is
measured twice or more. The data have to be subjected to repeated-measures
analysis. The usual thing with these designs is a measurement
before and after you do something, to see if what you do has any
effect. Whether or not you have a control group, it is always the
case that subjects "act as their own controls", because there are
always pre and post measurements on the subjects. Longitudinal
designs generally need far fewer subjects than cross-sectional
designs, depending on the reliability of the
dependent variable.
Sample Size for Cross-Sectional Studies
For variables with perfect validity, you can now look up
tables or run special software to see how many subjects you need.
(G*power
is a great little free program for the purpose.) Or use the following
simple formula I have worked out:
For Type I and II errors of 5% and 20%, the total number of subjects N is given by:
N = 32/ES^{2}, where ES is the smallest effect size worth detecting.
Example: for ES = 0.2, the total N is 800, which means 400 in each group for a case-control study or a study comparing males and females. So for our study of differences in height, we'd need 400 in each group.
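A short sketch of the formula N = 32/ES^{2}, mainly to show how quickly N grows as the smallest effect shrinks. The function name `total_n` and the rounding-up to a whole subject are choices made for this illustration.

```python
import math


def total_n(es: float, constant: float = 32.0) -> int:
    """Total number of subjects, N = constant / ES^2, rounded up.

    constant=32 corresponds to Type I and II errors of 5% and 20%.
    """
    if es <= 0:
        raise ValueError("effect size must be positive")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(constant / es**2, 6))


print(total_n(0.2))        # 800 in total
print(total_n(0.2) // 2)   # 400 in each of two equal groups
print(total_n(0.5))        # 128
print(total_n(0.25))       # 512: halving the effect quadruples N
```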
What if the outcome is a difference in the frequency of something in the two groups, for example the frequency of clinical obesity? The minimum worthwhile difference is 10% (e.g. 25% in one group and 35% in the other). You just think about that difference as being equivalent to an effect size of 0.2, and plug it into the formula: 400 in each group again.
And finally what about sample size to detect a correlation, for example the
correlation between physical activity and body fat? Same story: 800 subjects
to detect the minimum worthwhile correlation of 0.1, because a correlation of
0.1 is equivalent to an effect size of 0.2. For larger correlations use the
scale of magnitudes to convert the correlation
to an equivalent effect size, then plug it into the formula.
For the rare cases where you have the luxury of Type I and II errors of 1% and
10% respectively, the number is nearly double: N = 60/ES^{2}.
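The constants 32 and 60 are close to what the textbook normal approximation for comparing two equal-sized groups gives: N = 4(z_{1-alpha/2} + z_{power})^{2}/ES^{2}. The sketch below checks this using only the Python standard library. It is an assumption that the constants on this page were obtained this way; the function name `sample_size_constant` is made up for the illustration.

```python
from statistics import NormalDist


def sample_size_constant(alpha: float = 0.05, power: float = 0.80) -> float:
    """Constant k in N = k / ES^2 under the two-sided normal approximation.

    Assumes two equal-sized groups and a two-tailed test at level alpha.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for Type I error
    z_power = z.inv_cdf(power)          # value for the desired power
    return 4 * (z_alpha + z_power) ** 2


# Type I and II errors of 5% and 20%: a little under 32
print(round(sample_size_constant(0.05, 0.80), 1))
# Type I and II errors of 1% and 10%: a little under 60
print(round(sample_size_constant(0.01, 0.90), 1))
```

Rounding these constants up to 32 and 60 gives slightly conservative sample sizes.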
Validity of the variables can have a major impact on sample size in cross-sectional studies. The lower the validity, the more the "noise in the signal", so the more subjects you need to detect the signal. If the validity correlation of the dependent variable is v (Pearson, intraclass, or kappa), the number of subjects increases to N/v^{2}.
To detect a correlation between variables with validities v and w,
the number is
N/(v^{2}w^{2}).
Sample sizes may therefore have to be doubled or quadrupled when
effects are represented by psychometric or other variables that have
modest (~0.7) validity.
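To see the doubling and quadrupling, here is a small sketch of the validity adjustment. The function name `adjust_for_validity` is hypothetical; the default w = 1 covers the case where only one variable has imperfect validity.

```python
import math


def adjust_for_validity(n: int, v: float, w: float = 1.0) -> int:
    """Inflate sample size n for validity correlations v and w: n / (v^2 w^2)."""
    if not (0 < v <= 1 and 0 < w <= 1):
        raise ValueError("validity correlations must be in (0, 1]")
    return math.ceil(n / (v**2 * w**2))


# One variable with validity 0.7: roughly double the usual 800
print(adjust_for_validity(800, 0.7))        # 1633
# Two variables, each with validity 0.7: roughly quadruple
print(adjust_for_validity(800, 0.7, 0.7))   # 3332
```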
Sample Size for Longitudinal Studies
In our first example on this page, we had a cross-sectional
design in which we were interested in the difference in height
between people in two regions. Now, in a longitudinal design, we
might want to know whether a stretching exercise makes people taller.
Can you see that the same concept of minimum effect size still holds
here? If we thought one inch was the smallest difference worth
detecting between groups, then it has to be the smallest difference
we would like to see as a result of our stretching exercise. (It
might need a medieval rack to make people a whole inch taller!)
Once again we don't have a choice about that minimum effect: it's still an effect size of 0.2 standard deviations, and the standard deviation is still the usual standard deviation of the subjects. At the moment we have only one group of subjects, and the standard deviation before we put people on the rack is usually about the same as after the rack. So you can think about the minimum effect size as a fraction of either standard deviation. But note well: do not use the standard deviation of the before-after difference score.
Reliability of the dependent variable is the final piece of the jigsaw. The higher the reliability, the more reproducible are the values for each subject when you retest them, which makes it more likely you will detect a change in their values. So the higher the reliability, the fewer subjects you need to detect the minimum effect. Read the earlier section on sample size for an experiment for an overview of the role of typical error in sample-size estimation, and for an important detail about the conditions in a reliability study aimed at estimating sample size.
The rest of this section contains details of formulae that you may not need to worry about. You can use two forms of reliability in the formulae: retest correlation and within-subject variation.
Using the Retest Correlation
First, a couple of cautions. The retest correlation is for retests with
the same time between the tests as you intend to have in your experiment. For
example, if you are doing an intervention that lasts 2 months, you need a 2-month
retest correlation. Don't use a 1-day retest correlation unless you have good
grounds for believing that it will be the same as a 2-month retest correlation.
Also, the spread between the subjects in your study has to be similar to the
spread between the subjects in the reliability study. If the spread is different,
the value of the retest correlation coefficient will be inappropriate. In that
case you will need to calculate the appropriate value by combining the within
(s) and between (S) standard deviations for your subjects using this
formula:
retest correlation r = (S^{2} - s^{2})/S^{2}.
Right, here's the strategy for working out the required sample size when you know the retest correlation:
Using the Within-Subject Variation
You can also think about the difference between the post and pre means
in terms of the within-subject variation (standard deviation). For example,
if the performance of an individual athlete varies by 1% (the within-subject
standard deviation expressed as a coefficient of variation), how many athletes
should you test to detect a 1% change in performance, or a 2% change, or a 0.5%
change? Here is the formula:
Example: You want to detect (p=0.05, 80% power) a 2% change in performance when the coefficient of variation is 2%. The corresponding value of f is 1.0, which means you'd need to test 16 athletes in a crossover design, or 32 in each of a control and experimental group. Or it's 8 or 16+16, if you justify sample size using precision of estimation.
What's the smallest value of f worth detecting? Is it 1.0? Not an easy question! To answer it, you usually have to bring in the between-subject variation one way or another. Why? Because you can't get away from the fact that the magnitude of a change in the value of a variable usually has to be thought about in terms of the variation in the values of that variable between subjects. That's what minimum worthwhile effect sizes are all about. For example, if the between-subject variation is 5%, the smallest difference worth detecting is 0.2*5% or 1%. So, if your within-subject variation is 2%, you have to chase an f of 0.5. But if the between-subject variation is 10%, the smallest worthwhile effect is 0.2*10% or 2%, so you chase an f of 1.0.
Once you bring the between-subject variation back into the picture, you have all the ingredients for expressing the reliability as a retest correlation, so you can use the formulae with the retest correlation. For example, a within of 2% and a between of 5% implies a retest correlation of (5^{2} - 2^{2})/5^{2} or (25 - 4)/25 or 0.84. A within of 2% and a between of 10% implies a correlation of (100 - 4)/100, or 0.96. Use these correlations in the formulae for sample size and you'll get the same answers as in the formulae using f. But if you have a reasonable notion of the smallest worthwhile change in a variable without explicitly knowing the between-subject standard deviation or the correlation, use the formula with d and s (or f).
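The two worked examples can be reproduced with a few lines of code. In this sketch, `retest_correlation` implements r = (S^{2} - s^{2})/S^{2}, and `smallest_f` takes f to be the smallest worthwhile change (0.2 of the between-subject variation) divided by the within-subject variation, which is how f is used in the examples above. Both function names are made up for the illustration.

```python
def retest_correlation(within_sd: float, between_sd: float) -> float:
    """Retest correlation r = (S^2 - s^2) / S^2 from within (s) and between (S) SDs."""
    if between_sd <= 0 or within_sd < 0:
        raise ValueError("need between_sd > 0 and within_sd >= 0")
    return (between_sd**2 - within_sd**2) / between_sd**2


def smallest_f(within_sd: float, between_sd: float, threshold: float = 0.2) -> float:
    """f for the smallest worthwhile change: (threshold * between SD) / within SD."""
    if within_sd <= 0:
        raise ValueError("within_sd must be positive")
    return threshold * between_sd / within_sd


# Within of 2% and between of 5%
print(round(retest_correlation(2.0, 5.0), 2), smallest_f(2.0, 5.0))    # 0.84 0.5
# Within of 2% and between of 10%
print(round(retest_correlation(2.0, 10.0), 2), smallest_f(2.0, 10.0))  # 0.96 1.0
```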
There is certainly one situation where it's better to use the within-subject
variation: estimation of sample size in studies of athletic performance. When
athletes are subjects and competitive performance is the outcome, the smallest
worthwhile effect is an enhancement that increases the medal prospects of a
top athlete, not the average athlete. For sports like track and
field, this minimum effect is about 0.5 of the typical variation in a top athlete's
performance between events. For example, if the typical variation between events
is 1.0%, then you're interested in enhancements of about 0.5%. So if you use
a lab test with the same typical error as the competitive event, f in the above
formulae is simply 0.5, so you would need 64/0.5^{2},
or 256 subjects for a fully controlled study. That's bad enough, but if your
lab test has a typical variation of 2.0%, f is 0.5/2.0, which means 1024 subjects!
Oh no! Clearly you need very reliable lab tests if you want to detect the smallest
effects that matter to top athletes. See this Sportscience article for more
information:
Hopkins WG, Hawley JA, Burke LM (1999). Researching worthwhile
performance enhancements. Sportscience 3, sportsci.org/jour/9901/wghnews.html
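The athlete examples can be checked with a small sketch. The 64/f^{2} for a fully controlled study is taken from the text above; the 16/f^{2} for a crossover is inferred from the worked example (16 athletes at f = 1.0) and should be treated as an assumption, as should the function name `n_from_f`. The `precision` flag halves the number, following the precision-of-estimation approach.

```python
import math


def n_from_f(f: float, design: str = "controlled", precision: bool = False) -> int:
    """Total subjects needed to detect a change of f within-subject SDs.

    design: "controlled" (64/f^2, split between control and experimental
    groups) or "crossover" (16/f^2, inferred from the worked example).
    precision: if True, halve the number (precision-of-estimation approach).
    """
    constants = {"controlled": 64.0, "crossover": 16.0}
    if design not in constants:
        raise ValueError("design must be 'controlled' or 'crossover'")
    if f <= 0:
        raise ValueError("f must be positive")
    n = constants[design] / f**2
    if precision:
        n /= 2
    return math.ceil(round(n, 6))


print(n_from_f(1.0, "crossover"))         # 16
print(n_from_f(1.0, "controlled"))        # 64, i.e. 32 + 32
print(n_from_f(1.0, "crossover", True))   # 8
print(n_from_f(0.5))                      # 256
print(n_from_f(0.5 / 2.0))                # 1024
```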
Sample Size for Complex Cross-Sectional Studies
I'll deal with two groups of unequal size, more than two
groups, and more than one independent variable. Anything else
requires simulation.
Two Groups of Unequal Size
Up to this point I have assumed equal numbers in each group,
because that gives the most power to detect a difference between the
groups. But sometimes unequal numbers are justified.
The simplest case is where you have far more in one group than another. For example, you already have the heights for thousands of control subjects from all over the country, and you want to compare these with the heights of people from a particular region you are interested in. So, how many subjects do you need in that particular group? And the answer is... as few as one-quarter the usual number! But you will need to test, or have the data for, an "infinite" number of subjects in the other group for the number to be that low. How big is infinite? For the purposes of statistical power, about 5 times as many as in the special-interest group is close enough.
I have a formula, but to understand how to apply it will need a lot of thought. If you have samples of size n_{1} and n_{2}, then your study will have the power equivalent to a study with a sample size of N equally divided between two groups, where:
N = 4 n_{1} n_{2}/(n_{1} + n_{2})
For example, if you have data for 1000 controls (= n_{1}), and 800 (= N) is the
number you would normally require for equal-sized groups, then the
above formula shows that you need to test only 250 cases (= n_{2}). If you make
n_{1} very large, the formula simplifies to N = 4 n_{2}, or n_{2} = N/4, which is
one-quarter the usual total number.
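Applying the formula usually means solving it for n_{2}, which gives n_{2} = N n_{1}/(4 n_{1} - N). A sketch, with function names (`equivalent_n`, `cases_needed`) chosen for this illustration:

```python
import math


def equivalent_n(n1: float, n2: float) -> float:
    """Equal-groups total N with the same power as unequal groups n1 and n2."""
    return 4 * n1 * n2 / (n1 + n2)


def cases_needed(n_equal: float, n1: float) -> int:
    """Solve N = 4 n1 n2 / (n1 + n2) for n2, given N and the size of group 1."""
    denom = 4 * n1 - n_equal
    if denom <= 0:
        # n1 <= N/4: no finite n2 can make up the shortfall
        raise ValueError("n1 must exceed one-quarter of the equal-groups total")
    return math.ceil(round(n_equal * n1 / denom, 6))


print(cases_needed(800, 1000))   # 250
print(equivalent_n(1000, 250))   # 800.0
```

The check in `cases_needed` reflects the limit described above: the smaller group can never drop below one-quarter of the usual total, however large the other group is.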
More Than Two Groups
Suppose we wanted to compare the heights of people in more
than two regions. What should we do about the sample size? Do we need
more than 400 in each region, less than 400, or just 400? And the
answer is... it depends on what estimates
or contrasts you want to perform.
If you are interested in comparing one particular region with another particular region, you will still need 400 in each of those regions to keep the same power to detect a difference. The fact that you have all those other regions in the analysis matters not a jot, I'm afraid. They don't increase the power of the design unless the number in each region is about 10 or less, which it never should be!
If you are interested in comparing one particular region with the mean of every other, you've got the usual two-group design, but with 400 subjects in the region of interest and 400 divided up equally into the other regions.
If you want to do every possible comparison between pairs of
regions, or between pairs of groups of regions, things start to get
complicated. As far as I can see, with six regions, say, only five
completely independent comparisons are possible. So if you are
concerned about inflation of the Type
I error, you will need to apply Bonferroni's
correction by reducing the p value to 0.05/5, or 0.01. Alas, a
smaller p value means a bigger sample size. It's difficult to work
out exactly what it should go up to, because somehow or other the
inflated Type II error should also
be taken into account. Certainly, nearly doubling the group size from
the usual 400 would be a good start in this example, because as we've
already seen on this page, that would be
equivalent to a p value of 0.01 and a Type II error of 10%, instead
of the usual 0.05 and 20%.
More Than One Independent Variable
Suppose you intend to measure half a dozen things like age,
sex, body fat, whatever, and you want to know the effect of each of
them on severity of injury in a particular sport. How many subjects
do you need?
Before we get clever with complex models for this question, let's take in the big view. If we treat each variable as a separate issue, it should be obvious that there will be a problem with inflation of the Type I error: none of the variables you've measured might predict severity of injury in the population, but if you have enough variables, there's a good chance one will predict injury in your sample. So you'll need to reduce your p value using Bonferroni's 0.05/n, where n is the number of independent variables. This correction will be too severe if the independent variables are correlated, but I don't know how to adjust for that.
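Bonferroni's correction is a one-line calculation. A minimal sketch, with a hypothetical function name, applied to the half-dozen predictors of this example and to the five independent comparisons of the six-region example:

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test p-value threshold under Bonferroni's correction: alpha / n_tests."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return alpha / n_tests


# Six independent variables: each must reach p < 0.05/6
print(round(bonferroni_threshold(6), 4))
# Five independent comparisons among six regions: p < 0.05/5
print(round(bonferroni_threshold(5), 4))
```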
When you analyze the data, you should look at the effect of the independent variables separately to start with, but you will also end up using multiple linear regression, analysis of covariance, or some other complex model, with all the independent variables on the right-hand side of the model. As I explained on the first page devoted to complex models, you are now asking a question about how much each variable contributes to the severity of injury in the presence of (when you control for) the others. How many subjects do you need to answer this question? Theoretically the extra independent variables shouldn't make much difference, but I've checked by simulation to make sure. You need one extra subject for each extra independent variable. With five extra variables, that makes five extra subjects. Forget it. With a thousand or so subjects, five won't make any difference.
Here's a different problem involving more than one independent variable, where you don't have to worry about increasing the sample size to reduce the Type I error. Suppose you are currently predicting competitive performance from four lab and field tests, and you want to know whether it's worth adding an expensive fifth test to the test battery. For this sort of problem, you would model the data by doing a multiple linear regression, with the expensive test as the last independent variable in the model. So, how many subjects? It's a specific extra variable in this case, so there is no inflation of the Type I error, so the sample size is still about 800. But if all the field tests were in there on an equal footing, and you wanted to know which ones to drop out of the test battery, then it's back to the bigger sample size of the previous example. In this case you'd use stepwise regression with a reduced p value for entry of variables into the model.