A New View of Statistics 

I deal first with confidence limits, which are the simplest and best way to understand generalization. Bootstrapping, meta-analysis, and Bayesian analysis are applications of confidence limits that I include on this page. On the next page are the related concepts of p values and statistical significance, followed by Type I and II errors and a mention of bias. You can also download a slideshow that deals with all the material on these three pages, and more.
The second section is devoted to how we use statistical models or tests to generalize the relationships between variables. To generalize properly you need a sample of adequate size, so I deal with methods for estimating sample size in the final section.
The only way you can really get the population value is to measure everyone in the population. Even if that was possible, it would be a waste of resources. But it is possible to use your sample to calculate a range within which the population value is likely to fall. "Likely" is usually taken to be "95% of the time," and the range is called the 95% confidence interval. The values at each end of the interval are called the confidence limits. All the values between the confidence limits make up the confidence interval. You can use interval and limits almost interchangeably.
Learn this plain-language definition: the confidence interval is the likely range of the true value. Note that there is only one true value, and that the confidence interval defines the range where it's most likely to be. The confidence interval is NOT the variability of the true value or of any other value between subjects. It is nothing like a standard deviation. If there are individual differences in the outcome, then there is more than one true value, but we'll deal with that later.
Another important concept embodied in confidence limits is precision of estimation. The wider the confidence interval, the less the precision. Research is all about getting adequate precision for things like a correlation coefficient, a difference in the mean between groups, the change in a mean following a treatment, and so on.
Notice that the confidence limits in the above example are not spaced equally on each side of the observed value. That happens with non-normally distributed statistics like the correlation coefficient. Most other statistics are normally distributed, so the observed value falls in the middle of the confidence interval. For example, an observed enhancement in performance of 2.3% could have confidence limits of 1.3 to 3.3%. In such cases, you can use a ± sign to express the outcome in the following way: the enhancement was 2.3%, and the likely range (or confidence interval or limits) was ±1.0%. Of course, you mean by this that the limits are 2.3 − 1.0 and 2.3 + 1.0.
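For a normally distributed statistic, the ± arithmetic above is all there is to it. Here is a minimal sketch in Python; the enhancement (2.3%) and its standard error (0.51%) are made-up numbers for illustration:

```python
# Symmetric 95% confidence limits for a normally distributed statistic.
# The observed enhancement and its standard error are assumed values.

def symmetric_ci(observed, standard_error, z=1.96):
    """Return (lower, upper) 95% confidence limits: observed +/- z*SE."""
    half_width = z * standard_error
    return observed - half_width, observed + half_width

lower, upper = symmetric_ci(2.3, 0.51)
print(f"enhancement 2.3%, likely range {lower:.1f}% to {upper:.1f}%")
# prints: enhancement 2.3%, likely range 1.3% to 3.3%
```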
The lower and upper confidence limits need to be interpreted separately. The lower (or numerically smaller) limit shows how small the effect might be in the population; the upper limit shows how large the effect might be. Of course, you'll never know whether it really is that small or big unless you go out and measure the whole population. Or more subjects, anyway. Which brings us to the next important point: the more subjects, the narrower the confidence interval.
Notice that you can't say anything useful about the population correlation when the sample has only 4 subjects. Already with 16 subjects you get the idea that it could be moderately positive. With 64 subjects the correlation is definitely positive and probably large, although it could also be moderate. The sample of 256 nails it as a large effect, and 1024 subjects give more precision than you usually need. The conclusions I have shown in the above figure are only approximate. Since drawing this figure, I have come up with an exact approach to making conclusions like "probably large". See below.
We have a couple of plainlanguage ways of talking about something that is statistically significant: we say that the true value is unlikely to be zero, or that there is a real effect. These aren't bad ways to think about statistical significance, and you can sort of understand them by looking at the above figure, but they're not strictly correct. After all, the true value of something is never exactly zero anyway. I'll pick this issue up on the next page, under hypothesis testing.
The value for a statistic corresponding to no effect in the population is called the null value. For correlations and changes in the mean, the null value is zero. If the outcome statistic is a relative risk or odds ratio, the null value is 1 (equal risk or odds). So for these statistics, the result is statistically significant if the confidence interval does not overlap the null value of 1.
Unfortunately, some stats programs don't provide confidence limits, but they all provide p values. I've therefore made a spreadsheet to calculate confidence limits from a p value, as explained on the next page. The calculation works for any normally distributed outcome statistic, such as the difference between means of two groups or two treatments. I've included calculations for confidence limits of relative risks and odds ratios, correlations, standard deviations, and comparison (ratio) of standard deviations.
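The core of that calculation can be sketched in a few lines, assuming (as described above) a normally distributed outcome statistic. This is not the spreadsheet itself, just the underlying idea; the effect and p value are made-up numbers:

```python
# Recover 95% confidence limits from an observed effect and its
# two-tailed p value, assuming the effect is normally distributed.
from statistics import NormalDist

def ci_from_p(effect, p, coverage=0.95):
    """Back out the standard error from the p value, then form the limits."""
    z_p = NormalDist().inv_cdf(1 - p / 2)        # z corresponding to the p value
    se = abs(effect) / z_p                       # implied standard error
    z_ci = NormalDist().inv_cdf((1 + coverage) / 2)
    return effect - z_ci * se, effect + z_ci * se

lower, upper = ci_from_p(2.3, 0.03)              # e.g. a 2.3% effect with p = 0.03
```

Note that with p exactly 0.05 the lower 95% limit lands exactly on zero, which is the connection between p values and confidence limits picked up on the next page.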
I've also added columns to give chances of clinically or practically important effects. Make sure you come to terms with this stuff. It is more important than p values.
Update Oct 2007: the spreadsheet now generates customizable clinical and mechanistic inferences, consistent with an article on inferences in Sportscience in 2005. The inferences are also consistent with an article on sample-size estimation in Sportscience in 2006.
Spreadsheet for confidence limits and inferences: Download 
For example, you might want to use a fitness test in a large study, so you do a pilot first to see which of two tests is better. The tests might be submaximal exercise tests to determine maximum oxygen uptake. "Better" would mean the test with higher validity, in other words the test with the higher correlation with true maximum oxygen uptake. So you might get a sample of 20 subjects to do the two tests and a third maximal test for true maximum oxygen uptake. The validity correlations turn out to be 0.71 and 0.77. Sure, use the test with the higher correlation, but what if it's more difficult to administer? Now you begin to wonder if the tests are really that different. The difference is 0.06. That's actually a trivial difference, and if it was the real difference, it wouldn't matter which test you used. But the observed difference is never the real difference, and that's why we need confidence intervals. If the confidence interval was 0.03 to 0.09, you'd be satisfied that one test is a bit better than another, but that it still doesn't really matter, and you would choose the easier test. If the confidence interval was 0.11 to 0.23, you couldn't be confident about which test is better. The best decision then would be to test more subjects to narrow down the confidence interval.
Anyway, bootstrapping is how you can get the confidence interval. The term bootstrapping refers to the old story about people lifting themselves off the ground by pulling on the backs of their own boots. A similar seemingly impossible thing occurs when you resample (to describe it more formally) to get confidence intervals. Here's how it works.
For a reasonably representative sample of maybe 20 or more subjects, you can recreate (bootstrap) the population by duplicating the sample endlessly. Sounds immoral, if not impossible, but simulations have shown that it works! Next step is to draw, say, 1000 samples from this population, each of the same size as your original sample. In any given sample, some subjects will appear twice or more, while others won't be there at all. No matter. Next you calculate the values of the outcome statistic for each of these samples. In our example above, that would be the difference between the correlations. Finally, you find the middle 95% of the values (i.e. the 2.5th percentile and the 97.5th percentile). That's the 95% confidence interval for your outcome! Cool, eh?
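The resampling procedure can be sketched in a few lines of Python. This is a generic percentile bootstrap, not any particular stats program; the sample data and the choice of the mean as the outcome statistic are assumptions for illustration:

```python
# Percentile bootstrap: resample with replacement, compute the statistic
# for each resample, then take the middle 95% of the resulting values.
import random
import statistics

def bootstrap_ci(sample, statistic, n_boot=1000, seed=1):
    """Return the (2.5th, 97.5th) percentiles of the bootstrapped statistic."""
    rng = random.Random(seed)
    values = sorted(
        statistic(rng.choices(sample, k=len(sample)))  # same size as original
        for _ in range(n_boot)
    )
    return values[int(0.025 * n_boot)], values[int(0.975 * n_boot) - 1]

# Hypothetical sample of 20 observations; the outcome statistic is the mean.
sample = [2.1, 2.5, 1.9, 3.0, 2.7, 2.2, 2.8, 2.4, 2.6, 2.0,
          2.9, 2.3, 2.5, 2.2, 2.7, 2.1, 2.6, 2.4, 2.8, 2.3]
lower, upper = bootstrap_ci(sample, statistics.fmean)
```

For the two-correlations example, the statistic passed in would instead compute the difference between the two validity correlations for each resample of subjects.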
The median value from your 1000 samples should be virtually the same as the value from the original sample. If it's not, something is wrong. Sometimes the variables have to be transformed in some way to get over this problem. For example, to get the confidence interval for the difference between correlation coefficients, you first have to convert the correlations using something called the Fisher z transformation: z = 0.5·log[(1 + r)/(1 − r)]. This equation looks horribly complicated, but all it does is make the correlations extend out beyond the values ±1.0. It makes them behave like normally distributed variables.
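Here is the transformation and its inverse as code; the back-transformation is just the hyperbolic tangent:

```python
# Fisher z transformation of a correlation coefficient, and its inverse.
import math

def fisher_z(r):
    """Map a correlation in (-1, 1) onto an unbounded, roughly normal scale."""
    return 0.5 * math.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform: tanh(z) recovers the correlation."""
    return math.tanh(z)

# Correlations near +/-1 get stretched out well beyond 1.0:
# fisher_z(0.9) is about 1.47, fisher_z(0.99) is about 2.65.
diff_z = fisher_z(0.77) - fisher_z(0.71)   # the difference you would bootstrap
```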
How do you "duplicate endlessly" to recreate the population? Actually you don't duplicate the data set. If your original sample had 20 observations, you use a random number generator in the stats program to select a sample of 20 from these 20. Then you do it again, and again, and again...
At the moment I don't know of a good rule to decide when a sample is big enough to use bootstrapping. Twenty observations seems to be OK. Note, though, that if you have subgroups in your data set that are part of the outcome statistic, you need at least 20 in each subgroup. For example, if you wanted to compare a correlation in boys and girls, you would need at least 20 boys and 20 girls.
And now for a test of your understanding. If you can recreate the population by duplicating the sample endlessly, why bother with all that resampling stuff? Why not just work out the value of the statistic you want from, say, a one-off sample of a million observations taken from this population? With a million observations, it will be really accurate! Answer: Well, ummm... the value you calculate from a million observations will be almost exactly the same as the value from your original sample of 20. You're no better off. OK, it was a silly question.
The finding in a meta-analytic study is the mean effect of all the studies, with an overall confidence interval. In deriving the mean, more weight is given to studies with better designs: more subjects, proper random selection from the population, proper randomization to any experimental and control groups, double blinding, and low dropout rate. Studies that don't meet enough criteria are sometimes excluded outright from the meta-analysis.
Whenever you read a meta-analysis involving longitudinal (experimental) studies, check to make sure the statistician used the correct standard deviation to calculate the effect size. It should always be the average standard deviation of the before and/or after scores. Some statisticians have used the standard deviation of the before-after difference score, which can make the effects look much bigger than they really are.
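A small sketch of the difference this choice makes, with made-up before and after scores for eight subjects. When before and after scores are highly correlated, the standard deviation of the difference scores is much smaller than that of the scores themselves, so dividing by it inflates the effect size:

```python
# Standardized effect size from a pre-post study, computed two ways.
import statistics

# Hypothetical before and after scores for 8 subjects (made-up data).
pre  = [50.0, 52.0, 48.0, 51.0, 49.0, 53.0, 47.0, 50.0]
post = [52.5, 54.0, 50.5, 53.5, 51.0, 55.5, 49.5, 52.5]

changes = [b - a for a, b in zip(pre, post)]
mean_change = statistics.fmean(changes)

# Correct: average standard deviation of the before and after scores.
sd_scores = (statistics.stdev(pre) + statistics.stdev(post)) / 2
# Incorrect: standard deviation of the before-after difference scores.
sd_change = statistics.stdev(changes)

d_correct = mean_change / sd_scores
d_wrong = mean_change / sd_change   # much bigger, because pre and post are highly correlated
```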
Suppose you're interested in the effect of a certain drug on performance. You study this problem by conducting a randomized controlled trial on a sample of a population. You end up with confidence limits for the true effect of the drug in the population. If you're a frequentist you publish the confidence limits. But if you're a Bayesian, you also factor in your prior belief about the efficacy of the drug, and you publish credibility limits representing your posterior (updated) belief. For example, you might have believed the drug had no effect (0.0%), and you were really skeptical, so you gave this effect confidence limits of −0.5% to +0.5%. You then did the study and found a positive effect of 3.0%, with confidence limits of 1.0% to 5.0%. Combine those with your prior belief and you end up with a posterior belief that the effect of the drug is 0.6%, with credibility limits of −1.0% to 3.2%. Let's assume a marginal effect is 1%, a small effect is 3%, and a moderate effect is 5%. A Bayesian concludes (from the credibility limits of −1.0% to 3.2%) that the drug has anything from a marginal negative effect to a small positive effect. A frequentist concludes (from the confidence limits of 1.0% to 5.0%) that the drug has anything from a marginal positive to a moderate positive effect.
There are formal procedures for combining your prior belief with your data to get your posterior belief. In fact, the procedure works just like a metaanalysis of two studies: the first study is the one you've just done to get an observed effect with real data; the other "study" is your prior belief about what the effect was. The observed effect and your belief are combined with weighting factors inversely proportional to the square of the widths of their confidence intervals. For example, if you have a very strong prior belief, your confidence (= credibility) interval for your belief will be narrow, so only a markedly different observed effect with a narrow confidence interval will change your belief. On the other hand, if you are not at all sure about the effect, your confidence interval for your prior belief will be wide, so the confidence limits for your posterior belief won't be much different from those provided by the data. To take this example to an extreme, if you have no prior belief, the posterior confidence limits are identical to those provided by the data.
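The meta-analytic combination described above can be sketched as follows, assuming both the prior belief and the data are summarized as a mean with symmetric 95% limits; the numbers in the usage example are invented for illustration and are not the ones from the drug example:

```python
# Normal-normal updating: combine a prior belief and observed data with
# weights inversely proportional to the squares of their CI widths
# (i.e. inverse-variance weighting, as in a two-study meta-analysis).
import math

Z95 = 1.96  # two-tailed 95% point of the normal distribution

def combine(prior_mean, prior_limits, data_mean, data_limits):
    """Return the posterior mean and its 95% credibility limits."""
    se_prior = (prior_limits[1] - prior_limits[0]) / (2 * Z95)
    se_data = (data_limits[1] - data_limits[0]) / (2 * Z95)
    w_prior, w_data = 1 / se_prior**2, 1 / se_data**2
    post_mean = (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)
    post_se = math.sqrt(1 / (w_prior + w_data))
    return post_mean, (post_mean - Z95 * post_se, post_mean + Z95 * post_se)

posterior_mean, (lower, upper) = combine(
    prior_mean=0.0, prior_limits=(-1.0, 1.0),   # a fairly skeptical prior
    data_mean=2.0, data_limits=(0.5, 3.5))      # a hypothetical observed effect
```

Note how the posterior mean lands between the prior and the data, pulled toward whichever has the narrower interval, and the posterior interval is narrower than either. A very wide prior leaves the posterior essentially identical to the data, as described above.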
A positive aspect of the Bayesian approach is that it encapsulates the manner in which we assimilate research findings. New evidence that agrees with our preconceived notions reinforces our beliefs, whereas we tend to disregard evidence that flies in the face of our cherished prejudices or has no apparent mechanism. Sure, but even as a frequentist you can tackle these issues qualitatively in the Discussion section of your paper. If you try to quantify your prior belief, you run into two problems. First, your belief and the real data are combined with weighting factors, but they are otherwise on an equal footing. That's acceptable to a frequentist only if it's quite clear that the outcome of the Bayesian analysis is still only a belief, not a real effect. Secondly, exactly how do you convert a belief into a quantitative effect, and how do you give it confidence limits? (Bayesians give their belief a complete probability distribution, but the principle is the same.) You could, and probably do, base the belief on the results of other studies, but you might just as well meta-analyze these other studies to get your prior "belief". In that case, though, your posterior "belief" will be identical to a meta-analysis of all the studies, including the one you've just done. In other words, it's not a Bayesian analysis any more.
Bayesian analysis may be justified where a decision has to be made with limited real data. The prior belief could be the average belief of several experts. When I hear of a specific example, I will update this page. Meanwhile, click here for a response to this section from Mike Evans, a Bayesian.