A New View of Statistics

© 1997 Will G Hopkins


Summarizing Data:
 Simple Statistics: THE MIDDLE
A statistic that represents the middle of the data is called a measure of centrality. The best is the mean, or average: just add up all the numbers and divide by the sample size. It is the best measure partly because it uses more information in the data than any other measure of centrality.
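Here is a minimal sketch of that calculation in Python. The weights are made-up values for illustration, and the function name mean is my own choice.

```python
# Hypothetical body weights in kg (illustrative values, not real data).
weights = [61.0, 64.5, 66.0, 68.5, 70.0, 74.0]


def mean(values):
    """Add up all the numbers and divide by the sample size."""
    return sum(values) / len(values)


print(f"mean = {mean(weights):.1f} kg")
```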

The median, or "middle" number, can be useful for data with a non-normal distribution. To work it out, arrange the numbers in rank order (smallest to largest), then count in from one end until you find the middle. (If the sample size is an even number, take the average of the two middle numbers.) The median is barely affected by outliers, which is a big point in its favor. But if you're interested in getting an estimate of the center of a population or of a subgroup of a population (and you usually are), the median is a coarse or "noisy" measure.
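The ranking procedure can be sketched in a few lines of Python. The data are made up, and the mistyped value 740.0 is a hypothetical data-entry error added to show how differently the mean and the median react to an outlier.

```python
def median(values):
    """Rank the numbers, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical weights in kg; in the second list 74.0 has been mistyped as 740.0.
weights = [61.0, 64.5, 66.0, 68.5, 70.0, 74.0]
with_typo = weights[:-1] + [740.0]

for label, data in (("clean", weights), ("with typo", with_typo)):
    mean = sum(data) / len(data)
    print(f"{label}: mean = {mean:.1f}, median = {median(data):.2f}")
```

In this example the typo drags the mean far from the rest of the data, while the median does not move at all.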

The mode, or most frequent number, is the only other measure of centrality you'll ever encounter. I've never used it.
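For completeness, here is one way the mode could be found in Python. This is only a sketch: the function name modes is my own, and returning every value that ties for most frequent is an assumption about how to handle ties.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), sorted; ties give more than one mode."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(modes([61, 65, 65, 67, 70]))
```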

 Simple Statistics: THE SPREAD
Some statistics give an idea of spread, variation, or dispersion of the numbers. The simplest measure of spread is the range, expressed either as the biggest and smallest number in the data (e.g. 61-74), or as the difference between the biggest and smallest (e.g. 13).
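Both ways of expressing the range can be sketched as follows. The five values are made up so that they reproduce the 61-74 example, and the function name value_range is hypothetical.

```python
def value_range(values):
    """Return the smallest value, the biggest value, and their difference."""
    smallest, biggest = min(values), max(values)
    return smallest, biggest, biggest - smallest


# Hypothetical data chosen to match the 61-74 example.
values = [61, 65, 67, 70, 74]
smallest, biggest, width = value_range(values)
print(f"range: {smallest}-{biggest} (difference {width})")
```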

The range is a bad measure of spread, for two reasons. First, it's dictated by outliers, whether they're errors in data entry or genuine values. Second, the range is dependent on the size of your sample: the more numbers, the bigger the range is likely to be. Two measures of spread that avoid these problems are the standard deviation (SD) and percentile ranges. I'll deal with these separately, and with these other measures of variation: the root mean square error (RMSE) and the standard error of the estimate (SEE). I explain on a separate page why the standard error of the mean is a measure of spread you should not use.
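The dependence of the range on sample size is easy to see in a small simulation. The sketch below assumes normally distributed data with a mean of 67.8 and an SD of 3.6 (arbitrary illustrative values), and the function name, number of trials, and seed are my own choices. It uses the sample SD from Python's statistics module as the comparison.

```python
import random
import statistics


def average_spread(sample_size, trials=200, seed=1):
    """Average range and average sample SD over many simulated samples."""
    rng = random.Random(seed)
    ranges, sds = [], []
    for _ in range(trials):
        # Assumed population: normal, mean 67.8, SD 3.6 (illustrative only).
        sample = [rng.gauss(67.8, 3.6) for _ in range(sample_size)]
        ranges.append(max(sample) - min(sample))
        sds.append(statistics.stdev(sample))
    return statistics.mean(ranges), statistics.mean(sds)


for n in (10, 100, 1000):
    avg_range, avg_sd = average_spread(n)
    print(f"n = {n:4d}  average range = {avg_range:5.1f}  average SD = {avg_sd:4.2f}")
```

Under these assumptions the average range should keep growing as the sample gets bigger, while the average SD should stay close to the population value.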

The statistics most people use to describe a set of numbers are sample size, mean, and standard deviation. All you need to define the shape of the normal distribution is the mean and the standard deviation. The mean and standard deviation are often written as mean ± SD: 67.8 ± 3.6 kg, for example.
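A summary in this form could be produced along the following lines. This is a sketch: the function name summarize and the sample data are hypothetical, and I have assumed the sample SD (n - 1 in the denominator), which is what statistics.stdev computes.

```python
import statistics


def summarize(values, unit=""):
    """Describe a set of numbers as sample size and mean ± SD."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
    return f"n = {n}, {mean:.1f} ± {sd:.1f} {unit}".strip()


# Hypothetical weights in kg.
print(summarize([61.0, 64.5, 66.0, 68.5, 70.0, 74.0], unit="kg"))
```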

In dealing with the spread in a bunch of numbers, we often think about the numbers as representing values of some characteristic, such as weight, for different subjects. But the bunch of numbers could represent the weight of a single subject measured many times. We talk about between-subject variation and within-subject variation to distinguish between these two types of spread. Within-subject variation comes up soon as a useful measure of reliability.
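The distinction can be illustrated with a small made-up data set of three subjects, each weighed three times. Taking the SD of the subjects' means as the between-subject spread, and the SD of each subject's repeated measurements as the within-subject spread, is a simplification of my own for this sketch, not necessarily how these quantities are defined elsewhere in this book.

```python
import statistics

# Hypothetical repeated weight measurements (kg) for three subjects.
measurements = {
    "A": [61.2, 60.8, 61.0],
    "B": [68.1, 67.9, 68.3],
    "C": [74.0, 73.6, 74.2],
}


def between_subject_sd(data):
    """SD of the subjects' means: the spread between different subjects."""
    subject_means = [statistics.mean(values) for values in data.values()]
    return statistics.stdev(subject_means)


def within_subject_sds(data):
    """SD of each subject's repeated measurements: the spread within a subject."""
    return {name: statistics.stdev(values) for name, values in data.items()}


print(f"between-subject SD: {between_subject_sd(measurements):.1f} kg")
for name, sd in within_subject_sds(measurements).items():
    print(f"within-subject SD for {name}: {sd:.2f} kg")
```

With data like these, the spread between subjects is much larger than the spread within any one subject.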

Last updated 4 July 00