A New View of Statistics 

The important point in Point 1 is to make sure the conditions and subjects in the reliability study are similar to those in the intended experiment. In particular, the time between consecutive pairs of trials in the reliability study should be similar to the time between the pre and post tests in the experiment. For example, if you intend to look at the effects of a two-month nutritional intervention on body fat, the reliability of body fat measurements with two months between measurements will give you a more realistic idea of sample size than the higher reliability you are likely to see if only two hours separate the measurements. With two hours between measurements, the typical error is likely to arise only from technological error: error arising from the apparatus or the operator of the apparatus. With two months between measurements, some subjects will get fatter and some will get thinner, so the typical error will also include biological "error": true variation within the subjects. It's against this background of biological variation and technological error that you measure changes in body fat resulting from your intervention.
Researchers don't devote enough attention to Point 2. I go into this point in detail on several pages: a scale of magnitudes, and formulae for sample size. In summary, for most studies of health, injury, and fitness of normal folks, the smallest effect is 0.2 of the between-subject standard deviation. For studies of athletic performance, the smallest effect is 0.3 to 0.5 of the typical variation (standard deviation) that a top athlete displays from competition to competition.
If only all our measures were as good as those in Point 3! The "modest" sample size based on adequate confidence limits is ~8s²/d² for a crossover, or 4x as many when there is a control group, where s is the noise (typical error or within-subject standard deviation) and d is the signal (smallest worthwhile change). So, when the noise in the measure is negligible compared with the smallest effect (s << d), you can in theory do the experiment with one subject in a crossover and two in a controlled trial (one each in the treatment and control groups). But you should still use ~10 subjects, to be confident that the subjects in your study are representative of a wider population.
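As a quick sketch of the rule of thumb above, the arithmetic is trivial; the function names here are mine, not the author's:

```python
# Sketch of the rule of thumb: sample size ~ 8s^2/d^2 for a crossover,
# and 4x as many for a controlled trial, where s is the noise (typical
# error) and d is the signal (smallest worthwhile change).
import math

def crossover_n(s, d):
    """Approximate number of subjects for a crossover."""
    return math.ceil(8 * s**2 / d**2)

def controlled_trial_n(s, d):
    """Approximate total subjects for a controlled trial: 4x the crossover."""
    return 4 * crossover_n(s, d)

# When the noise equals the signal (s = d), crossover_n(1, 1) gives 8
# subjects and controlled_trial_n(1, 1) gives 32, matching the text.
```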
Most often the noise is greater than the smallest signal, as in Point 4. The noise comes either from technological error, or from random real changes in the subjects over the time frame of the study, or from individual responses to the treatment. Whatever the source of the noise, acceptable precision for the smallest effect demands either a large sample size (>>8 in a crossover; >>32 in a controlled trial) or several pre and post tests on each subject. Extra pre tests and post tests effectively reduce the noise in the measure, because you analyze the change between the average of the pre tests and the average of the post tests. It's the only option when your pool of subjects is limited.
It's certainly a good idea to do a reliability study before an experiment, either to estimate sample size or to make sure you've got your techniques right. But if you are reasonably confident about the techniques, I advocate getting stuck straight into the experiment. As I explain in sample size on the fly, if your treatment turns out to have a big effect, you needn't have done all the extra testing to get adequate precision.
When individual responses are present, subjects show a greater variability in the post-pre difference score. Analysis of the experimental group as a reliability study therefore yields an estimate of the typical error inflated by individual responses. Comparison of this inflated typical error with the typical error of the control group or with the typical error from a reliability study allows you to estimate the magnitude of the individual responses as a standard deviation (2.5% in the above example). If the experiment consists of a pretest, an intervention, and a posttest, the estimate is readily derived from basic statistical principles as root(2s_expt² − 2s²), where s_expt is the inflated typical error in the experimental group, and s is the typical error in the control group or in a reliability study. For example, if the typical error in the experimental group is 2%, and the typical error in the control group or in a reliability study is 1%, the standard deviation of the individual responses is 2.5% (= root6).
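A minimal sketch of that calculation, assuming the root(2s_expt² − 2s²) formula above (the function name is mine):

```python
import math

def individual_responses_sd(s_expt, s_control):
    """SD of individual responses: root(2*s_expt^2 - 2*s_control^2),
    where s_expt is the (inflated) typical error of the experimental
    group and s_control that of the control group or a reliability study."""
    return math.sqrt(2 * s_expt**2 - 2 * s_control**2)

# With typical errors of 2% (experimental) and 1% (control), this
# returns root6, about 2.45%.
```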
If you use the typical error from a reliability study to estimate the individual responses in your experiment, make sure the reliability study has a time frame and subjects similar to those in your experiment. And if your experiment is a crossover, there is no control group, so you have to use the typical error from a reliability study. Alternatively, use a complex crossover in which your subjects do several tests for each of the treatments.
You can also use mixed modeling to estimate individual responses. It's awfully complicated, but the extra effort is worth it, because you also get confidence limits for the estimate. When individual responses are present, the obvious next step is to identify the subject characteristics that predict the individual responses. The appropriate analysis is repeated-measures analysis of covariance, with the likely subject characteristics (e.g., age, sex, fitness, genotype) as covariates. Follow this link for more.
Typical Error for Assessing Individuals
When you wanted the sample
size for an experiment, it was important to use an estimate of reliability
from a reliability study with the same time between trials as in the experiment.
But for a single measurement or a change in a measurement on an individual,
you need an estimate of reliability with the minimum of biological variation.
The period between measurements in the reliability study therefore needs to
be brief. By brief I mean a period over which you wouldn't expect any
real change in the variable you are assessing. For retests of skinfolds, brief
could be an hour; anything longer and changes in the subject's posture or state
of hydration might affect the measurements. For retests of physical performance,
leave just enough time for all your subjects to recover from the fatigue of
the previous test.
If there is a systematic change in the mean in the reliability study, do you take that into account in your subsequent assessments? In general, no, because changes in the mean in the reliability study will usually be due to changes within the subjects.
In what follows, I often refer to the true value of a subject's measurement. By true I mean the value free of typical error, which is the value you would get if you took hundreds of measurements on the subject and averaged them. There might still be a systematic error in this "true" measurement, but you would need to do a validity study to sort that out. That kind of systematic error is less likely to be a problem when you are interested in a change or difference between measurements, because the error will tend to disappear when you subtract one measurement from another.
A Single Measurement
You measure a gymnast and find a sum of seven
skinfolds of 45.2 mm. The true value won't be exactly 45.2 mm, so one
way to take measurement error into account is to specify a likely range or
limits for the true value: a range within which the true value is likely
to fall (for example, 42.2 to 48.2 mm). Likely can be anything we like.
In research projects we usually opt for 95% likely, and later on I devote a
whole page to the concept of confidence limits
for generalizing from a sample to a population. The meaning is much the same
here; the only difference is that we're talking about an individual rather than
an average effect in the population. The 95% confidence or likely limits for
an individual's true value have a 95% chance of enclosing that individual's
true value. Or you can say the odds are 19 to 1 that the subject's true value
will be within the range. You get 95% limits by multiplying the typical error
by about ±2.0. Let's say your typical error
is 1.5 mm for the sum of seven skinfolds on a sample of female gymnasts similar
to your subject. The true value of the skinfold sum is therefore 95% likely
to be within 45.2 ± 2.0x1.5, or 42.2 to 48.2
mm.
Do you tell the gymnast the 95% likely range? No, probably not. A certainty of 95% may be OK for research, but it's too much for assessing an individual. The range represented by ±1.0x the typical error (a 68% range, or 2 to 1 odds of enclosing the true value) is probably the best default way to convey your uncertainty about the true value. It's certainly the easiest to use! You just say to the gymnast, "the odds are 2 to 1 that your real skinfold thickness is between 45.2 ± 1.5, or 43.7 to 46.7 mm". If you are feeling more cautious, say instead "the odds are 9 to 1 that your real skinfold thickness is between 45.2 ± 2.5, or 42.7 to 47.7 mm." The table below summarizes the likely ranges, the odds, and the factors to multiply by your typical error. You can also use a spreadsheet for precision of a subject's true value.
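If you'd rather generate these factors than look them up, here is a sketch using the normal distribution (this assumes the typical error is known exactly; the t-based correction for small reliability studies comes later):

```python
from scipy.stats import norm

def factor_single(prob):
    """Multiplier of the typical error giving the likely limits
    for a single measurement (two-tailed normal quantile)."""
    return norm.ppf(0.5 + prob / 2)

def factor_change(prob):
    """Multiplier for a change or difference score: root2 times
    the factor for a single measurement."""
    return 2**0.5 * factor_single(prob)

# factor_single(0.95) is about 2.0; factor_change(0.95) is about 2.8,
# the multiplier behind the 95% limits of agreement.
```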
Factors for generating likely (confidence) limits for the true value of a single measurement or of a difference or change in a measurement. "Likely" is defined by several values of probability or odds. 

 Likelihood   Odds   Factor for a          Factor for a change
                     single measurement    or difference
    50%        1:1        0.67                  0.95
    68%        2:1        1.00                  1.41
    80%        4:1        1.28                  1.81
    90%        9:1        1.65                  2.33
    95%       19:1        1.96                  2.77^{a}
    99%       99:1        2.58                  3.64

^{a}This factor generates the 95% limits of agreement. 
When the typical error is given as a percent, an approach similar to the above is usually accurate enough. For example, if the typical percent error is 3.0%, the 68% likely range of the true value of a single measurement is ±1.0x3.0 = ±3.0% of the observed value. If you get percent limits of 10% or more, this method becomes less accurate, so you have to use log transformation. But don't worry, it's all taken care of in the spreadsheet.
The factors shown in the table are values of the t statistic for the given probability. The factors get a bit larger for typical errors based on smaller sample sizes, reflecting more uncertainty about the magnitude of the typical error from smaller samples. For 20 subjects measured twice, the factors are accurate enough. If you assess subjects frequently, you should estimate the typical error of your measurement from a larger amount of retesting; otherwise you're likely to mislead all your subjects about the accuracy of their assessments through using an estimate of typical error that is much higher or much lower than the true typical error. See below for more on this issue.
The other way to take error into account when you assess a subject is to specify the likelihood (probability or odds) that the subject's true value is greater than (or less than) a reference value. This method is better for changes in a measurement between tests, but I'll illustrate it here with a simple example. If a skinfold thickness of 42 mm or more had some special significance, you could say to the gymnast "there's a 98% chance that your skinfolds are thicker than 42 mm", or "odds are 50 to 1 that your skinfolds are thicker than 42 mm". The probability and odds come straight from the first example shown on the spreadsheet.
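A sketch of that likelihood calculation, assuming normally distributed measurement error (the function name is mine, not the spreadsheet's):

```python
from scipy.stats import norm

def chance_true_above(observed, reference, typical_error):
    """Chance that a subject's true value exceeds a reference value,
    given one observed value and the typical error."""
    return norm.cdf((observed - reference) / typical_error)

# For the gymnast: chance_true_above(45.2, 42.0, 1.5) is about 0.98,
# i.e. a 98% chance her true skinfold sum is above 42 mm.
```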
Monitoring for
a Change between Measurements
The uncertainty in a change between measurements
is more than in a single measurement, because a change involves two measurements,
each of which has error. But you double the variance, not the typical error,
so the typical error in a change score is root2 times the typical error. The
likely limits for a change in a measurement are therefore root2 times the limits
for a single measurement. See the table above for the factors corresponding
to the different likelihoods. I have incorporated these factors into the spreadsheet.
For an example, let's measure our gymnast again, one month later. Her skinfolds
were 45.2, but now they're 48.5 mm. The coach wants to know if she is really
putting on fat. What do you tell the coach?
First, let's try likely limits. As before, let's assume the typical error is 1.5 mm. The easiest likely limits to calculate for a change score are the 50% limits: simply plus or minus the typical error. The observed change is 3.3 mm, so you'd say there's a 50% chance, or odds of 1:1, that the true change is between 3.3 - 1.5 and 3.3 + 1.5, or 1.8 and 4.8 mm. If we opt for a range that has odds of 4:1 of including the true change (an 80% likely range), the limits are 3.3 ± 1.81x1.5, or 0.6 and 6.0 mm. And so on. Fine, but what percent limits should you use in these practical situations, and how do you use them to decide whether a real change has occurred? Rather than try to answer these hard questions, I will take you through a better method of assessing change.
The better method is based on calculating the likelihood that the true change is bigger than a reference value. For the reference value, you choose the smallest clinically important or worthwhile change. In the above example (observed increase of 3.3 mm, typical error of 1.5 mm), let's say that an increase in skinfolds of 2.0 mm is the smallest change worth worrying about. Obviously, the gymnast's observed change of 3.3 mm is already more than 2 mm, but how likely is it that the true change is more than 2 mm? From the spreadsheet, the likelihood is 73%, or odds of 3 to 1. We should also work out the likelihood that the gymnast's skinfolds have actually decreased (even though we observed an increase). The smallest worthwhile decrease would be -2.0 mm. From the spreadsheet, the chance that the true change is less than -2.0 (a real decrease of more than 2.0 mm) is only 1%, or odds of 1:136. Your advice to the coach? "Odds are 3 to 1 there's been a substantial increase in skinfold thickness, and there's a negligible chance that her skinfolds have decreased. You can assume she's fatter."
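Here is a sketch of those chances; the root2 comes from the error of a change score, and the function name is mine:

```python
from scipy.stats import norm

def chances_of_change(observed, smallest, typical_error):
    """Chances that the true change is a substantial increase, trivial,
    or a substantial decrease. The error of a change score is
    root2 x the typical error."""
    sd_change = 2**0.5 * typical_error
    p_up = norm.sf((smallest - observed) / sd_change)
    p_down = norm.cdf((-smallest - observed) / sd_change)
    return p_up, 1 - p_up - p_down, p_down

# Gymnast example: chances_of_change(3.3, 2.0, 1.5) gives roughly
# (0.73, 0.26, 0.01): a 73% chance of a real increase and about a 1%
# chance of a real decrease, as in the text.
```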
This example is reasonably clear-cut, mainly because the typical error or noise (1.5 mm) is somewhat less than the smallest important change (2 mm). Basically, our measure is precise relative to any changes that matter, so any changes we observe with such a measure are trustworthy. But what if the noise is about equal to the smallest signal? The PowerPoint presentation has a couple of examples for an arbitrary variable with a typical error of 1.0 and a smallest important effect of 0.9. If the observed effect is 1.5, chances are 66% the true effect is clinically positive, 29% the true effect is clinically trivial, and 5% the true effect is clinically negative. It's reasonable to conclude the true effect is (probably) clinically positive. If the observed effect is a clinically trivial 0.5, the likelihood that the effect really is trivial is only 45%, whereas there's a 55% chance something really worthwhile has happened (39% positive, 16% negative). You can conclude that maybe nothing has happened, but acting on it would depend on the relative costs and benefits of taking action or doing nothing.
When the typical error is much greater than the smallest worthwhile change, we will often observe clinically worthwhile changes that are due to error of measurement rather than to any real change. The measure is therefore too noisy to be useful. The chances that real positive or negative changes have occurred (using the spreadsheet) confirm this state of affairs. For example, if the typical error is three times the smallest clinically worthwhile change, and we observe the smallest worthwhile change, the chance of a real positive change having occurred is 50%, or odds of 1:1, but the chance of a real negative change having occurred is 32%, or odds of 1:2.
Noisy measures can still be useful for characterizing worthwhile changes smaller than the noise, but we have to reduce the noise by performing multiple pre and post tests; we then either compare means of the pre and post tests or look for a trend across all the tests. On the other hand, observed changes greater than the typical error may still be trustworthy, if you expected them. In the present example, even a change equal to the typical error (three times the smallest worthwhile change) has likelihoods of a true positive value (68% or 2:1) or a true negative value (17% or 1:5) that would satisfy a practitioner who was expecting such a large change in the subject. But if true changes of such large magnitude are unlikely, we should be prepared to discount large observed changes as measurement error.
By basing our assessment partly on the change we think we're likely to see, we are assessing the individual in a Bayesian fashion. Bayesian analysis is a quantitative method for taking into account our prior belief about something, in this case the subject's true value or change in the true value. Experienced clinicians and practitioners adopt this approach qualitatively when they reject unlikely test results. Bayesian analysis ostensibly allows this kind of decisionmaking to be quantitative. But how can we quantify strength of a belief? For example, if we believe a change couldn't be outside ±3, where does the ±3 come from, and what likely limits define couldn't? 80%, 90%, 95%, 99%... ? At the moment I can't see a satisfactory answer to these questions, but whatever, I have included Bayesian adjustment for the likelihoods and likely limits in the spreadsheet. It took me so long to do, I'd hate to think the time was wasted!
Putting all these examples together with lots of deep thought, I came up with the bullet points at the start of this section on assessing an individual. Go back there now, read them again, and make sure you understand and learn them.
Some researchers have tried to use limits of agreement to make decisions about change in an individual. According to these researchers, you can trust an observed change only if it's greater than the limits of agreement. But limits of agreement are so big (2.8 typical errors) that clinically important trustworthy changes often fall within them. You end up having to ignore changes in your subjects that in some settings might be life-threatening! No, we must abandon limits of agreement as a clinical tool.
Comparing Individuals
All the above calculations for the change
in a single subject's measurements also apply to making decisions about the
difference between two subjects. In the above example, the second measurement
of skinfold thickness (48.5 mm) could have been a measurement of skinfold thickness
of another subject. Your conclusion would be that the second subject has skinfolds
3.3 mm thicker than the first, with odds of 4 to 1 that the real difference
in skinfold thickness is between 0.6 and 6.0 mm. Better still, you could say
that the odds of a real difference in skinfold thickness (more than 2 mm) are
3 to 1.
Spreadsheet
for Assessing an Individual
In this spreadsheet I use the typical error
of measurement and a subject's observed value to estimate likely limits for
the subject's true value and to estimate the likelihood that the subject's true
value is greater than a reference value. I do the same for the change between
two observed values. I also include likelihoods and likely limits for the estimate
of a true criterion value derived from a validity study.
Finally, I've gone to a lot of probably pointless trouble to add Bayesian adjustments
in a second spreadsheet (part of the same file).
Precision of the estimate of a subject's value: Excel latest 

Systematic changes in the mean can be an issue when comparing measures: in general, the bigger the changes between trials, the less desirable the measure. But comparing the typical errors is much more important, because the equipment, protocol, or anthropometrist that produces a measure with less typical error is providing a more accurate measure. Retest correlation contains the typical error, but the fact that it also contains the between-subject standard deviation makes the comparison of correlations either noisy (when there are different subjects in the two reliability studies) or computationally difficult (when the same subjects are in both studies). Besides, there is no point in comparing retest correlations if you have already compared typical errors. I therefore will not deal with comparison of retest correlations.
When setting up a study to compare typical errors, keep in mind that the typical error always consists of biological variation arising from the subjects and technological variation arising from the items. The aim is usually to compare the technological variation, so try to make the biological variation as small as possible. For example, when comparing the reliability of two anthropometrists, you would get them to measure the same subjects within an hour, to avoid any substantial biological variation. Similarly, when comparing the reliability of power provided by two ergometers, use athletes as subjects, because they are usually more reliable than nonathletes.
Comparing the reliability of two items (protocols, equipment, or operators) is straightforward when different subjects are used to get the reliability for each item. Confidence limits for the ratio of the typical errors between corresponding trials in the two groups can be derived from an F ratio. Use Item 4 in the spreadsheet for confidence limits for this purpose. To compare changes in the mean between corresponding pairs of trials for the two measures, you will need to use an unpaired t test of the change scores. Using the same subjects has more power but requires analysis by an expert. (The analysis needs a mixed model, in which the equipment is a fixed effect, trial number is a fixed effect, subjects is a random effect, and a dummy random variable is introduced to account for the extra withinsubject variance associated with measures on one of the items. Confidence limits for the extra variance tell you how different the typical errors could be. The model also provides an estimate of the difference in changes in the mean between the items, or you can use a paired t test.)
In the previous section I said that the 95% likely range is too conservative for assessing individuals, and I also said that it's difficult to decide on what percent range to use. The same argument and difficulty applies for comparison of typical errors in a clinical or field setting. It won't hurt to calculate, say, the 80% likely range, but I think clinicians and practitioners (and you!) will have a better chance of understanding what I'm getting at if you use likelihood that one typical error is substantially smaller or larger than the other. You compare typical errors by dividing one by the other, to get a ratio. A ratio of 1.1 or maybe 1.2 is my best guess at the minimum worthwhile difference in reliabilities, so you calculate the likelihood (as a probability or odds ratio) that one measure has a typical error at least 1.1x (or 1.2x) bigger than the other. It's all on the spreadsheet for confidence limits.
Phew! Let's see what sample size we'll need to estimate reliability for each of its applications. The applications are: estimating sample size for an experiment, comparing reliability of different measures, estimating individual responses in an experiment, and assessing an individual. We'll assume modest reliability: a typical error of the same order of magnitude as the smallest change that matters to subjects. We'll find that a sample of 50 subjects tested three times gives reasonable precision for the estimate of the typical error. That's assuming you can combine the data for all three trials to estimate the typical error. If there is a substantial learning or practice effect on the typical error between the first and second trials, you will need another trial (four in total) so you can combine the last three.
Sample Size for Reliability
Studies...
...for Estimation of Sample Size for an Experiment
When you use a value of the typical error
to estimate the sample size for an experiment, uncertainty
in the typical error translates into uncertainty in the sample size you will
need for the experiment. Sample size for an experiment is inversely proportional
to the square of the typical error, so uncertainty in the typical error balloons
into much bigger uncertainty in sample size for an experiment. You can check
the effect of number of subjects and retests on precision of the typical error
by plugging numbers into the appropriate cells of Item 3 on the spreadsheet
for confidence limits. Give the typical error a value of 1.0, then pretend
you got this value from a reliability study of 10 subjects tested twice
(= 9 degrees of freedom). You will find that the 95% confidence limits for the
true typical error are 0.69 to 1.83; square these and you get the uncertainty
in sample size as factors of 0.47 to 3.33. In other words, if you predicted
a sample size of, say, 40 subjects in the experiment on the basis of a typical
error of 1.0, what you might really need is anything from 19 to 133. Well, that's
far too wide a range! Let's try a reliability study with 50 subjects tested
three times. The range in sample size becomes 31 to 54, which is still quite
a lot of uncertainty, but I guess it's OK.
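To reproduce those limits, here is a sketch using the chi-squared distribution for the sampling variation of a standard deviation (the function name is mine; degrees of freedom are roughly (subjects - 1) x (trials - 1), as in the examples above):

```python
from scipy.stats import chi2

def typical_error_limits(s, dof, conf=0.95):
    """Confidence limits for the true typical error, given the observed
    typical error s and its degrees of freedom."""
    lower = s * (dof / chi2.ppf(0.5 + conf / 2, dof)) ** 0.5
    upper = s * (dof / chi2.ppf(0.5 - conf / 2, dof)) ** 0.5
    return lower, upper

# 10 subjects tested twice (9 dof): an observed typical error of 1.0 has
# 95% limits of about 0.69 to 1.83. Square these (0.47 to 3.3) to get the
# uncertainty in the sample size for the experiment.
```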
This calculation is based on 95% limits of uncertainty for the typical error, which may be a bit too conservative for the likely limits of the sample size in the experiment. If instead we use 67% likely limits, we end up with something more like the typical variation in the estimate of sample size based on the reliability study. For a reliability study of 10 subjects tested twice, the typical variation in our estimate of sample size would be, for example, 28 to 72. Still too wide. Test them three times and you get 30 to 59. That's better, but the required sample size could easily be outside these 67% limits.
So what's my advice? If you have the time, money, and subjects for a large reliability study, go for it. Otherwise you're better off devoting your resources to the experiment by using sample size on the fly: stop testing subjects when you have adequate precision for the effect.
Sample Size for Reliability
Studies...
...for Comparing Reliability of Measures
When you want to compare the reliability of
two measures, the worstcase scenario is that you observe similar reliabilities
for the two measures. (You might see why this is worstcase in a minute.) In
this scenario, you want to conclude that there are no substantial differences
between the measures. The easiest way to compare typical errors is to compute
their ratio and its confidence limits. You will then be able to conclude there
is no substantial difference if the upper limit of the ratio is only a little
greater than 1.00 and the lower limit is only a little less than 1.00. Let's
generate some confidence limits for the ratio using Item 4 (ratio of standard
deviations) in the spreadsheet for confidence
limits. Make the two typical errors the same (e.g. 3.0), and pretend each
has come from a study with 100 degrees of freedom (51 subjects, 3 trials). You'll
see that the 95% confidence limits for the ratio of the typical errors are 0.82
to 1.22. In other words, the true values of the typical errors could be about
20% different from each other. That amount of uncertainty is marginal, in my
view, but again, 95% confidence limits are probably too stringent in a reallife
situation where you are choosing between two items of equipment. The 80% confidence
limits for the ratio are 0.88 to 1.14, which make me feel more comfortable about
concluding that there is no real difference in the reliability of the two measures.
I feel even more comfortable looking at the likelihood that the true value of
the ratio is greater than 1.2: it's only 3%, or odds of 1 in 28. There is no
substantial difference in the reliability of these two measures, if by "substantial"
we mean different by a factor of 1.2 or more.
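A sketch of that likelihood, using the F distribution for a ratio of variances (the function name is mine, not the spreadsheet's):

```python
from scipy.stats import f

def chance_ratio_above(observed_ratio, reference_ratio, dof1, dof2):
    """Chance that the true ratio of two typical errors exceeds a
    reference ratio, given the observed ratio and the degrees of
    freedom of each typical error."""
    return f.sf((reference_ratio / observed_ratio) ** 2, dof1, dof2)

# Equal observed typical errors with 100 dof each: the chance the true
# ratio exceeds 1.2 is only about 3%, as in the text.
```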
Things aren't so bad when you observe a big difference between the typical errors of the measures, because you will need fewer subjects to conclude that one really is substantially worse (larger) than the other. Try it for yourself with the spreadsheet: make the observed typical errors 2.0 and 3.0, give them both only 20 degrees of freedom, make the reference ratio 1.15, say, then look at the likelihood that one typical error is substantially greater than the other: 88%, or odds of 7:1. Not much doubt about it: they're different!
Finally, if you can use the same subjects for both reliability studies, you're bound to get better precision for the ratio and therefore a reduction in sample size required to make firm conclusions about the relative magnitudes of the typical errors. Sorry, I haven't worked out how big the reduction is yet. You can't do it with the spreadsheet; you have to use mixed modeling or bootstrapping.
Sample Size for Reliability
Studies...
...for Estimating Individual Responses
Estimation of individual responses to a treatment
boils down to a comparison of the typical errors of two groups (the treatment
and control groups), so the sample size must be the same as for a comparison
of the reliability of two measures.
Sample Size for Reliability
Studies...
...for Assessing an Individual
At first glance it appears you can use as
few as 20 subjects and two trials to estimate a typical error without substantially
degrading the precision of an individual assessment. Check the spreadsheet
for precision of a subject's true value to see what I mean. In Item 1, put
in an observed value of 50, a typical error of 2.0 from two trials, and compare
the likely limits for the subject's true value when the typical error is based
on 20 subjects vs 2000 subjects. With 20 subjects the 80% likely limits for
the subject's true value are 47.3 to 52.7, or 50 ± 2.7; for 2000 subjects
the limits are 47.4 to 52.6, or 50 ± 2.6. In other words, there's a negligible
increase in the likely limits (= loss of precision) for the smaller sample size.
But wait a minute... the typical error based on a sample of 20 subjects and
two trials is really noisy. Check the spreadsheet
for confidence limits and you'll see, for example, that a typical error
of 2.0 has 95% likely limits of 1.5 to 2.9. That's a big range in precision.
What gives?
Well, 20 subjects and two tests definitely give you almost as much accuracy as a zillion subjects and tests, and that's fair enough if you are assessing only one individual. If another clinician tested 20 subjects twice, then assessed another individual, it would be the same story. But there's likely to be a big difference between your typical errors; for example, yours might be 2.5, and the other clinician's might be 1.7. Your assessments of, say, 80% likely limits based on a typical error of 2.5 would really be ~90% likely limits, while the other person's 80% likely limits based on a typical error of 1.7 would be ~70% likely limits. You're both giving misleading assessments, and so would many other clinicians who tested only 20 subjects twice. Yet averaged over all clinicians and all subjects, the true values of 80% of subjects would be within the likely limits that each clinician tells each subject. The trouble is that your assessments will be consistently misleading, if you are unlucky enough to get a typical error of 2.5 or 1.7 with your batch of 20 subjects. A typical error based on 50 subjects and three tests would usually be in the range of 1.8 to 2.2, and if you used 2.2 in your assessments, your 80% likely limits would be less than 85% limits in reality, which seems OK to me. But I'm still thinking about it...
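A sketch of that comparison, using the t distribution to allow for uncertainty in the typical error itself (degrees of freedom as before; the function name is mine):

```python
from scipy.stats import t

def likely_limits(observed, typical_error, dof, prob=0.80):
    """Likely limits for a subject's true value when the typical error
    comes from a reliability study with finite degrees of freedom."""
    factor = t.ppf(0.5 + prob / 2, dof)
    return observed - factor * typical_error, observed + factor * typical_error

# 20 subjects, two trials (19 dof): likely_limits(50, 2.0, 19) gives about
# (47.3, 52.7); with 2000 subjects the limits barely narrow, to about
# (47.4, 52.6), as in the text.
```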