New View of Statistics: Validity

A New View of Statistics

© 2000 Will G Hopkins


Summarizing Data:
PRECISION OF MEASUREMENT continued

APPLICATIONS OF VALIDITY
The applications are: tweaking up the sample size for a cross-sectional study, assessing an individual to predict her/his criterion value, comparing the validity of measures (to select a good one), and deciding whether a measure is valid enough for monitoring for changes in an individual's criterion value. I also consider sample size for validity studies on this page.

Sample Size for a Cross-Sectional Study
Just as reliability affected sample size in experimental or longitudinal studies, validity impacts sample size in descriptive or cross-sectional studies. In such studies, you measure each variable only once, and your outcomes are relationships between the variables. The lower the validity, the more the relationships are degraded, so the bigger the sample size you need to characterize them. For this application it's easier to discuss the effects of validity by considering the validity correlation rather than the typical error of the estimate.

The effect on the magnitude of the relationship between variables is proportional to the validity correlation of each variable. For example, suppose you are interested in the relationship between physical activity and health, and suppose that the true underlying relationship corresponds to a correlation of 0.50. If your measure of physical activity has a validity correlation of 0.7, then in your study of health and physical activity you will observe a correlation of only 0.5x0.7, or 0.35 (plus or minus sampling error, of course). The sample size required to detect a degraded relationship is inversely proportional to the square of the validity correlation coefficient of each variable in the relationship. In our example, 1/0.70² = 2.0, so you have to double the number of subjects. That's bad news, because most psychometric and subjective behavioral measures appear to have validities of 0.7 at best. Objective measures taken on lab instruments or in the field usually have validities of 0.8-0.9 or better, so you can often ignore the effect of validity of such variables on the magnitude of effects and the required sample size. Go to the section on sample size for cross-sectional studies for more information about the actual sample sizes you need.
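
The arithmetic in the example above can be sketched in a few lines of Python. The function names are illustrative only (they are not part of the spreadsheets for this site), and the sketch assumes the simple multiplicative model described in the paragraph above.

```python
def observed_correlation(true_r, *validities):
    """Degrade a true correlation by the validity correlation of each variable."""
    r = true_r
    for v in validities:
        r *= v
    return r


def sample_size_factor(*validities):
    """Factor by which sample size must grow: 1 / (product of squared validities)."""
    factor = 1.0
    for v in validities:
        factor /= v ** 2
    return factor


# Physical activity (validity 0.7) vs health, true correlation 0.50.
print(round(observed_correlation(0.50, 0.7), 2))  # 0.35
print(round(sample_size_factor(0.7), 2))          # 2.04, i.e. about double
```

Passing a second validity (for example, one for the measure of health) multiplies the degradation further, which is why two measures with validities of 0.7 would roughly quadruple the required sample size under this model.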

Assessing an Individual
When you use a prediction equation to estimate a criterion value from a practical value (e.g., body fat from a sum of skinfolds), you should take into account the typical error of the estimate in much the same way as you do the typical error of measurement for a single measurement. You use the same factors to generate the likely range of the predicted value of the criterion (the factors for a single measurement in the table), but you multiply them by the typical error of the estimate. If the typical error is based on a study of fewer than 50 subjects, you will need to use a new-prediction error instead of the typical error, as explained earlier. The calculations are in the appropriate section of the spreadsheet for assessing an individual.

Example: You measure a client's skinfolds. You dig around in the literature and find an estimation equation that was developed for predicting body fat as a percent of body mass (%BM) in a large number of subjects similar to your client. The client's predicted body fat is 26.4 %BM, and the typical error of the estimate for the equation based on a large sample of similar subjects is 2.1 %BM. From the table in the section on reliability, the factor to multiply by the typical error for an 80% likely range is 1.28, which makes the limits 26.4 ± 1.28x2.1, or 23.7 to 29.1. You say to the client: "Your predicted body fat is 26.4 %BM, but the odds are 4 to 1 that your true (DEXA) body fat is somewhere between 24 and 29 %BM." Use the spreadsheet to generate these limits, and also the likelihood that the client's true value is greater than some reference value. For example, the likelihood that her true body fat is greater than 25 %BM is 74%, or odds of 3 to 1.
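
For readers without the spreadsheet to hand, here is a minimal sketch of the same calculation in Python. It assumes the error in the prediction is normally distributed with a standard deviation equal to the typical error of the estimate (reasonable only when the equation comes from a large sample), and the function names are my own for illustration. The spreadsheet may give slightly different numbers.

```python
from statistics import NormalDist


def likely_range(predicted, typical_error, level=0.80):
    """Likely range for the true criterion value, assuming normal errors."""
    # z is the multiplying factor, e.g. about 1.28 for an 80% range.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * typical_error
    return predicted - half_width, predicted + half_width


def chance_true_value_exceeds(reference, predicted, typical_error):
    """Likelihood that the true criterion value is greater than a reference."""
    return 1.0 - NormalDist(mu=predicted, sigma=typical_error).cdf(reference)


lower, upper = likely_range(26.4, 2.1, level=0.80)
print(round(lower, 1), round(upper, 1))  # 23.7 29.1

# Roughly three chances in four, close to the likelihood quoted above.
print(chance_true_value_exceeds(25.0, 26.4, 2.1))
```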

Comparing Validity of Measures
Just as the typical error of measurement was the best measure for comparing reliability of instruments, operators, or protocols, the typical error of the estimate is the best measure for comparing their validity. Do not compare the new-prediction errors, however derived: these are appropriate only for assessing individuals. As I explained with comparing measures of reliability, use the spreadsheet for confidence limits to calculate 80% or 90% likely ranges for the ratio of typical errors determined with different subjects, and to get likelihoods for the true ratio being greater than a reference ratio. Get an expert to use mixed modeling to estimate likely ranges when the same subjects are used to determine the typical errors.

Validity for Monitoring Changes
Our discussion of validity thus far has been concerned with the validity of a single measurement on an individual. But we often use a practical measure to monitor for changes in a criterion measure. For example, we use changes in skinfolds to infer that there have been changes in a subject's body fat. You might think that changes in skinfolds would have to reflect changes in body fat, but what if the amount of non-fat tissue in a skinfold is affected substantially by the subject's state of hydration or the menstrual cycle? In this situation a change in skinfold thickness may or may not represent a change in body fat, so skinfold thickness would no longer be a trustworthy measure for tracking body fat.

How do we decide whether skinfolds or some other practical measure is trustworthy? There are three approaches: correlation of spontaneous changes, correlation of induced changes, and correlation of original variables. The reliability of the practical and criterion measures usually has to be taken into account, so the statistics get quite complex. That might explain why no-one has yet published an adequate account of any of these approaches. I will therefore restrict this section to a qualitative overview.

Correlation of Spontaneous Changes
The obvious way to see how well changes in a practical measure track changes in a criterion measure is to measure some subjects, wait long enough for spontaneous changes to occur in some of them, measure them again, then plot the changes in the criterion measure against changes in the practical measure. If you get a very strong correlation (>0.95) you know the practical measure is trustworthy. The trouble is, you usually get a low correlation. Why? Because the real changes between measurements are usually of the same order of magnitude as the noise (the typical errors) in each measurement. The change scores for each measure therefore have a big contribution from the typical errors, which are random and uncorrelated, so the correlated true changes get lost in the noise in your plot of the change scores. You can estimate what the true correlation would be with the typical errors out of the picture, but if the observed correlation is poor, you will need hundreds of subjects to get enough precision for the estimate of the true correlation to decide whether the practical measure is any good.
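
To get a feel for how badly the typical errors can swamp the correlation, here is a rough sketch based on the classical attenuation model. It is an illustration under stated assumptions, not a published method: errors are taken to be random, independent of each other and of the true changes, and each change score is the difference of two measurements, so its error variance is twice the squared typical error. The function and parameter names are hypothetical.

```python
import math


def observed_change_correlation(true_r,
                                sd_true_change_practical, te_practical,
                                sd_true_change_criterion, te_criterion):
    """Expected correlation between observed change scores.

    true_r is the correlation between the true (error-free) changes.
    te_* are the typical errors of a single measurement.
    """
    def signal_fraction(sd_true_change, typical_error):
        # A change score is a difference of two noisy measurements,
        # so its noise variance is 2 x (typical error)^2.
        noise_var = 2 * typical_error ** 2
        return sd_true_change ** 2 / (sd_true_change ** 2 + noise_var)

    return true_r * math.sqrt(
        signal_fraction(sd_true_change_practical, te_practical)
        * signal_fraction(sd_true_change_criterion, te_criterion)
    )


# True changes track perfectly, but are the same size as the typical errors.
print(round(observed_change_correlation(1.0, 1.0, 1.0, 1.0, 1.0), 2))  # 0.33
```

Under these assumptions, even a practical measure that tracks the criterion perfectly shows an observed correlation of only about 0.33 when the true changes are no bigger than the typical errors.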

Correlation of Induced Changes
Another approach is to make large changes happen by giving some kind of treatment to half your subjects. You then see how well the practical measure tracks the criterion measure in that half relative to the other half by correlating the change scores of all the subjects together. Even if you are successful in finding an effective treatment and subjects willing to undergo the treatment, you will have validated the practical measure only for changes induced by that particular treatment. In other words, you still won't know whether the practical measure is good for tracking spontaneous changes or changes brought about by other treatments.

Correlation of Original Variables
The third approach is to analyze data from a standard validity or calibration study. If the correlation between the practical measure and the criterion measure is near enough to perfect (>0.95), the two measures are effectively identical, so changes in the practical measure must track changes in the criterion. All the previous remarks about the correlation between change scores apply to the correlation between raw scores: the observed correlation will usually be a lot less than 0.95, because the correlation between the true values of the practical and criterion measures is degraded by the typical errors; you can estimate the true validity correlation by taking the concurrent retest reliability correlations into account; the true correlation needs to be greater than 0.95; and if the typical errors have a large degrading effect on the correlation, you will need hundreds of subjects in the validity and reliability studies to make a firm conclusion. You also need a reasonably good validity correlation to start with, which you won't get if your subjects are a homogeneous subgroup. Another problem is that even the true correlation between the measures may turn out to be less than 0.95, yet the practical measure will still track changes well. For example, the amount of non-fat tissue in skinfolds might vary between individuals with the same body fat (resulting in a relatively poor correlation between skinfolds and body fat), but the amount of non-fat tissue might not change with hydration status (so changes in skinfolds will still mirror changes in fat). This problem does not arise with the first two approaches, because the constant amount of non-fat tissue in each subject's skinfolds disappears from the change in skinfolds.
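
One standard way to take the retest reliability correlations into account is the classical correction for attenuation (usually attributed to Spearman). I offer it here as a sketch only: it assumes random, uncorrelated errors, the function name is my own, and the numbers in the example are invented purely for illustration.

```python
import math


def disattenuated_correlation(observed_r, reliability_practical,
                              reliability_criterion):
    """Estimate the true validity correlation from the observed one.

    Classical correction for attenuation: divide the observed correlation
    by the square root of the product of the two retest reliabilities.
    """
    return observed_r / math.sqrt(reliability_practical * reliability_criterion)


# Hypothetical values: observed validity 0.85, reliabilities 0.90 and 0.95.
print(round(disattenuated_correlation(0.85, 0.90, 0.95), 2))  # 0.92
```

In this made-up case the estimated true correlation is still short of 0.95, and the estimate inherits the sampling error of all three correlations, which is why so many subjects are needed to reach a firm conclusion.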

Each of these three approaches has its strengths and weaknesses. The third approach is best for a heterogeneous group of subjects, but only if it produces a very high and precise estimate of the true correlation. If the group is homogeneous, or if the true correlation is poor, you will have to use one of the two change-score approaches. Inducing changes with an appropriate treatment may give you a good estimate of the correlation between the change scores, but you end up validating the practical measure only for the treatment you used. The greatest strength of the first approach is that it validates the practical measure for tracking the changes that occur in the normal course of events, but the validation won't be clear cut if the changes are too small.

Sample Size for Validity Studies
As with reliability, sample size for estimation of validity is dictated by the need for precision. In this case precision of the typical error of the estimate or the new-prediction error is the main consideration. You don't have the option of performing more than two tests; instead, you have to get adequate precision by increasing the number of subjects. For a reliability study involving a noisy measure, I recommended a minimum of 50 subjects tested three times. In terms of degrees of freedom (which dictate the precision of estimates of typical error), that is equivalent to about 100 subjects tested twice, so that is the preferred minimum sample size for a validity study of a noisy practical measure.
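
The equivalence in degrees of freedom is easy to check. The sketch below assumes the usual expression for the degrees of freedom of the typical error in a reliability study, (subjects - 1) x (trials - 1); the function name is illustrative only.

```python
def reliability_degrees_of_freedom(subjects, trials):
    """Degrees of freedom for the typical error: (subjects - 1) * (trials - 1)."""
    return (subjects - 1) * (trials - 1)


print(reliability_degrees_of_freedom(50, 3))   # 98
print(reliability_degrees_of_freedom(100, 2))  # 99
```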

When there are several independent variables (regressors) in the prediction equation, an important consideration is ensuring that the typical error is uniform across the range of the regressor (or between subgroups represented by the regressor). Extrapolating from what I said about sample size for comparison of typical errors of measurement, I suggest adding 100 subjects for each extra regressor. (After all, if there are substantial differences in the typical error of the estimate between subgroups, and if the differences are resistant to transformation, you will have to perform separate analyses for each subgroup, each of which will require 100 subjects.) Many published validity studies with multiple regressors have involved several hundred subjects, but I don't think the choice of sample size in those studies was driven by consideration of uniformity of error. Another important consideration is keeping the new-prediction error from increasing substantially. It's easy to show (using Item 3 of the spreadsheet for a subject's true value) that increasing the number of subjects by 50 for each regressor after the first will ensure the new-prediction error is no more than 1% larger than the typical error. No worry there, if you use 100 subjects per regressor.

Last updated 20 Aug 01
