A New View of Statistics 

The observed magnitude of a relationship between variables is degraded in proportion to the validity correlation of each variable. For example, suppose you are interested in the relationship between physical activity and health, and suppose that the true underlying relationship corresponds to a correlation of 0.50. If your measure of physical activity has a validity correlation of 0.7, then in your study of health and physical activity you will observe a correlation of only 0.5x0.7, or 0.35 (plus or minus sampling error, of course). The sample size required to detect a degraded relationship is inversely proportional to the square of the validity correlation coefficient of each variable in the relationship. In our example, 1/0.7^2 = 2.0, so you have to double the number of subjects. That's bad news, because most psychometric and subjective behavioral measures appear to have validities of 0.7 at best. Objective measures taken on lab instruments or in the field usually have validities of 0.8 to 0.9 or better, so you can often ignore the effect of validity of such variables on the magnitude of effects and the required sample size. Go to the section on sample size for cross-sectional studies for more information about the actual sample sizes you need.
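The arithmetic in this paragraph can be sketched in a few lines of Python. The function names are hypothetical. The sketch assumes, as the text states, that the observed correlation is the true correlation multiplied by the validity correlation of each variable, and that the required sample size scales with the inverse square of each validity.

```python
from math import prod

def observed_correlation(true_r, validities):
    """True correlation degraded by the validity correlation of each variable."""
    return true_r * prod(validities)

def sample_size_factor(validities):
    """Factor by which the sample size grows to detect the degraded relationship."""
    return 1 / prod(v ** 2 for v in validities)

# Example from the text: true correlation 0.50, activity measured with validity 0.7.
print(round(observed_correlation(0.50, [0.7]), 2))  # 0.35
print(round(sample_size_factor([0.7]), 1))          # 2.0
```

Passing two validities, for example [0.7, 0.7], compounds the effect and gives roughly a fourfold increase in sample size.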
Example: You measure a client's skinfolds. You dig around in the literature and find an estimation equation that was developed for predicting body fat as a percent of body mass (%BM) in a large number of subjects similar to your client. The client's predicted body fat is 26.4 %BM, and the typical error of the estimate for the equation based on a large sample of similar subjects is 2.1 %BM. From the table in the section on reliability, the factor to multiply by the typical error for an 80% likely range is 1.28, which makes the limits 26.4 ± 1.28x2.1, or 23.7 to 29.1. You say to the client: "Your predicted body fat is 26.4 %BM, but the odds are 4 to 1 that your true (DEXA) body fat is somewhere between 24 and 29 %BM." Use the spreadsheet to generate these limits, and also the likelihood that the client's true value is greater than some reference value. For example, the likelihood that her true body fat is greater than 25 %BM is 74%, or odds of 3 to 1.
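As a rough check on the numbers in this example, here is a minimal Python sketch. It assumes the error of the estimate is normally distributed with a standard deviation equal to the typical error. The function names are hypothetical, and the spreadsheet itself may use a different calculation that this sketch does not attempt to reproduce.

```python
from statistics import NormalDist

def likely_range(predicted, typical_error, level=0.80):
    """Limits that contain the true value with the given likelihood (normal assumption)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.28 for an 80% range
    return predicted - z * typical_error, predicted + z * typical_error

def chance_true_value_exceeds(reference, predicted, typical_error):
    """Likelihood that the true value is greater than a reference value."""
    return 1 - NormalDist(mu=predicted, sigma=typical_error).cdf(reference)

lower, upper = likely_range(26.4, 2.1)
print(round(lower, 1), round(upper, 1))          # 23.7 29.1
print(chance_true_value_exceeds(25, 26.4, 2.1))  # likelihood of true body fat > 25 %BM
```

Under the normal assumption the likelihood comes out at about 75%, close to the 74% quoted above.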
How do we decide whether skinfolds or some other practical measure is trustworthy? There are three approaches: correlation of spontaneous changes, correlation of induced changes, and correlation of original variables. The reliability of the practical and criterion measures usually has to be taken into account, so the statistics get quite complex. That might explain why no one has yet published an adequate account of any of these approaches. I will therefore restrict this section to a qualitative overview.
Correlation of Spontaneous Changes
The obvious way to see how well
changes in a practical measure track changes in a criterion measure
is to measure some subjects, wait long enough for spontaneous changes
to occur in some of them, measure them again, then plot the changes
in the criterion measure against changes in the practical measure. If
you get a very strong correlation (>0.95) you know the practical
measure is trustworthy. The trouble is, you usually get a low
correlation. Why? Because the real changes between measurements are
usually of the same order of magnitude as the noise (the typical
errors) in each measurement. The change scores for each measure
therefore have a big contribution from the typical errors, which are
random and uncorrelated, so the correlated true changes get lost in
the noise in your plot of the change scores. You can estimate what
the true correlation would be with the typical errors out of the
picture, but if the observed correlation is poor, you will need
hundreds of subjects to get enough precision for the estimate of the
true correlation to decide whether the practical measure is any
good.
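To see how quickly the noise swamps the true changes, here is a minimal sketch under classical measurement-error assumptions that the text does not spell out: errors are random, independent between the two measures, and unrelated to the true changes, so each change score carries an error variance of twice the squared typical error. The function name and the numbers are hypothetical.

```python
from math import sqrt

def observed_change_correlation(true_r, sd_true_x, te_x, sd_true_y, te_y):
    """Correlation between observed change scores of two measures.

    sd_true_*: SD of the true changes; te_*: typical error of one measurement.
    Assumption: a change score has error variance 2 * te**2.
    """
    sd_obs_x = sqrt(sd_true_x ** 2 + 2 * te_x ** 2)
    sd_obs_y = sqrt(sd_true_y ** 2 + 2 * te_y ** 2)
    return true_r * (sd_true_x / sd_obs_x) * (sd_true_y / sd_obs_y)

# True changes track perfectly, but are the same size as the typical errors.
print(round(observed_change_correlation(1.0, 1.0, 1.0, 1.0, 1.0), 2))  # 0.33
```

Dividing an observed correlation by the same two ratios gives an estimate of the true correlation, but, as noted above, that estimate is imprecise when the observed correlation is poor.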
Correlation of Induced Changes
Another approach is to make large
changes happen by giving some kind of treatment to half your
subjects. You then see how well the practical measure tracks the
criterion measure in that half relative to the other half by
correlating the change scores of all the subjects together. Even if
you are successful in finding an effective treatment and subjects
willing to undergo the treatment, you will have validated the
practical measure only for changes induced by that particular
treatment. In other words, you still won't know whether the practical
measure is good for tracking spontaneous changes or changes brought
about by other treatments.
Correlation of Original Variables
The third approach is to analyze
data from a standard validity or calibration study. If the
correlation between the practical measure and the criterion measure
is near enough to perfect (>0.95), the two measures are
effectively identical, so changes in the practical measure must track
changes in the criterion. All the previous remarks about the
correlation between change scores apply to the correlation between
raw scores: the observed correlation will usually be a lot less than
0.95, because the correlation between the true values of the
practical and criterion measures is degraded by the typical errors;
you can estimate the true validity correlation by taking the
concurrent retest reliability correlations into account; the true
correlation needs to be greater than 0.95; and if the typical errors
have a large degrading effect on the correlation, you will need
hundreds of subjects in the validity and reliability studies to make
a firm conclusion. You also need a reasonably good validity
correlation to start with, which you won't get if your subjects are a
homogeneous subgroup. Another problem is that even the true
correlation between the measures may turn out to be less than 0.95,
yet the practical measure will still track changes well. For example,
the amount of nonfat tissue in skinfolds might vary between
individuals with the same body fat (resulting in a relatively poor
correlation between skinfolds and body fat), but the amount of
nonfat tissue might not change with hydration status (so changes in
skinfolds will still mirror changes in fat). This problem does not
arise with the first two approaches, because the constant amount of
nonfat tissue in each subject's skinfolds disappears from the change
in skinfolds.
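One common way to estimate the true validity correlation from the retest reliability correlations is the classical correction for attenuation, sketched below. This is an assumption about the method, not something the text specifies, and the function name and example values are hypothetical.

```python
from math import sqrt

def true_validity_correlation(observed_r, reliability_practical, reliability_criterion):
    """Correction for attenuation: observed r divided by the square root
    of the product of the two retest reliability correlations."""
    return observed_r / sqrt(reliability_practical * reliability_criterion)

# Hypothetical values: observed validity 0.85, reliabilities 0.90 and 0.95.
estimate = true_validity_correlation(0.85, 0.90, 0.95)
print(round(estimate, 2))  # 0.92
```

In this made-up case the corrected value still falls short of 0.95. The point estimate also says nothing about its own precision, which is why large validity and reliability studies are needed.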
Each of these three approaches has its strengths and weaknesses. The third approach is best for a heterogeneous group of subjects, but only if it produces a very high and precise estimate of the true correlation. If the group is homogeneous, or if the true correlation is poor, you will have to use one of the two change-score approaches. Inducing changes with an appropriate treatment may give you a good estimate of the correlation between the change scores, but you end up validating the practical measure only for the treatment you used. The greatest strength of the first approach is that it validates the practical measure for tracking the changes that occur in the normal course of events, but the validation won't be clear-cut if the changes are too small.
When there are several independent variables (regressors) in the prediction equation, an important consideration is ensuring that the typical error is uniform across the range of each regressor (or between subgroups represented by the regressor). Extrapolating from what I said about sample size for comparison of typical errors of measurement, I suggest adding 100 subjects for each extra regressor. (After all, if there are substantial differences in the typical error of the estimate between subgroups, and if the differences are resistant to transformation, you will have to perform separate analyses for each subgroup, each of which will require 100 subjects.) Many published validity studies with multiple regressors have involved several hundred subjects, but I don't think the choice of sample size in those studies was driven by consideration of uniformity of error. Another important consideration is keeping the new-prediction error from increasing substantially. It's easy to show (using Item 3 of the spreadsheet for a subject's true value) that increasing the number of subjects by 50 for each regressor after the first will ensure the new-prediction error is no more than 1% larger than the typical error. No worry there, if you use 100 subjects per regressor.