A New View of Statistics

© 2002 Will G Hopkins

Generalizing to a Population:

I've split complex models up into two main groups: more than one predictor (independent) variable, starting on this page, and repeated-measures models later on. In between there's a short page on more than one dependent variable, and variables of uncertain status. All details about model fitting on the previous few pages apply to all these models.
More Than One Predictor (Independent) Variable
In other words, models like:

weight <= height  sex

This model is called an analysis of covariance (ANCOVA) when one predictor variable is numeric (height) and the other is nominal (sex). Covariance refers to the fact that height "co-varies" with the dependent variable, so height is also known as a covariate. Other names for models with two or more predictor variables include multiple linear regression when all variables are numeric and two-way analysis of variance (or three-way ANOVA etc) when all are nominal. In essence they are all the same. Before we go into each model in detail, let's understand what it means to have more than one predictor variable. Let's stay with the above example.

What the Model Means
It's easiest to think about the model as a tool for predicting weight when you know a person's height and sex. If there IS a relationship between weight and height, then knowing a person's height will tell you something about his or her weight. Similarly, if there IS a relationship between weight and sex, then knowing a person's sex will also allow you to say something about her or his weight. And if you know both height and sex, you'll be able to be even more specific about weight. So that's the question that the overall model poses: what do the predictor variables taken together tell you about the dependent variable?

Stats programs can calculate the usual goodness-of-fit R² for the model, which you can interpret as a measure of how much the independent variables tell you about the dependent variable. In formal terms the R² is the percentage of the variance in the dependent variable explained or predicted by the independent variables. You can also get a test statistic for the full model and its associated p value. You could use the p value to work out confidence limits for the overall R, but otherwise these statistics aren't worth worrying about. Much more important are effects derived from the predictor variables, as I will describe now.
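Here is a minimal sketch of what a stats program does to get the goodness-of-fit statistic for the model above. The data are invented purely for illustration, the 0/1 coding of sex is an assumption, and the helper name `r_squared` is hypothetical; NumPy's least-squares solver stands in for whatever your stats package uses.

```python
import numpy as np

# Hypothetical data, invented for illustration only.
height = np.array([150, 155, 160, 165, 170, 160, 165, 170, 175, 180], dtype=float)  # cm
sex = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # assumed coding: 0 = girl, 1 = boy
weight = np.array([45, 50, 52, 58, 60, 55, 57, 64, 66, 72], dtype=float)  # kg


def r_squared(predictors, y):
    """Fit y on an intercept plus the given predictor columns; return R-squared."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = np.sum(residuals ** 2)          # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)     # total variation in y
    return 1.0 - ss_res / ss_tot


r2_height = r_squared([height], weight)
r2_sex = r_squared([sex], weight)
r2_both = r_squared([height, sex], weight)

print(f"R-squared, height only:    {r2_height:.3f}")
print(f"R-squared, sex only:       {r2_sex:.3f}")
print(f"R-squared, height and sex: {r2_both:.3f}")
```

Multiply by 100 to express each value as a percentage of variance explained. The value for both predictors together can never be lower than the value for either predictor alone, which is the "even more specific about weight" idea in numerical form.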
"Controlling" for Something
The overall relationship is seldom the main focus when you have more than one predictor variable in the statistical model. Instead, these models are used to address a much more important question: what is the effect of something when we take into account something else? It's such an important concept, statisticians have some jargon for it: they talk about controlling for something, adjusting for something, or partialing something out. For example, what is the effect of sex on weight, when we take height into account? Think about it. Boys are heavier than girls, but boys are taller than girls, and taller people are heavier, so if we take into account the difference in height between boys and girls, is there any "real" difference in weight between them? A trivial question here maybe, but not if your outcome variable is an athlete's performance or health, and you control for time spent training before you look at the effect of sex or sport or region or whatever. And of course, it's also important to know about the effect of training on performance or injuries when you take into account differences between sexes or sports or regions.

What do we really mean when we control for height or take height into account in the comparison of the weights of boys and girls? Simply this: if boys and girls had the same height, what would be the difference in weight? And that's exactly what the statistical analysis tells us: it gives us the effect of a predictor with all other predictors held constant. When you do your usual estimates or contrasts for the effects you're interested in, or inspect the solution of the model, the answers you get are automatically adjusted for the presence of all the other predictor variables, as if they were all set to some constant value. For example, you get the difference in the mean weight of boys and girls who have the same (mean) height. Note that the analysis automatically controls for every predictor variable, so you can also address the question: what's the effect of height on weight when you take sex into account? What you get from the analysis for this question is the average slope of the lines for the boys and the girls, as if there were an equal number of boys and girls in the study. I'll delve into these issues more on the next page.
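To see the adjustment at work, here is a small sketch with made-up data. I have built the weights from a known rule (0.8 kg per cm of height, plus 2 kg extra for boys), so we know what the analysis ought to recover. The numbers, the 0/1 coding of sex, and the variable names are all assumptions for illustration.

```python
import numpy as np

# Hypothetical data built so the true answer is known in advance:
# weight depends on height (0.8 kg per cm) plus a direct sex effect of 2 kg.
height = np.array([150, 155, 160, 165, 170, 160, 165, 170, 175, 180], dtype=float)  # cm
sex = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # assumed coding: 0 = girl, 1 = boy
weight = 0.8 * height - 75.0 + 2.0 * sex  # kg

# Unadjusted comparison: simple difference between the mean weights.
raw_diff = weight[sex == 1].mean() - weight[sex == 0].mean()

# Adjusted comparison: fit weight on height and sex together.
# The coefficient of sex is the difference at the same height.
X = np.column_stack([np.ones_like(height), height, sex])
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
intercept, b_height, b_sex = coef

print(f"difference in mean weight, not adjusted: {raw_diff:.1f} kg")
print(f"difference adjusted for height:          {b_sex:.1f} kg")
print(f"slope for height, adjusted for sex:      {b_height:.1f} kg per cm")
```

By construction the boys here are 10 cm taller on average, so the raw difference is 10 kg, of which 8 kg comes via height. The adjusted difference is the 2 kg that remains when boys and girls of the same height are compared.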

Why do the estimates for a given predictor represent the effect of the predictor with all other predictors in the model held constant? I'm not sure of the best way to answer this question. I've satisfied myself by considering that a linear model with two numeric predictor variables represents a plane in 3-D space. The stats program finds the least-squares plane of best fit. With a bit of thought and 3-D doodling I was able to see how the value of the coefficient of each variable is the "slope" for that variable with the other predictor variable held constant.
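One way to put the plane picture into symbols is the following sketch. The symbols are my own labels for illustration: x1 and x2 are the two numeric predictors, and b0, b1 and b2 are the fitted coefficients.

```latex
\hat{y} = b_0 + b_1 x_1 + b_2 x_2
\qquad\Longrightarrow\qquad
\left.\frac{\partial \hat{y}}{\partial x_1}\right|_{x_2\ \text{fixed}} = b_1 ,
\qquad
\left.\frac{\partial \hat{y}}{\partial x_2}\right|_{x_1\ \text{fixed}} = b_2 .
```

Moving one unit along x1 while staying at the same x2 raises the predicted value by b1, whatever value x2 happens to be fixed at. That is the sense in which each coefficient is a slope with the other predictor held constant.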
Mechanism Variables and Confounders
In the above example, suppose we adjust or control for height and find no substantial difference in the mean weight of boys compared with that of girls. Is it therefore reasonable to say that differences in height are responsible for the differences in weight between boys and girls? Yes! In fact, I call height a mechanism variable for the effect of sex on weight: sex affects height, and height affects weight. Any variable (here, height) on the causal path between the predictor (sex) and the dependent (weight) will reduce the effect of the predictor on the dependent when that variable is included as a covariate in a multiple linear regression. So if you see such a reduction, the covariate could be a mechanism variable. If you don't see a reduction, the covariate can't be a mechanism variable. A reduction in the effect is necessary but not sufficient for the covariate to be a mechanism.

Some researchers also call height a confounding variable or a confounder in the relationship between sex and weight. When you use the word confounder to describe height, you are implying that it sort-of makes the difference between boys and girls seem bigger than it really is. Boys are heavier than girls, of course, but height is confounding (or even compounding) the difference. There might be no difference when you take account of height. In fact, girls might even be heavier than boys. Fair enough, but the word confounder should be reserved for a different kind of covariate, one that has or could have a causal effect jointly on the predictor and the dependent. Let's consider another example to make the point clear. Consider the effect of physical activity on health in a cross-section of the population. Do the analysis without regard to the age of the subjects and you will find a really strong relationship. Cool, jobs for exercise professionals! Now control for age and you will find the relationship gets a lot weaker. Curses! It's likely that age is the real cause of most of the relationship between activity and health: age reduces physical activity and age reduces health. We say that the effect of physical activity on health is confounded by age. It's only when we control for age that we see the effect of differences in activity on the health of people of the same age.

What happens in the above example if we make age the predictor variable and physical activity the covariate? Age on its own will have a strong effect on health, but control for physical activity and you will find the relationship gets a lot weaker. So, you would be justified in regarding physical activity as a possible mechanism for the effect of age on health. Wow, that's cool again! Whether the effect of physical activity on health is really causal or just coincidental cannot be resolved with cross-sectional data. You have to do interventions and a repeated-measures analysis to sort that out. I explain how to include a mechanism variable as a covariate in such analyses later on.
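The age example can be sketched with made-up numbers. I have invented a cross-section in which age lowers activity, and both age and activity affect a hypothetical health score. All the values, units and names below (including the helper `slopes`) are assumptions for illustration, not real data.

```python
import numpy as np


def slopes(y, *predictors):
    """Least-squares coefficients of y on an intercept plus predictors (intercept dropped)."""
    X = np.column_stack([np.ones_like(y)] + list(predictors))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


# Hypothetical cross-section: two people at each age, one a little more
# active than the trend for that age and one a little less.
age = np.repeat(np.array([20, 30, 40, 50, 60, 70], dtype=float), 2)  # years
deviation = np.tile(np.array([1.0, -1.0]), 6)
activity = 10.0 - 0.1 * age + deviation      # age reduces activity
health = 100.0 - 0.5 * age + 1.0 * activity  # age and activity both affect health

# Activity as the predictor, with and without age as a covariate.
crude_activity = slopes(health, activity)[0]
adjusted_activity = slopes(health, activity, age)[0]

# Age as the predictor, with and without activity as a covariate.
crude_age = slopes(health, age)[0]
adjusted_age = slopes(health, age, activity)[0]

print(f"effect of activity, age ignored:        {crude_activity:.2f}")
print(f"effect of activity, age controlled:     {adjusted_activity:.2f}")
print(f"effect of age, activity ignored:        {crude_age:.2f}")
print(f"effect of age, activity controlled:     {adjusted_age:.2f}")
```

In this invented data set the effect of activity shrinks a lot once age is in the model (confounding by age), and the effect of age shrinks a little once activity is in the model (activity as a possible mechanism). The same cross-sectional numbers support both readings, which is why the analysis alone can't tell you what is causal.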
Interactions
I now have to introduce you to another fearful challenge: interactions. Let's just have a look first, then we'll climb it in several ways on the next few pages.

Height has an overall effect on weight, and sex has an overall effect on weight. But maybe the effect of height on weight is a bit different for boys than for girls: maybe being taller has a bigger effect on weight for boys than for girls. We show that in the model with the so-called interaction term, which is represented by multiplying height and sex together:

weight <= height  sex  height*sex

This will all make sense when we deal with the specific models. Meanwhile one more bit of jargon. Height and sex are called main effects, to distinguish them from the interaction term. When you have more than two main effects, you can have more than one interaction. When you have all the different combinations of the effects, including the interactions, you have what's called a full model.
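As a rough sketch of what the interaction term does, here is the model with height, sex and height*sex fitted to made-up data. I have built the weights so that girls gain 0.6 kg per cm and boys 0.9 kg per cm. The data, the 0/1 coding of sex and the variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical data: the slope of weight on height differs between the sexes.
height = np.array([150, 155, 160, 165, 170, 160, 165, 170, 175, 180], dtype=float)  # cm
sex = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)  # assumed coding: 0 = girl, 1 = boy
weight = np.where(sex == 1, 0.9 * height - 90.0, 0.6 * height - 45.0)  # kg

# Design matrix: intercept, the two main effects, and their product
# (the interaction term).
X = np.column_stack([np.ones_like(height), height, sex, height * sex])
coef, *_ = np.linalg.lstsq(X, weight, rcond=None)
b0, b_height, b_sex, b_interaction = coef

# With 0/1 coding, the height coefficient is the slope for girls and the
# interaction coefficient is the extra slope for boys.
slope_girls = b_height
slope_boys = b_height + b_interaction

print(f"slope for girls: {slope_girls:.2f} kg per cm")
print(f"slope for boys:  {slope_boys:.2f} kg per cm")
print(f"interaction (difference in slopes): {b_interaction:.2f} kg per cm")
```

With this coding the interaction coefficient is simply the difference between the two slopes. A different coding of sex would change how the individual coefficients read, but not the fitted lines.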

A Warning!
There are several traps for the unwary when you have more than one predictor variable. Read the following pages carefully or you might jump to wrong conclusions with your data.

Last updated 22 June 02