SPORTSCIENCE |
|
Perspectives / Research Resources |
|
Understanding Statistics by Using Spreadsheets to Generate and Analyze
Samples
Will G
Sportscience 11, 23-36, 2007
(sportsci.org/2007/wghstats.htm)
Institute of Sport and Recreation Research, AUT University, Auckland 0627, New
Zealand. Email.
Reviewers: Bernadette Murphy, Faculty of Health Sciences, University of Ontario
Institute of Technology, North Oshawa, Ontario L1H 7L7; Patria Hume, Institute
of Sport and Recreation Research, AUT University, Auckland 0627, New Zealand.
Reliability and Error of Measurement
Validity and Error of the Estimate
Comparison of Means in Two Groups
Means in Two Groups Plus a Predictor
When I began to use an advanced statistical package, I found that I could come to terms with the output by analyzing data with an obvious effect that I had made up. This exercise also gave me a deeper understanding of statistics. In this article I explain how you can go through a similar process with spreadsheets.
In the beginning I used my imagination to get numbers representing an effect, such as a correlation between weight and height or a difference in mean IQ between boys and girls. I soon realized that it would be better to use the stats package itself to generate a known correlation or a known difference between means. If the analysis reproduced the known effect, I could be more confident that I was using the package correctly.
What did I mean by a known effect? There is a true value for an effect in a population, but when you do a study, you get only a sample value, which is never exactly the true value. So when you generate a sample containing an effect, the trick is to reproduce the behavior of real samples: the known effect is the true or population value, and you make up a sample that gives something like the true value. If you were to make up many samples, every sample would give a different value scattered around the true value. On average, the sample value would be the true value. Or you could make up a very large sample, in which case the sample value should be very close to the true value.
Of course, in real life you can never know the true value of an effect in a population, so when you analyze real data, you can't check if you've got the right answer. But with made-up data, you can. And it's not simply a matter of making lots of samples or a very large sample. With any statistical analysis of a sample, you have to derive the confidence interval for the effect. Do this with your made-up sample and you should find that the confidence interval includes the right answer–except that there's a small chance it won't: 10%, if you choose 90% for the level of your confidence interval. If you are confused at this point, you won't be when you build up some simple data from scratch. All these ideas will emerge naturally.
Let's now see how to make samples with a spreadsheet. In the following sections I will work my way through the accompanying workbook, which has a spreadsheet for each section. At the end of each section I list the concepts for which you should have developed an implicit understanding. Within each spreadsheet I have filled cells with the colors of the rainbow to represent the sequence in which the cells were created: red first through to violet and sometimes to grey and white. (My apologies to the color blind.) You'll probably learn more by recreating each cell's formula in an adjacent blank cell or in a new blank spreadsheet rather than by simply clicking your way through the filled cells.
At the heart of sampling is the notion of a random sample of a population. In Excel you access randomness via
the function
So, put or type =RAND() into a cell
(in upper or lower case, followed by pressing Enter, which I won't
bother to state again) and you will get a number between 0 and 1, such as
0.230592. That's only the first six decimal places of a number that has 15
decimal places, but we don't have to worry about all those extra digits. Each
digit is chosen randomly from 0 through 9. The result is a random number between
0 and 1. Put =
A number that ranges between 0 and 1 can be interpreted as a probability: values closer to 0
represent more unlikely, and values closer to 1 represent more likely. You can
interpret probability as the proportion of times you expect something to
happen. So let's have a bit of fun using IF and
Concepts: random, probability.
I will start with a sample that gives nothing more than some simple statistics–the mean and standard deviation (SD)–for a single continuous variable. Excel has a function, NORMSINV, that converts a probability into a value of a normally distributed variable drawn from a population with a mean value of 0 and an SD of 1. You don't have to understand what NORMSINV does; just give NORMSINV the probability generated by RAND, et voila, a value of a variable! The rest of this paragraph is for the clever or curious. NORM and S stand for the standard unit normal probability distribution, and INV stands for the inverse thereof. In fact, it is the inverse of the cumulative normal probability distribution: =NORMSINV(p) interprets p as a cumulative probability by returning a value of a normally distributed variable with mean of 0 and SD of 1, such that p is the proportion of the population with values less than the value it returns. (Draw a normal distribution and shade the area under the curve to the left of a value: that's p. Also, p here is not the p value of null-hypothesis tests.) Small values of p give large negative values of the unit normally distributed variable, p=0.5 gives the value zero, and values of p close to 1 give large positive values of the variable. p=0.025 and p=0.975 give -1.96 and 1.96, which you will recognize if you've done stats courses.
Let's generate that sample.
Use the layout in the spreadsheet as a guide to label and make a column of
Next, decide on a mean and SD (e.g., 400 and 50, as shown). The mean
is a measure of where the middle of the population falls, and the SD is a
measure of the scatter of the values about the middle. Generate an observed
value of peak power from these and the NORMSINV value using this equation: peak
power = mean + SD*(the NORMSINV value). Use the F4 key to freeze the cells for
mean and SD, then copy the cell down to generate your sample of 100 or so. Get
a feel for the way the
Each value we have generated is called an observation, short for the observed value of a variable. Notice that we generated the observations with only three things: a mean, a standard deviation, and a random number. Although the observations for different subjects are related by the mean and SD, the random number makes it impossible to say anything about the specific value of any one observation, given the value of any other observation. The observations are therefore said to be independent. Most analyses are based on the assumption that the values of each subject are independent of the values of all other subjects. Repeated measurements on the same subjects in a data set are, of course, not independent: you get a similar value whenever you re-measure the same subject, as we will see in the next spreadsheet.
Display a frequency histogram of the values as shown on the spreadsheet. (See instructions in the spreadsheet for this unbelievably clunky aspect of Excel.) With bigger and bigger sample sizes, the frequency histogram looks more and more like a classic normal distribution. What is a normal distribution, anyway? Just another one of those incredible manifestations of the laws of nature, which here make most variables have the same expected shape for their sampling distribution: their distribution of values. What's also incredible is that someone worked out its equation. Of course, real variables do not have perfect normal distributions, but they don't have to for statistical analyses.
Now for something more advanced... Derive the sample mean, sample SD and sample size using the AVERAGE, STDEV and COUNT functions. Click the cursor in a blank cell somewhere, then hit Ctrl-D a few times and watch what happens to the sample values. What you are seeing is sampling variation. Notice how the sample values hover around the population value.
Finally, some really advanced statistics… Create a confidence interval (or confidence limits) for the mean and for the SD. A confidence interval represents an inference or conclusion about the population or huge-sample value of a statistic. When you publish a 90% (say) confidence interval for a statistic, you are saying to the world: "I did this study with a sample, so I can't be sure about the true or population value, but if I were to do the study with a huge sample, the value of the statistic would probably fall in here; by probably I mean there's a 90% chance."
The confidence interval is defined by the lower and upper confidence limits. Generate these for the mean using the TINV function and for the SD using the CHIINV function. These functions are based on the shape of the sampling distribution of values you get for the mean and the SD. The formula for the confidence limits for the mean is simple enough: mean ± t*SD/√(sample size), where t is the value of a t statistic (given by TINV) appropriate for the level of confidence and the sample size. The formula for the confidence limits for the SD is a bit more complicated, so get the formulae into your spreadsheet by copying my cells. Check that the formulae point to the right cells by double clicking and noting where the colored boxes fall. Drag the colored boxes to the right cells if they aren't right. Click on the cells below ±CL and ´¤¸CL to see how I derived the way of showing confidence limits as a single plus-or-minus number for the uncertainty in the mean and a single times-or-divide number for the uncertainty in the SD.
We seldom use confidence limits for a simple sample mean, but confidence limits for some kinds of SD are becoming more common. Regardless, I've got you to do some confidence limits here to show you what happens when you draw repeated samples. Do your population values of the mean and SD fall between their confidence limits? Hit Ctrl-D repeatedly and check each time. Do it until you get a sense that, yes, maybe one time in 10 the true limits do not enclose the true value. That's what a 90% confidence interval means: there's a 90% chance that it encloses the true value.
Play with the data. Try a 50% confidence interval to get a quicker sense of whether it really is 50%. Make the sample a lot smaller or bigger as follows: click the cursor in the far left of a row (not the first row of the sample) to highlight it, drag down to highlight multiple rows, then either right-click Delete… or right-click Copy and right-click Insert Copied Cells. (Do this operation below the level of the figure, to avoid problems.) Notice the effect of a change in sample size on the sampling variation and on the width of the confidence interval, but notice also with repeated use of Ctrl-D that the confidence interval stays true to its level.
Concepts: independent, observation, normal distribution, sampling distribution, population and sample mean and standard deviation, sampling variation, confidence interval or limits, inference.
Measurement error and the concept of reliability enter naturally at
this point. All we do here is tweak the values of a variable to make them more
like the values you would measure in reality. The tweaking consists of adding
two terms: another
Notice how each subject's second measurement is similar to his or her first. If you were to lump all the observations for both trials together, the observations would no longer all be independent: given one observation, you know there is a similar value for another observation, the observation from the same subject. Repeated measurements on the same subjects produce repeated measures (here, the variables represent the values in each trial) and each set of repeated measurements makes a cluster of observations (here, the pairs of measurements, representing a cluster for each subject).
The spreadsheet demonstrates that random measurement error can be estimated simply by deriving the SD for a single subject's repeated measurements: that's the usual meaning of measurement error. If you thought there was a shift in the mean between the first two or three trials, you would leave those trials out; including them would increase the apparent random error, which you can show in the spreadsheet by comparing the SD for Trials 3 to 10 with that for Trials 1 to 10. Increase the change in the mean in Trial 1 to make the effect more obvious, and use repeated sampling with Ctrl-D.
Most often we don't have enough repeated measurements to estimate the measurement error for each subject, so we assume it is the same for every subject and derive it by dividing the SD of the change scores by √2. This approach to estimation of error also neatly gets around the problem of any systematic change in the mean contributing to the error: by using change scores you estimate the change in the mean and thereby stop it contributing to the estimate of random error.
The √2 comes about as follows: there is independent random
error on both trials; when you subtract (or add) two variables with independent
random errors, the random error of the result is obtained by adding the
variances (squares of the SDs) and taking the square root; with equal errors,
the SD of the change scores is therefore √2 times the error we are trying
to estimate. Try using NORMSINV(
I have included another measure of reliability, the test-retest correlation between two trials. The CORREL function provides the value as the usual (Pearson) correlation coefficient. The retest correlation has to be really high–in excess of 0.96–for a measure to be good for assessing individuals, as explained in an article/slideshow (Hopkins, 2004), An intraclass correlation coefficient can also be calculated for reliability studies, but for two trials the Pearson and intraclass give practically identical values. When there are three trials or more, the intraclass correlation coefficient is a kind of average of all possible pairs of correlations between trials. Generally you do not need the intraclass correlation, because you should analyze reliability for consecutive pairwise trials to address the problem of changes in the mean and in the error of measurement with repeated testing. If any averaging is to be done, it should be the average of consecutive pairwise estimates, which usually have the same time between trials.
The various reliability statistics and their confidence limits can be generated with a reliability spreadsheet available at my stats site. Practice using that spreadsheet by copying in the numbers generated here.
Finally and most importantly, this spreadsheet embodies the concepts of linear modeling. A model is an equation linking predictor variables to a dependent variable. In a linear model, the predictors are simply added together; indeed, additive modeling would be a better term. I have produced each subject's values in each trial using an additive model: a true value for each subject that consists of a constant value (the overall mean) PLUS a value that differs randomly between subjects (characterized by the between-subject SD) PLUS a constant that is the same for each subject within a trial but differs between trials (the systematic error) PLUS an error that differs randomly for every subject (characterized by the random error SD). In this model the overall mean and the constant within a trial are fixed effects (because they don't change between some or all subjects) and we estimate them as means or mean differences or changes; the terms that differ from subject to subject or even within a subject are random effects (because they change randomly) and we estimate them as SDs.
Statistical analysis is simply an attempt to recover part or all of the model that generated the data. Here we create the model, and we can check if the analysis is working properly. In reality Nature creates the model, and we don't know what it is. We have to assume a model, then try to recover it.
In the past, stats packages focused on the fixed-effects part of the model and used a procedure called analysis of variance to estimate these effects. Only one random effect was permissible, although strategies and patches were devised to allow for more. Advanced stats packages allow you to specify and estimate all sorts of random effects, and the approach is called mixed modeling to reflect the mix of fixed and random. The computational procedure, estimation by maximum likelihood, is not easy to set up in Excel. The analyses in this and my other spreadsheets are therefore based on ANOVA, but I use strategies to allow for different random effects where required.
Concepts: random error, systematic error, repeated measures, clusters, test-retest correlation, linear model, predictor and dependent variables, fixed effect, random effect, mixed model.
In the previous section I developed a linear model by adding various terms to make the value of a dependent variable. I then used an analysis that attempted to recover the terms from a sample. You might be surprised to learn that almost all statistical analyses are based on recovering terms from linear models. Are most effects in the real world additive? No, so we have to find ways to make real effects into additive effects if we want to use linear models.
A large number of effects in biomedical science–I would say the majority–are multiplicative; that is, you should think about what's going on in terms of percent or factor differences or changes. For example, a treatment increases output power by 5% or a factor of 1.05, regardless of the individual's initial power output. Enter the logarithmic transformation. Logs turn multiplication into addition, so multiplicative effects become additive effects that can be analyzed with all the procedures developed for linear models. When you've done the analysis, you back-transform the outcome, either into a percent or a factor.
The log of a number is the power to which you raise a number called the base, which explains why multiplication becomes addition when you take logs. You can use any base for log transformation. Excel has logs for two bases: the LOG function uses base 10 (LOG(10) = 1, LOG(100) = 2,…), while LN uses exponential e and is known as the natural log. Use LN, because the back transformation is available as the function EXP (e to the power of…). If an effect or SD is y after LN transformation, the back transformation to a factor is EXP(y) and to a percent it is 100*(EXP(y)-1). I use 100*LN for the transformation, because the values of effects and SD in transformed units are already approximately percents, which sometimes makes it easy to see what's going on without having to back transform. The approximation is practically perfect when effects are <5%. The back-transformations for 100*LN are EXP(y/100) and 100*(EXP(y/100)-1).
The spreadsheet demonstrates the use of logs with simple numbers, then builds up some data for the effect of a treatment that has the same multiplicative effect for a few subjects. You will see that you get the same answer for the mean effect in the sample, whether you use the raw numbers or their logs (LN or 100*LN), but there is a big difference for the SD of the effect: the zero for the log-transformed variable reflects the fact that the effect is set up as a constant factor. We say that the effect is uniform across subjects after log transformation or that it is a uniform factor effect in the original data.
For this idealized example I did not include any error of measurement. When effects are multiplicative, the errors of measurement are also usually uniformly multiplicative; that is, you should think about them as constant percent or factor errors. Log transformation converts such errors into uniform additive errors. Analyses are not trustworthy when the errors are not uniform, so log transformation is important.
Concepts: percent or factor effect, log transformation, back transformation, uniform effect, uniform error.
The next spreadsheet generates data for another reliability study, this time with a variable that requires log transformation for linear modeling. You therefore assume that the changes in the mean and the errors of measurement happen as percent or factor effects. For variables that behave in this manner, the differences between subjects are also usually best expressed as a percent or factor SD.
A percent SD is also known as a coefficient of variation, or CV, and it is usually defined as the SD of the observed values expressed as a percent of the mean. I have converted the between-subject CV to a factor SD (1+CV/100), to emphasize that percent variations and effects are really factor variations and effects. I often use "´¤¸" in front of a factor SD, in the same way the we use "±" in front of the usual raw SD. Thus, 400 ´¤¸ 1.3 indicates that the values range typically from 400 ¸ 1.3 to 400 ´ 1.3. The values often fall outside these limits, which are "typical".
Should you report variation as a ± CV or as a ´¤¸ factor SD? There is no difference between +30% and ´1.3, but if you subtract 30% off a number you get something different from ¸1.3. The difference is negligible when the percent is <10, but -100% implies the resulting number is 0 (which is nonsense), whereas ¸2 implies the resulting number is half the mean (which is correct). For this reason I advocate showing CV only for CV of up to 20% or so, and I always keep in mind that what is implied by ±CV is really ´¤¸(1+CV/100). CV are definitely preferable to factors when the CV are small: you can understand ±5.3% and ±0.86%, but it's hard to get a sense of magnitude from the corresponding factor SD of ´¤¸1.053 and ´¤¸1.0086.
Now let's generate a sample, which turns out to be a bit
complicated. First, get true values by logging the mean and factor SD and
combining with a random value of a unit-normally distributed number given by
NORMSINV(RAND()), then back-transform. Next generate observed values by logging
the true values and the random and systematic errors expressed as factors,
combining with another random value given by NORMSINV(
I have called your attention in the spreadsheet to a discrepancy between the means for each trial calculated directly from the raw data vs the back-transformed means of the log-transformed data. The back-transformed mean gives an unbiased estimate of the population mean we started with, whereas the raw mean consistently overestimates the population mean. Similarly, the CV between subjects calculated from the logs is an unbiased estimate of the CV we used to generate the data, whereas the CV calculated by dividing the raw mean by the raw SD is biased low. The bias isn't noticeable when the between-subject CV is <20% or so, but it is easy to see with a CV of 100% or more. Verify these assertions by doing repeated sampling with Ctrl-D.
You can make the differences more obvious by increasing the sample size, which will reduce the sampling variation. Here's a trick for increasing the sample size: click and drag in the far left of the spreadsheet to highlight the entire rows from the second subject (Ariel) down to the last subject (Kade); copy, then right-click and Insert Copied Cells. Notice that the sample size increases to 39. Repeat this operation as often as you like with as many rows as you like, but never highlight the first row nor highlight beyond the last row.
The estimates derived from the logs are obviously correct, but there's a sense in which the raw mean and SD are also correct–these are, after all, the actual mean and the actual SD of the set of numbers. Even so, the log-derived mean and CV give a better idea of the way in which the original numbers are distributed. When a variable requires log transformation, the simple statistics summarizing the variable should be the back-transformed mean and the back-transformed CV or factor SD.
I've been focusing on the mean and between-subject SD, but the main point of this exercise was to derive the error of measurement as a CV or factor SD. Use Ctrl-D to verify that the sample values of the changes in the mean and the error of measurement hover around the population values in an apparently unbiased way. I haven't bothered with calculations for changes in mean and errors derived directly from the raw numbers, because it's not appropriate to do them that way with these data. However, with real data you have to decide whether to use raw or log-transformed values. You make the decision by viewing the scatter of change scores between two trials plotted on the Y axis against mean scores of the two trials for each subject. The SD of the scatter in the vertical direction is √2 times the error of measurement, so if the scatter is reasonably uniform across the range of mean values, the error is reasonably uniform. When a variable needs log transformation for the reliability analysis, the scatter of change scores of the raw values gets bigger for bigger mean values, but the scatter of change scores of the log values looks reasonably uniform.
Non-uniform scatter, or heteroscedasticity, is usually a problem only when the between-subject CV is ~20% or more. Anything less and you can't see any difference in the scatter of the raw or log change scores, and you get little difference in the estimate of the CV via log transformation or via the raw data. You can explore these assertions by generating data and putting them into the spreadsheet for analysis of reliability, where all calculations and plots are generated automatically.
Concepts: percent variation, coefficient of variation, factor SD, ±, ´¤¸, heteroscedasticity.
The next two spreadsheets exemplify another measurement concept, validity. In a validity study you establish the relationship between the values of a practical or new measure and the concurrent true or criterion values. The relationship is a calibration equation for adjusting or converting the practical measure into the criterion. The analysis also provides an estimate of the remaining or residual random error, known as the standard error of the estimate: that is, the error in the estimate of the criterion value from a practical value.
The relationship between the criterion and practical measures is almost always investigated with a simple linear model, a straight line. As with reliability, some measures have percent or factor relationships and errors and so need log transformation before modeling. The spreadsheets show an example of each.
To generate the data, I started with a sample of subjects having values of a practical measure drawn from a population of normally distributed values with a chosen mean and SD. I then generated the criterion values using the equation Criterion= a + b*Practical + error, where a is an intercept, b is a slope, and error is a random number drawn from a normal distribution with mean zero and SD equal to the standard error of the estimate. This equation gives the impression that the error is all in the criterion variable, whereas most of the error could be coming from the practical. Nevertheless, the use of an equation in this form results in an unbiased estimate of the criterion from the practical, and the resulting standard error of the estimate is also an unbiased estimate of the random error in the prediction. Failure to understand this issue has resulted in generations of researchers thinking that other kinds of modeling are required when there is error in a predictor variable.
Once you have specified the means and SD of the two variables and the standard error of the estimate, you can calculate directly the correlation between the variables. The formula for the population correlation is a bit complex, but it amounts to the square root of the fraction of variance in the Y variable explained by the X variable. The sample correlation has the same underlying formula, but you can get the value using the CORREL function. It is also possible to generate data for a straight line by specifying the population correlation rather than the population standard error of the estimate, as explained in a later section.
The spreadsheets include graphs or scatter plots of the two variables, with the best-fitting straight lines. Lines of identity on the graphs, and estimates of bias in the practical measures, make sense only when the practical is in the same units as the criterion. You can develop a calibration equation for dissimilar measures, in which case the correlation coefficient can be a useful way to assess the practical measure. A correlation in excess of 0.98 represents the start of really good measures for assessing individuals, also explained in that same slideshow (Hopkins, 2004), but measures with lower validity correlations are still useful for selection of groups of people or for providing additional evidence for making decisions about an individual patient or client.
The spreadsheet for validity with variables having a percent or factor relationship shows graphs for the raw data, for the raw data with log scales, and for the log-transformed data. Having drawn the graph for the raw data, the best way to draw the others is to click-Ctrl-drag on it to make an identical copy, then click on the points on the copy; the data will be outlined in colored boxes, and you can simply drag the boxes to the new data, if required. The straight line in the raw graph becomes a curve when you change the axes to log scales, so double click on it to get the Format Trendline window, click on the Type tab and change the type to Power.
These graphs provide a beautiful illustration of non-uniformity of error with raw data that becomes uniform with log transformation or equivalently, log axes. Play with the values of the between-subject SD and the standard error of the estimate. You will find that non-uniformity becomes noticeable only when the between-subject SD and standard error of the estimate are greater than about 20%. (Non-uniformity becomes less noticeable the smaller the sample size.) To see non-uniformity with small values of the standard error of the estimate you need to view a plot of residuals vs predicteds, the last step with this spreadsheet. The residual is the difference between the observed value of the dependent variable (here the criterion) and the value of the dependent value predicted by the observed value. As such, the residuals represent the scatter about the line in the vertical direction, and the standard deviation of the residuals is actually the standard error of the estimate. Uniformity of the residuals is probably the most important assumptions in analysis of linear models, so inspection of a plot of residuals vs predicteds is important. Copy the raw data into the validity spreadsheet at newstats.org for analyses with confidence limits and plots of residuals vs predicteds.
Concepts: validity, calibration equation, standard error of the estimate, residuals vs predicteds.
One of the most common statistical analyses is a comparison of the means of two groups, such as the mean of some variable for girls vs boys. The groups are usually independent; that is, knowing the value of any individual in one group gives you no clue about the value of an individual in the other group. Lack of independence could occur if there were clusters in the sample, such as boys and girls drawn from school classes, and your analysis would have to reflect that. Here we will generate data for a simple comparison of means of independent groups and do a simple analysis for statistical significance of the difference using the so-called p value. I'd prefer not to teach you about the p value and statistical significance, but it might be another 50 years before researchers are finally weaned off this confusing approach to making inferences about true values, so here goes…
The p in p value stands for the probability of getting any value of the effect statistic more extreme than your observed value (whether positive or negative), if in reality there was no effect. A small p value therefore implies that it is unlikely there is no effect, although p does not actually represent the probability of no effect. (The probability of exactly no effect is actually zero, which is one of several fundamental problems with use of statistical significance.) The convention is to treat p<0.05 as a threshold for deciding that there is an effect, and the effect is then said to be statistically significant at the .05 or 5% level.
Calculate the p value for the comparison of the means in the spreadsheet using the TTEST function. A t test is a procedure for generating a p value based on the sampling distribution of the difference between two sample means. The sampling distribution is a t distribution, which is like a normal distribution, except that it isn't quite a normal distribution, because you have to use the sample standard deviation to define its spread, and the additional uncertainty flattens the distribution a little. Follow the prompts to select the two arrays of numbers, then always choose "2" for the 2-tailed option (because the p is for more extreme positive or negative values) and "3" for unequal variances (because the scatter of values in the two groups could be substantially different, and the sample size isn't usually big enough to decide for sure, and even if it were you should still assume unequal variances). If the resulting p value is really small, it will display in decimal exponential notation, for example 1.64E-06 (which means 0.00000164).
To understand the p value better, make the population mean values equal. Now hit Ctrl-D repeatedly and watch the p value. On average, one time in 10 it will be less than 0.10, and one time in 20 it will be less than 0.05. It follows that, if there was really no difference in the means, one time in 10 or 20 you would nevertheless declare there was a statistically significant difference at the 10% or 5% level. But what can you say about the real difference? That's where statistical significance is deficient and where confidence intervals are so much better.
At this point you can get some practice using a spreadsheet I devised to calculate confidence intervals from a p value. See an accompanying article in this issue of Sportscience (Hopkins, 2007a)