RegressIt: free Excel add-in for
linear regression and multivariate data analysis

Linear
regression analysis is the most widely used of all statistical techniques: it
is the study of *linear*, *additive *relationships
between variables, usually under an assumption of *independently and identically normally
distributed errors*.

The first
thing you ought to know about linear regression is how the strange term *regression*
came to be applied to the subject of linear statistical models. This type of
predictive model was first studied in depth by a 19th-Century scientist, Sir Francis Galton.
Galton was a self-taught naturalist, anthropologist, astronomer, and
statistician--and a real-life Indiana Jones character. He was famous for his
explorations, and he wrote a best-selling book on how to survive in the
wilderness entitled "The Art of Travel: Shifts and Contrivances Available in
Wild Places," and its sequel, "The Art of *Rough* Travel: From the
Practical to the Peculiar." They are still in print and still considered
as useful resources. They provide
many handy hints for staying alive--such as how to treat spear-wounds or
extract your horse from quicksand--and introduced the concept of the sleeping
bag to the Western World.

Galton was
a pioneer in the application of statistical methods to measurements in many branches
of science, and in studying data on relative sizes of parents and their
offspring in various species of plants and animals, he observed the following
phenomenon: a larger-than-average parent tends to produce a larger-than-average
child, but the child is likely to be *less* large than the parent in terms
of its relative position within its *own*
generation. Thus, for example, if the parent's size is** x** standard
deviations from the mean within its own generation, then you should predict
that the child's size will be** rx** (r times x) standard deviations from
the mean within the set of children of those parents, where **r** is a number *less than 1 in
magnitude*. (**r** is what will be defined below as the *correlation *between
the size of the parent and the size of the child.) The same is true of
virtually *any* physical measurement (and in the case of humans, most
measurements of cognitive and physical ability) that can be performed on
parents and their offspring. Here is
the first published picture of a regression line illustrating this effect, from
a lecture presented by Galton in 1877:

The R
symbol on this chart (whose value is 0.33) denotes the slope coefficient, not
the correlation, although the two are the same if both populations have the
same standard deviation, as will be shown below.

Galton
termed this phenomenon a *regression
towards mediocrity*, which in modern terms is a ** regression to the mean.** To a naïve observer this might
suggest that later generations are going to exhibit less variability--literally
more mediocrity--than earlier ones, but that is not case. It is a purely statistical phenomenon.
Unless every child is

Regression
to the mean is an inescapable fact of life. Your children can be *expected*
to be less exceptional (for better or worse) than you are. Your score on a
final exam in a course can be *expected* to be less good (or bad) than
your score on the midterm exam. A baseball player's batting average in the
second half of the season can be *expected* to be closer to the mean (for
all players) than his batting average in the first half of the season. And so
on. The key word here is "expected." This does not mean it's *certain*
that regression to the mean will occur, but that's the way to bet! More
precisely, that's the way to bet if you wish to minimize squared error.

We have
already seen a suggestion of regression-to-the-mean in some of the time series
forecasting models we have studied: plots of forecasts tend to be* smoother*--i.e.,
they exhibit less variability--than the plots of the original data. This is not
true of random walk models, but it is generally true of smoothing models and
other models in which coefficients other than a constant term are estimated.

The
intuitive explanation for the regression effect is simple: the thing we are
trying to predict usually consists of a predictable component
("signal") and a statistically independent *unpredictable*
component ("noise"). The best we can hope to do is to predict (only)
that part of the variability which is due to the signal. Hence our forecasts
will tend to exhibit less variability than the actual values, which implies a
regression to the mean.

Another
way to think of the regression effect is in terms of sampling bias. Suppose we
select a sample of professional athletes whose performance was much better than
average (or students whose grades were much better than average) in the first
half of the year. Presumably their numbers were way above average in part
because they were "good" and in part because they were "lucky."
The fact that they did so well in the first half of the year makes it probable
that *both* their ability and their luck were better than average during
that period. In the second half of the year they may be just as good, but they
probably will not be as lucky. So we should predict that in the second half
their performance will be closer to the mean. But meanwhile, many players whose
performance was merely average in the first half of the season will not be so
average in the second half, for better or worse, so the variability in
performance among all players will be just as large in the second half as in
the first. A nice discussion of
this effect in the broader context of social science research can be found here.

Now, why
should we assume that relationships between variables are ** linear**?

- Because
linear relationships are the
*simplest non-trivial relationships*that can be imagined (hence the easiest to work with), and.....

- Because
the "true" relationships between our variables are often at
least
*approximately*linear over the range of values that are of interest to us, and...

- Even
if they're not, we can often
*transform*the variables in such a way as to linearize the relationships.

This is a strong assumption, and the first step in
regression modeling should be to look at scatterplots of the variables (and in
the case of time series data, plots of the variables vs. time), to make sure it
is reasonable a priori. And after
fitting a model, plots of the errors should be studied to see if there are unexplained
nonlinear patterns. This is especially important when the goal is to make
predictions for scenarios outside the range of the historical data, where
departures from perfect linearity are likely to have the biggest effect. If you see evidence of nonlinear
relationships, it is possible (though not guaranteed) that transformations of
variables will straighten them out in a way that will yield useful inferences
and predictions via linear regression.

And why should we assume that the effects of different
independent variables on the expected value of the dependent variable are ** additive**? This is a

Many users just throw a lot of independent variables into
the model without thinking carefully about this issue, as if their software
will automatically figure out exactly how they are related. It
won’t! Even "automatic"
model-selection methods (e.g., stepwise regression) require you to have a good
understanding of your own data and to use a guiding hand in the analysis. They work only with the variables they
are given, in the form that they are given, and then they look only for linear,
additive patterns among them in the context of each other. * You* need to collect the
relevant data, clean it up if necessary, perform descriptive analysis to look
for patterns before fitting any models, and study the diagnostic tests of model
assumptions afterward (especially statistics and plots of the errors), and you
should also try to apply the appropriate economic or physical reasoning. Here
too, it is possible (but not guaranteed) that transformations of variables or
the inclusion of interaction terms might separate their effects into an
additive form, if they do not have such a form to begin with, but this requires
some thought and effort on your part.

And why
should we assume the *errors *of linear
models are ** independently and identically normally distributed**?

1. This assumption is often justified by appeal to the *central
limit theorem *of statistics, which states that the *sum** or average* of a sufficiently large
number of independent random variables--whatever their individual
distributions--approaches a normal distribution. Much data in business and
economics and engineering and the natural sciences is obtained by adding or
averaging numerical measurements performed on many different persons or
products or locations or time intervals. Insofar as the activities that generate
the measurements may occur somewhat randomly and somewhat independently, we
might expect the variations in the totals or averages to be somewhat normally
distributed.

2. It is (again) mathematically convenient: it implies that the
optimal coefficient estimates for a linear model are those that minimize the *mean
squared error* (which are easily calculated), and it justifies the use of a
host of statistical tests based on the normal family of distributions. (This
family includes the t distribution, the F distribution, and the Chi-square
distribution.)

3. Even if the "true" error process is not normal in
terms of the original units of the data, it may be possible to transform the
data so that your model's prediction errors are approximately normal.

But here too caution must be exercised. Even if the unexplained variations in
the dependent variable are approximately normally distributed, it is not
guaranteed that they will also be*
identically* normally distributed for all values of the independent
variables. Perhaps the unexplained
variations are larger under some conditions than others, a condition known as "heteroscedasticity". For example, if the dependent variable consists of daily
or monthly total sales, there are probably significant day-of-week patterns or
seasonal patterns. In such cases
the variance of the total will be larger on days or in seasons with greater
business activity--another consequence of the central limit theorem. (Variable
transformations such as logging and/or seasonal adjustment are often used to
deal with this problem.) It is also
not guaranteed that the random variations will be statistically
independent. This is an especially
important question when the data consists of *time series*: if the
model is not correctly specified, it is possible that consecutive errors (or
errors separated by some other number of periods) will have a systematic
tendency to have the same sign or a systematic tendency to have opposite signs,
a phenomenon known as "autocorrelation"
or "serial correlation".

A very important
special case is that of ** stock price data**, in which
percentage changes rather than absolute changes tend to be normally
distributed. This implies that over
moderate to large time scales, movements in stock prices are

Because the assumptions of linear regression (linear,
additive relationships with i.i.d. normally distributed errors) are so strong,
it is very important to test their validity when fitting models, a topic
discussed in more detail on the
testing-model-assumptions page, and be alert to the possibility that you
may need more or better data to accomplish your objectives. ("You
can’t get blood from a stone.") All too often, naïve users of
regression analysis view it as a black box that can automatically predict any
given variable from any other variables that are fed into it, when in fact a
regression model is a very special and very transparent kind of prediction box,
whose inner mechanism needs to be compared with reality in each situation where
it is applied.

A **variable**
is, by definition, *a quantity that varies*. (If it didn't vary, it would
be a constant, not a variable.) In fitting statistical models in which some
variables are used to predict others, what we hope to find is that the
different variables do *not* vary *independently* (in a statistical
sense), but that they tend to vary *together*.

In
particular, when fitting *linear* models, we hope to find that one
variable (say, Y) is varying as a *straight-line* function of another
variable (say, X). In other words, if all other possibly-relevant variables
could be held fixed, we would hope to find the *graph* of Y versus X to be
a straight line (apart from the inevitable random errors or "noise").

A measure
of the absolute amount of "variability" in a variable is (naturally)
its **variance**, which is defined as its *average squared deviation from
its own mean*. Equivalently, we can measure variability in terms of the **standard
deviation**, which is defined as the square root of the variance. The
standard deviation has the advantage that it is measured in the same units as
the original variable, rather than squared units.

Our task
in predicting Y might be described as that of "explaining" some or
all of its variance--i.e., *why*, or under what conditions, does it
deviate from its mean? Why is it not constant? That is, we would like to be
able to improve on the "naive" predictive model: Ý(t) =
CONSTANT, in which the best value for the constant is presumably the historical
mean of Y. More precisely, *we hope to find a model whose prediction errors
are smaller, in a mean square sense, than the deviations of the original
variable from its mean*.

In using *linear*
models for prediction, it turns out very conveniently that the *only*
statistics of interest (at least for purposes of estimating coefficients) are
the mean and variance of each variable and the **correlation coefficient***
*between each pair of variables. The coefficient of correlation between X
and Y is commonly denoted by **r _{XY}**.

The
correlation coefficient between two variables is a statistic that measures the *strength
of the linear relationship* between them, on a relative (i.e., unitless)
scale of -1 to +1. That is, it measures the extent to which a linear model can
be used to predict the deviation of one variable from its mean given knowledge
of the other's deviation from its mean at the same point in time.

The
correlation coefficient is most easily computed if we first **standardize**
each of the variables--i.e., express it in units of standard deviations from
its own mean. The standardized value of X will be denoted here by X^{STD},
and the value of X^{STD} in period t is defined as:

**X ^{STD}(t) = (X(t) - AVERAGE(X))/STDEV(X)**

where
AVERAGE(X) and STDEV(X) are the sample mean (average) and standard deviation of
X, in Excel notation. (I am going
to be a bit sloppy and use Excel notation rather than conventional math symbols
in some places to avoid problems with fonts.) For example, suppose that AVERAGE(X) =
20 and STDEV(X) = 5. If X(t) = 25, then X^{STD}(t) = 1, if X(t) = 10,
then X^{STD}(t) = -2, and so on.
Y^{STD} will denote the similarly standardized value of Y.

Now, **the
correlation coefficient is equal to the average product of the standardized values**
of the two variables:

**r _{XY}
= AVERAGE(X^{STD}Y^{STD}) = (X^{STD}(1)Y^{STD}(1)
+ X^{STD}(2)Y^{STD}(2) + ... + X^{STD}(n)Y^{STD}(n))/n**

... where
n is the sample size. Thus, for example, if X and Y are stored in columns on a
spreadsheet, you can use the AVERAGE and STDEV functions to compute their
averages and standard deviations, then you can create two new columns in which
the values of X^{STD} and Y^{STD} in each row are computed
according to the formula above. Then create a third new column in which X^{STD}
is multiplied by Y^{STD} in every row. The average of the values in the
last column is the correlation between X and Y. Of course, in Excel, you can
just use the formula **=CORREL(X,Y)** to
calculate a correlation coefficient, where X and Y denote the cell ranges of
the data for the variables.

If the two
variables tend to vary on the *same sides* of their respective means at
the same time, then the average product of their deviations (and hence the
correlation between them) will be *positive*, since the product of two
numbers with the same sign is positive. Conversely, if they tend to vary on *opposite*
sides of their respective means at the same time, their correlation will be *negative*.
If they vary *independently* with respect to their means--that is, if one
is equally likely to be above or below its mean regardless of what the other is
doing--then the correlation will be *zero*.

The
correlation coefficient is not only the average product of the standardized
values, but also: **the correlation coefficient is the "best"
coefficient for multiplying the standardized value of one variable in order to
predict the standardized value of the other**. That is, the "best"
linear model (in a minimum-squared-error sense) for predicting Y^{STD}
from X^{STD} turns out to be:

**Ý ^{STD} (t) = r_{XY} X^{STD}
(t)**

where **Ý ^{STD}** denotes the forecast for

In
graphical terms, this means that, on a scatter plot of Y^{STD} versus X^{STD},
the best-fit line for predicting Y^{STD} from X^{STD} is *the
line that passes through the origin and has slope* *r*_{XY}. This fact is not supposed to be obvious, but it is
easily proved by elementary differential calculus of several variables.

Here is a
graphical example: on a scatterplot
of Y^{STD} versus X^{STD}, the visual axis of symmetry is a
line that passes through the origin and whose slope is equal to 1 (i.e., a
45-degree line), which is the gray dashed line on the plot below. It passes through the origin because the
means of both standardized variables are zero, and its slope is equal to 1
because their standard deviations are both equal to 1. (The latter fact means
that the points are equally spread out horizontally and vertically in terms of
mean squared deviations from zero, which forces their pattern to appear roughly
symmetric around the 45-degree line if the relationship between the variables
really is linear.) However, the
gray dashed line is the not the best line to use for predicting the value of Y^{STD}
for a given value of X^{STD}.
The best line for predicting Y^{STD} from X^{STD} has a
slope of less than 1:* it regresses toward the X axis*. The regression line is shown in red, and
its slope is the correlation between X and Y, which is 0.46 in this case. Why is this true? Because, that’s the way to bet if
you want to minimize the root-mean-square error *measured in the Y direction*.
If instead you wanted to predict X^{STD} from Y^{STD} so
as to minimize root-mean-square error measured in the X direction, the line
would regress in the other direction relative to the 45-degree line, and by
exactly the same amount.

If we want
to obtain the linear regression equation for predicting Y from X in *unstandardized
terms*, we just need to substitute the formulas for the standardized values
in the preceding equation, which then becomes:

**(Ý(t) - AVERAGE(Y))/STDEV(Y)** = **r _{XY} (X(t) - AVERAGE(X))/STDEV(X)**

where **Ý** denotes the forecast for **Y**. If we now rearrange this equation and
collect constant terms, we obtain:

**Ý(t) = b _{0} + b_{1} X(t)**

where:

**b _{1}
= r_{XY} (STDEV(Y)/STDEV(X))** is the estimated slope of the regression
line, and

**b _{0}
= AVERAGE(Y) – b_{1} (AVERAGE(X))** is the estimated
Y-intercept of the line.

Notice
that, as we claimed earlier, the coefficients in the linear equation for
predicting Y from X depend only on the means and standard deviations of X and Y
and on their coefficient of correlation.

*Perfect* positive
correlation (r_{XY} = +1) or perfect negative correlation (r_{XY}
= -1) is only obtained if one variable is an *exact* linear function of
the other, without error. In such a case, one variable is merely a linear
transformation of the other--they aren't really "different" variables
at all!

In
general, therefore, we find less-than-perfect correlation, which is to say, we
find that r_{XY} is less than 1 in absolute value. Therefore our
prediction for Y^{STD} will typically be *smaller* in absolute
value than our observed value for X^{STD}. That is, we will always
predict Y to be closer to its own mean, in units of its own standard deviation,
than X* *was observed to be, which is Galton's phenomenon of regression to
the mean.

So, the
technical explanation of the regression-to-the-mean effect hinges on two
mathematical facts: (i) the correlation coefficient, calculated in the manner
described above, happens to be the coefficient that minimizes the squared error
in predicting Y^{STD} from X^{STD}, and (ii) the correlation
coefficient is never larger than 1 in absolute value, and it is only equal to 1
when Y^{STD} is an exact (noiseless) linear function of X^{STD}.

The term
"regression" has stuck and has even mutated from an intransitive verb
into a transitive one since Galton's time. We don't merely say that the
predictions for Y "regress to the mean"--we now say that we are
"regressing Y on X" when we estimate a linear equation for predicting
Y from X, and we refer to X as a "regressor" in this case.

When we
have fitted a linear model, we can compute its mean squared prediction error
and compare this to the variance of the original variable. As noted above, we
hope to find that the MSE is *less* than the original variance. The
relative amount by which the mean squared error is less than the variance of
the original variable is referred to as the *fraction* of the variance
that was *explained* by the model. For example, if the MSE is 20% less
than the original variance, we say we have "explained 20% of the
variance."

It turns
out that **in a simple regression model **(a linear model with only one *"*X"
variable), **the fraction of variance explained is precisely the square of the
correlation coefficient**--i.e., the square of r. Hence, the
fraction-of-variance-explained has come to be known as "R-squared."

In a *multiple*
regression model (a linear model with two or more *"*X" variables),
there are many correlation coefficients that must be computed, in addition to
all the means and variances. For example, we must consider the correlation
between *each* X variable and the Y variable, and also the correlation
between each *pair* of X variables. In this case, it still turns out that
the model coefficients and the fraction-of-variance-explained statistic can be
computed entirely from knowledge of the means, standard deviations, and
correlation coefficients among the variables--but the computations are no
longer easy. We will leave the details to the computer...

**Related topics:**

·
What to look for in regression
output

·
Testing the assumptions of
linear regression

·
What’s the bottom
line: how to compare models

·
Additional notes on regression
analysis