Mathematics of
simple regression



Formulas
for the slope and intercept of a simple regression model

Formulas for
R-squared and standard error of the regression

Formulas
for standard errors and confidence limits for means and forecasts

To set the stage for discussing the formulas
used to fit a simple (one-variable) regression model, let's briefly
review the formulas for the mean
model, which can be considered as a constant-only (zero-variable)
regression model. You can use
regression software to fit this model and produce all of the standard table and
chart output by merely not selecting any independent variables. R-squared will be zero in this case,
because the mean model does not explain any of the variance in the dependent
variable: it merely measures it.

The forecasting equation of the mean model
is:

$$\hat{Y}_t = b_0$$

...where b_{0} is the sample mean:

$$b_0 = \text{AVERAGE}(Y) = \frac{1}{n}\sum_{t=1}^{n} Y_t$$

The **sample mean** has the (non-obvious)
property that it **is the value around
which the mean squared deviation of the data is minimized**, and the same
least-squares criterion will be used later to estimate the "mean
effect" of an independent variable.

The error that the mean model makes for
observation t
is therefore the deviation of Y_{t} from
its historical average value:

$$e_t = Y_t - \hat{Y}_t = Y_t - b_0$$

The **standard
error of the model**, denoted by s, is
our estimate of **the standard deviation
of the noise in Y** (the variation in it that is considered unexplainable).
Smaller is better, other things being equal: we want the model to explain as
much of the variation as possible. In the mean model, the standard error of the
model is just the sample standard deviation of Y:

$$s = \sqrt{\frac{\sum_{t=1}^{n} e_t^2}{n-1}} = \text{STDEV.S}(Y)$$

(Here and elsewhere, STDEV.S(X) denotes the *sample* standard deviation of a variable X, using Excel
notation. The *population* standard
deviation is STDEV.P(X).) Note that the
standard error of the model is *not*
the square root of the average value of the squared errors within the
historical sample of data. Rather, the sum of squared errors is divided by n-1 rather than n under the square root
sign because this adjusts for the fact that a "degree of freedom for error"
has been used up by estimating one model parameter (namely the mean) from the
sample of n data points.
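
The mean-model calculations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values and variable names are invented here, and the standard-library functions `statistics.stdev` and `statistics.pstdev` are used as stand-ins for Excel's STDEV.S and STDEV.P.

```python
import math
import statistics

# Hypothetical sample of the dependent variable Y (illustrative values only).
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)

# Mean model: the forecast for every observation is the sample mean b0.
b0 = sum(y) / n
errors = [yt - b0 for yt in y]
sse = sum(e * e for e in errors)

# Standard error of the model: the sum of squared errors is divided by n-1.
s = math.sqrt(sse / (n - 1))

# Root-mean-square error: divides by n instead, so it is always smaller than s.
rmse = math.sqrt(sse / n)

print(f"b0 = {b0:.4f}, s = {s:.4f}, rmse = {rmse:.4f}")
```

In this sketch `s` agrees with `statistics.stdev(y)` and `rmse` agrees with `statistics.pstdev(y)`, which is the distinction between STDEV.S and STDEV.P drawn in the text.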

The accuracy of the estimated mean is
measured by the **standard error of the
mean**, whose formula in the mean model is:

$$\text{SE}_{\text{mean}} = \frac{s}{\sqrt{n}}$$

This is the estimated standard deviation of
the error in estimating the mean. Notice that it is inversely proportional to the
square root of the sample size, so it tends to go down as the sample size goes
up. For example, if the sample size is increased by a factor of 4, the standard
error of the mean goes down by a factor of 2, i.e., our estimate of the mean
becomes twice as precise.

The accuracy of a forecast is measured by the **standard error of the forecast**, which
(for both the mean model and a regression model) is the square root of the sum
of squares of the standard error of the model and the standard error of the
mean:

$$\text{SE}_{\text{fcst}} = \sqrt{s^2 + \text{SE}_{\text{mean}}^2}$$

This is the estimated
standard deviation of the error in the forecast, which is not quite the same
thing as the standard deviation of the unpredictable variations in the data
(which is s). It takes into
account both the unpredictable variations in Y *and* the error
in estimating the mean. In the mean model, the standard error of the mean is a
constant, while in a regression model it depends on the value of the independent
variable at which the forecast is computed, as explained in more detail below.

The standard error of the forecast gets
smaller as the sample size is increased, but only up to a point. More data
yields a systematic reduction in the standard error of the mean, but it does *not* yield a systematic reduction in the
standard error of the model. The standard error of the model will change to
some extent if a larger sample is taken, due to sampling variation, but it
could equally well go up or down. The variations in the data that were
previously considered to be inherently unexplainable remain inherently
unexplainable if we continue to believe in the model's assumptions, so
the standard error of the model is always a lower bound on the standard error
of the forecast.
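
To make the effect of sample size concrete, here is a small Python sketch for the mean model. The function name and the value of s are hypothetical, and s is deliberately held fixed to mimic the idea that more data does not systematically change the standard error of the model.

```python
import math

def mean_model_standard_errors(s, n):
    """Return (SE of the mean, SE of the forecast) for the mean model."""
    se_mean = s / math.sqrt(n)
    # Square root of the sum of squares; in the mean model this
    # simplifies to s * sqrt(1 + 1/n).
    se_fcst = math.sqrt(s ** 2 + se_mean ** 2)
    return se_mean, se_fcst

s = 10.0  # hypothetical standard error of the model, held fixed
for n in (4, 16, 64, 256):
    se_mean, se_fcst = mean_model_standard_errors(s, n)
    print(f"n={n:4d}  SE_mean={se_mean:7.4f}  SE_fcst={se_fcst:7.4f}")
```

Each fourfold increase in n halves the standard error of the mean, while the standard error of the forecast shrinks toward s but never falls below it.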

**Confidence
intervals**
for the mean and for the forecast are equal to **the point estimate plus-or-minus the appropriate standard error
multiplied by the appropriate 2-tailed critical value of the t distribution**. The critical value
that should be used depends on the number of degrees of freedom for error (the
number of data points minus the number of parameters estimated, which is n-1 for this model) and
the desired level of confidence. It can be computed in Excel using the **T.INV.2T** function. So, for example, a
95% confidence interval for the forecast is given by

$$\hat{Y}_t \pm \text{T.INV.2T}(0.05,\, n-1) \times \text{SE}_{\text{fcst}}$$

In
general, T.INV.2T(0.05, n-1)
is fairly close to 2 except for very small samples, i.e., a 95% confidence
interval for the forecast is roughly equal to the forecast plus-or-minus two
standard errors. (In older versions of Excel, this function was just called
TINV.)
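
For readers working outside Excel, the following Python sketch shows one way to obtain a two-tailed critical t-value with the standard library alone and use it in a 95% confidence interval for a mean-model forecast. The function names (`t_pdf`, `t_central_prob`, `t_inv_2t`) and the data are hypothetical; the numerical method (Simpson's rule plus bisection) is a choice made here for self-containment, not the way Excel computes T.INV.2T.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_central_prob(t, df, steps=1000):
    """P(-t < T < t), using Simpson's rule on [0, t] and symmetry."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3

def t_inv_2t(alpha, df):
    """Two-tailed critical value, analogous to T.INV.2T(alpha, df)."""
    lo, hi = 0.0, 1000.0
    for _ in range(60):  # bisection on the central probability
        mid = (lo + hi) / 2
        if t_central_prob(mid, df) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical data for the mean model.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(y)
b0 = sum(y) / n
s = math.sqrt(sum((yt - b0) ** 2 for yt in y) / (n - 1))
se_fcst = s * math.sqrt(1 + 1 / n)

t_crit = t_inv_2t(0.05, n - 1)
lower, upper = b0 - t_crit * se_fcst, b0 + t_crit * se_fcst
print(f"t_crit = {t_crit:.3f}, 95% interval = ({lower:.3f}, {upper:.3f})")
```

With only 5 observations the critical value is well above 2, which illustrates why the "plus-or-minus two standard errors" rule of thumb is reserved for samples that are not very small.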

**Formulas for the slope and intercept of a simple
regression model:**

Now let's regress. A simple regression model
includes a single independent variable, denoted here by X, and its forecasting
equation in real units is

$$\hat{Y}_t = b_0 + b_1 X_t$$

It differs from the
mean model merely by the addition of a multiple of X_{t} to the forecast. The
estimated constant b_{0} is
the **Y-intercept** of the regression
line (usually just called "the intercept" or "the
constant"), which is the value that would be predicted for Y at X = 0. The estimated
coefficient b_{1} is the **slope** of the regression line, i.e., the
predicted change in Y per
unit of change in X.
The
simple regression model reduces to the mean model in the special case where the
estimated slope is exactly zero. The estimated slope is almost never exactly
zero (due to sampling variation), but if it is not significantly different from
zero (as measured by its t-statistic), this suggests that the mean model should
be preferred on grounds of simplicity unless there are good a priori reasons
for believing that a relationship exists, even if it is largely obscured by
noise.

Usually we do not
care too much about the exact value of the intercept or whether it is
significantly different from zero, unless we are really interested in what happens
when X goes to
"absolute zero" on whatever scale it is measured. Often X is a variable which
logically can never go to zero, or even close to it, given the way it is
defined. So, attention usually focuses mainly on the slope coefficient in the
model, which measures the change in Y to be expected per unit of change in X as both variables
move up or down relative to their historical mean values on their own natural
scales of measurement.

The coefficients, standard errors, and
forecasts for this model are obtained as follows. First we need to compute the **coefficient of correlation** between Y and X, commonly denoted by
**r_{XY}**, which measures the
strength of their linear relation on a relative scale of -1 to +1. There are various formulas for it, but
the one that is most intuitive is expressed in terms of the standardized values
of the variables. The standardized value of X, denoted by X^{*}, is its
deviation from its mean measured in units of standard deviations:

$$X^{*}_t = \frac{X_t - \text{AVERAGE}(X)}{\text{STDEV.P}(X)}$$

... where STDEV.P(X) is the *population*
standard deviation, as noted above.
(Sometimes the *sample* standard
deviation is used to standardize a variable, but the population standard
deviation is needed in this particular formula.) Y^{*}
will denote the similarly standardized value of Y.

**The
correlation coefficient is equal to the average product of the standardized
values of the two variables:**

$$r_{XY} = \frac{X^{*}_1 Y^{*}_1 + X^{*}_2 Y^{*}_2 + \cdots + X^{*}_n Y^{*}_n}{n}$$

It is intuitively obvious that this statistic
will be positive [negative] if X and Y tend to move in the same
[opposite] direction relative to their respective means, because in this case X^{*} and Y^{*} will tend to have the same [opposite]
sign. Also, if X and Y are *perfectly*
positively correlated, i.e., if Y is an *exact* positive linear function of X, then Y^{*}_{t} = X^{*}_{t} for all t, and the formula for
**r_{XY}** reduces to (STDEV.P(X)/STDEV.P(X))^{2}, which is equal to 1.
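**The
least-squares estimate of the slope coefficient (b_{1}) is equal to the
correlation times the ratio of the standard deviation of Y to the standard
deviation of X:**

$$b_1 = r_{XY} \times \frac{\text{STDEV.P}(Y)}{\text{STDEV.P}(X)}$$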

The ratio of standard deviations on the RHS
of this equation merely serves to scale the correlation coefficient
appropriately for the real units in which the variables are measured. (The
sample standard deviation could also be used here, because they only differ by
a scale factor.)

**The least-squares
estimate of the intercept is the mean of Y minus the slope coefficient times the mean of X:**

$$b_0 = \text{AVERAGE}(Y) - b_1 \times \text{AVERAGE}(X)$$

This equation implies that Y must be predicted to
be equal to its own average value whenever X is equal to its own average value.
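
The chain of calculations just described (standardize, average the products, rescale, then solve for the intercept) can be sketched in Python as follows. The data and the helper name `standardize` are made up for illustration, and `statistics.pstdev` is assumed here to play the role of Excel's STDEV.P.

```python
import statistics

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

def standardize(values):
    """Deviations from the mean in units of the population std. deviation."""
    m = statistics.mean(values)
    sd_p = statistics.pstdev(values)
    return [(v - m) / sd_p for v in values]

x_star = standardize(x)
y_star = standardize(y)

# Correlation: the average product of the standardized values.
r_xy = sum(xs * ys for xs, ys in zip(x_star, y_star)) / n

# Slope: correlation times the ratio of the standard deviations.
b1 = r_xy * statistics.pstdev(y) / statistics.pstdev(x)

# Intercept: mean of Y minus the slope times the mean of X.
b0 = statistics.mean(y) - b1 * statistics.mean(x)

print(f"r_XY = {r_xy:.4f}, b1 = {b1:.4f}, b0 = {b0:.4f}")
```

Evaluating `b0 + b1 * statistics.mean(x)` returns the mean of `y`, which is the property stated above: the fitted line passes through the point of means.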

The standard error of the model (denoted again by s) is usually referred
to as the **standard error of the
regression** (or sometimes the "standard error of the estimate") in
this context, and it **is equal to the square
root of {the sum of squared errors divided by n-2}**, or equivalently,
the **standard deviation of the errors
multiplied by the square root of (n-1)/(n-2)**, where the latter
factor is a number slightly larger than 1:

$$s = \sqrt{\frac{\sum_{t=1}^{n} e_t^2}{n-2}} = \text{STDEV.S}(\text{errors}) \times \sqrt{\frac{n-1}{n-2}}$$

The sum of squared
errors is divided by n-2 in this calculation rather than n-1 because an additional degree of freedom for
error has been used up by estimating two parameters (a slope and an intercept)
rather than only one (the mean) in fitting the model to the data. The standard error of the regression
is an unbiased estimate of the standard deviation of the noise in the data,
i.e., the variations in Y that
are not explained by the model.

Each of the two model parameters, the slope and
intercept, has its *own* standard
error, which is the estimated standard deviation of the error in estimating it.
(In general, the term "standard error" means "standard deviation
of the error" in whatever is being estimated.) The standard error of the
intercept is

$$\text{SE}_{b_0} = \frac{s}{\sqrt{n}} \times \sqrt{1 + \frac{(\text{AVERAGE}(X))^2}{\text{VAR.P}(X)}}$$

which looks exactly
like the formula for the standard error of the mean in the mean model, except
for the additional term of
(AVERAGE(X))^{2}/VAR.P(X)
under
the square root sign. This term reflects the additional uncertainty about the
value of the intercept that exists in situations where the center of mass of the
independent variable is far from zero (in relative terms), in which case the
intercept is determined by extrapolation far outside the data range. The
standard error of the slope coefficient is given by:

$$\text{SE}_{b_1} = \frac{s}{\sqrt{n}} \times \frac{1}{\text{STDEV.P}(X)}$$

...which also looks
very similar, except for the factor of STDEV.P(X) in the denominator. Note that s is measured in units
of Y and STDEV.P(X) is measured in units
of X, so SE_{b_{1}} is measured (necessarily) in "units of Y per unit of X",
the same as b_{1} itself. The terms in
these equations that involve the variance or standard deviation of X merely serve to
scale the units of the coefficients and standard errors in an appropriate way.

You don't need
to memorize all these equations, but there is one important thing to note: **the standard errors of the coefficients are
directly proportional to the standard error of the regression and inversely
proportional to the square root of the sample size.** This means that noise in
the data (whose intensity is measured by s) affects the errors in all the coefficient estimates in
exactly the same way, and it also means that 4 times as much data will tend to
reduce the standard errors of all the coefficients by approximately a factor of
2, assuming the data is really all generated from the same model, and a really
huge amount of data will reduce them toward zero.

However,
**more data will not systematically reduce
the standard error of the regression**. As with the mean model, variations that
were considered inherently unexplainable before are still not going to be
explainable with more of the same kind of data under the same model
assumptions. As the sample size gets larger, the standard error of the
regression merely becomes a more accurate estimate of the standard deviation of
the noise.
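
As a rough illustration of the standard-error formulas in this section, the Python sketch below computes the standard error of the regression in both of the equivalent ways described above, and then the standard errors of the intercept and slope. The data set is hypothetical, and the `statistics` functions are assumed to correspond to Excel's STDEV.S, STDEV.P and VAR.P.

```python
import math
import statistics as st

# Hypothetical data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mx, my = st.mean(x), st.mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)

# Algebraically the same slope as r_XY * STDEV.P(Y) / STDEV.P(X).
b1 = sxy / sxx
b0 = my - b1 * mx
errors = [b - (b0 + b1 * a) for a, b in zip(x, y)]

# Standard error of the regression: divide the sum of squares by n-2 ...
s = math.sqrt(sum(e * e for e in errors) / (n - 2))
# ... or rescale the sample standard deviation of the errors.
s_alt = st.stdev(errors) * math.sqrt((n - 1) / (n - 2))

# Standard errors of the intercept and the slope.
se_b0 = (s / math.sqrt(n)) * math.sqrt(1 + mx ** 2 / st.pvariance(x))
se_b1 = (s / math.sqrt(n)) / st.pstdev(x)

print(f"s = {s:.4f} (alt {s_alt:.4f}), SE_b0 = {se_b0:.4f}, SE_b1 = {se_b1:.4f}")
```

Both coefficient standard errors contain the common factor `s / math.sqrt(n)`, which is why they scale together with the noise level and with the sample size.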

**Formulas for
R-squared and standard error of the regression**

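**The fraction of the variance of Y that is
"explained"** by the simple regression model, i.e., the percentage by
which the sample variance of the errors ("residuals") is less than
the sample variance of Y
itself, **is equal to the square of the
correlation** between Y and X, i.e., "R squared":

$$\text{R-squared} = r_{XY}^2 = 1 - \frac{\text{VAR.S}(\text{errors})}{\text{VAR.S}(Y)}$$

Equivalently:

$$\text{VAR.S}(\text{errors}) = (1 - r_{XY}^2) \times \text{VAR.S}(Y)$$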

Thus, for example, if the correlation is r_{XY} = 0.5, then r_{XY}^{2} = 0.25, so the simple
regression model explains 25% of the variance in Y in the sense that the sample variance of the
errors of the simple regression model is 25% less than the sample variance of Y. This is not supposed
to be obvious. It is a "strange but true" fact that can be proved
with a little bit of calculus.

By taking square roots everywhere, the same
equation can be rewritten in terms of standard deviations to show that **the standard deviation of the errors is
equal to the standard deviation of the dependent variable times the square root
of 1-minus-the-correlation-squared**:

$$\text{STDEV.S}(\text{errors}) = \sqrt{1 - r_{XY}^2} \times \text{STDEV.S}(Y)$$

However, the sample
variance and standard deviation of the errors are not *unbiased* estimates of the variance and standard deviation of the unexplained
variations in the data, because they do not take into account the fact that 2
degrees of freedom for error have been used up in the process of estimating the
slope and intercept. The fraction by which the square of the *standard error of the regression* is
less than the sample variance of Y (which is the fractional reduction in unexplained variation compared to
using the mean model) is the "adjusted" R-squared of the model, and
in a simple regression model it is given by the formula

$$\text{Adjusted R-squared} = 1 - \frac{n-1}{n-2} \times (1 - \text{R-squared})$$

The factor of (n-1)/(n-2) in this equation is the same adjustment for
degrees of freedom that is made in calculating the standard error of the
regression. In fact, adjusted R-squared can be used to determine the standard
error of the regression from the sample standard deviation of Y in exactly the same
way that R-squared can be used to determine the sample standard deviation of
the errors as a fraction of the sample standard deviation of Y:

$$s = \sqrt{1 - \text{Adjusted R-squared}} \times \text{STDEV.S}(Y)$$

You can apply this equation without even
calculating the model coefficients or the actual errors!
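
The shortcut described here can be checked numerically. The sketch below uses hypothetical data, computes R-squared and adjusted R-squared from the correlation alone, derives the two error measures from the standard deviation of Y, and only afterwards fits the model directly as a cross-check.

```python
import math
import statistics

# Hypothetical data (illustrative values only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

mx, my = statistics.mean(x), statistics.mean(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r_squared = sxy ** 2 / (sxx * syy)  # square of the correlation r_XY
adj_r_squared = 1 - ((n - 1) / (n - 2)) * (1 - r_squared)

# Error measures obtained without computing any coefficients or errors.
sd_y = statistics.stdev(y)
sd_errors = math.sqrt(1 - r_squared) * sd_y
s = math.sqrt(1 - adj_r_squared) * sd_y

# Cross-check: fit the line and compute the same quantities directly.
b1 = sxy / sxx
b0 = my - b1 * mx
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
sd_errors_direct = math.sqrt(sse / (n - 1))
s_direct = math.sqrt(sse / (n - 2))

print(f"R2 = {r_squared:.4f}, adj R2 = {adj_r_squared:.4f}")
print(f"s = {s:.4f} (direct {s_direct:.4f})")
```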

In a multiple regression model with k independent variables
plus an intercept, the number of degrees of freedom for error is n-(k+1), and the formulas for the standard error of
the regression and adjusted R-squared remain the same except that the n-2 term is replaced by n-(k+1).

It follows from the equation above that if
you fit simple regression models to the same sample of the same dependent
variable Y with different
choices of X as the independent
variable, then **adjusted R-squared
necessarily goes up as the standard error of the regression goes down, and vice
versa.** Hence, it is equivalent to say that your goal is to minimize the
standard error of the regression or to maximize adjusted R-squared through your
choice of X, other things being
equal. However, as I will keep saying, the standard error of the regression is
the real "bottom line" in your analysis: it measures the variations
in the data that are not explained by the model in real economic or physical
terms.

Adjusted R-squared can actually be negative
if X has no measurable
predictive value with respect to Y. In particular, if the correlation between X and Y is exactly zero, then
R-squared is exactly equal to zero, and adjusted R-squared is equal to 1 - (n-1)/(n-2), which is negative because the ratio (n-1)/(n-2) is greater than 1. If this is the case,
then the mean model is clearly a better choice than the regression model. Some
regression software will not even display a negative value for adjusted
R-squared and will just report it to be zero in that case.
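
A short Python sketch makes the negative case easy to see. The function name `adjusted_r_squared` and its argument names are hypothetical; the parameter `k` covers the multiple-regression generalization mentioned above, with k = 1 for simple regression.

```python
def adjusted_r_squared(r_squared, n, k=1):
    """Adjusted R-squared for a model with k predictors plus an intercept."""
    df_error = n - (k + 1)  # degrees of freedom for error
    if df_error <= 0:
        raise ValueError("need more observations than estimated parameters")
    return 1 - ((n - 1) / df_error) * (1 - r_squared)

# Zero correlation with n = 10: the adjustment pushes the value below zero.
print(adjusted_r_squared(0.0, 10))

# The penalty shrinks as the sample size grows.
print(adjusted_r_squared(0.0, 1000))
```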

**Formulas for standard errors and confidence limits for
means and forecasts**

The **standard
error of the mean** of Y for
a given value of
X is the
estimated standard deviation of the error in measuring the height of the
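regression line at that location, given by the formula

$$\text{SE}_{\text{mean}}(X) = \frac{s}{\sqrt{n}} \times \sqrt{1 + \frac{(X - \text{AVERAGE}(X))^2}{\text{VAR.P}(X)}}$$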

This looks a lot like the formula for
the standard error of the mean in the mean model: it is proportional to the
standard error of the regression and inversely proportional to the square root
of the sample size, so it gets steadily smaller as the sample size gets larger,
approaching zero in the limit even in the presence of a lot of noise. However,
in the regression model the standard error of the mean also depends to some
extent on the value of X,
so the term s/SQRT(n) is
scaled up by a factor that is at least 1 and is larger for values of X that are farther from
its mean, because there is relatively greater uncertainty about the true height
of the regression line for values of X that are farther from its historical mean value.

The **standard
error for the forecast** for Y for
a given value of
X is then
computed in exactly the same way as it was for the mean model:

$$\text{SE}_{\text{fcst}}(X) = \sqrt{s^2 + (\text{SE}_{\text{mean}}(X))^2}$$

In the regression model it is larger for values of X that are farther from
the mean--i.e., you expect to make bigger forecast errors when extrapolating
the regression line farther out into space--because SE_{mean}(X) is larger for more extreme values of X. The standard error of the
forecast is not quite as sensitive to X in relative terms as is the standard
error of the mean, because of the presence of the noise term s^{2} under the square root sign. (Remember that s^{2} is the estimated variance of the noise in the data.) In fact, s is
usually much larger than SE_{mean}(X) unless the data set is very small or X is very
extreme, so usually the standard error of the forecast is not too much larger
than the standard error of the regression.

Finally, **confidence limits for
means and forecasts** are calculated in the usual way, namely as the forecast
plus or minus the relevant standard error times the critical t-value for the
desired level of confidence and the number of degrees of freedom, where the
latter is n-2 for a simple regression model. For all but
the smallest sample sizes, a 95% confidence interval is approximately equal to
the point forecast plus-or-minus two standard errors, although there is nothing
particularly magical about the 95% level of confidence. You can choose your
own, or just report the standard error along with the point forecast.
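
The following Python sketch pulls these pieces together for a forecast at a chosen value of X. The function name, the data, and the critical value 3.182 (approximately what T.INV.2T(0.05, 3) returns for 3 degrees of freedom) are assumptions made for the example rather than anything taken from the text.

```python
import math
import statistics

def forecast_with_standard_errors(x0, x, y):
    """Return (forecast, SE of the mean, SE of the forecast) at X = x0."""
    n = len(x)
    mx, my = statistics.mean(x), statistics.mean(y)
    var_p_x = statistics.pvariance(x)  # population variance, like VAR.P
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * var_p_x)
    b0 = my - b1 * mx
    sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
    s = math.sqrt(sse / (n - 2))  # standard error of the regression
    se_mean = (s / math.sqrt(n)) * math.sqrt(1 + (x0 - mx) ** 2 / var_p_x)
    se_fcst = math.sqrt(s ** 2 + se_mean ** 2)
    return b0 + b1 * x0, se_mean, se_fcst

# Hypothetical data and an assumed critical t-value for n - 2 = 3 d.f.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
t_crit = 3.182

for x0 in (3.0, 5.0):  # at the mean of X, and at the edge of the data
    fcst, se_mean, se_fcst = forecast_with_standard_errors(x0, x, y)
    print(f"X={x0}: forecast={fcst:.3f}, "
          f"mean limits=+/-{t_crit * se_mean:.3f}, "
          f"forecast limits=+/-{t_crit * se_fcst:.3f}")
```

Moving from the mean of X to the edge of the data widens the limits for the mean proportionally more than the limits for the forecast, because the forecast limits also contain the constant noise term.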

Here are a couple of additional pictures that
illustrate the behavior of the standard-error-of-the-mean and the
standard-error-of-the-forecast in the special case of a simple regression
model. Because the standard error of the mean gets larger for extreme
(farther-from-the-mean) values of X, the confidence intervals for the mean (the height of the regression
line) widen noticeably at either end.

The confidence
intervals for predictions also get wider when X goes to extremes, but the effect is not quite as dramatic, because the
standard error of the regression (which is usually a bigger component of
forecast error) is a constant. Note that the inner set of confidence bands
widens more in relative terms at the far left and far right than does the outer
set of confidence bands.

But remember: the standard errors and confidence
bands that are calculated by the regression formulas are all based on the
assumption that the model is correct, i.e., that the data really is described
by the assumed linear equation with normally distributed errors. If the model
assumptions are not correct--e.g., if the wrong variables have been included or
important variables have been omitted or if there are non-normalities in the
errors or nonlinear relationships among the variables--then the predictions and
their standard errors and confidence limits may all be suspect. So, when we fit
regression models, we don't just look at the printout of the model
coefficients. We look at various other statistics and charts that shed light on
the validity of the model assumptions.

1. The coefficients and error measures for a
regression model are entirely determined by the following summary statistics: **means**, **standard deviations** and **correlations**
among the variables, and the **sample size**.

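2. The correlation between Y and X, denoted by **r_{XY}**, is equal to the average product of their standardized values, where each variable is standardized using its population standard deviation.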

3. The **slope coefficient** in a simple regression of Y on X is the correlation between Y
and X multiplied by the ratio of their standard
deviations:

$$b_1 = r_{XY} \times \frac{\text{STDEV.P}(Y)}{\text{STDEV.P}(X)}$$

Either the population or sample standard deviation
(STDEV.S) can be used in this formula because they differ only by a
multiplicative factor.

4. In a simple regression model, the percentage of
variance "explained" by the model, which is called R-squared, is the
square of the correlation between Y and X. That
is, **R-squared = r_{XY}^{2}**, and that's why
it's called R-squared. This means that the sample standard deviation of
the errors is equal to {the square root of 1-minus-R-squared} times the sample
standard deviation of Y:

**STDEV.S(errors) =
(SQRT(1 minus R-squared)) x STDEV.S(Y).**

So, if you know the standard deviation of Y, and you know the
correlation between Y and X, you can figure out
what the standard deviation of the errors would be if you regressed Y on X. However...

5. The sample standard deviation of the
errors is a *downward-biased* estimate
of the size of the true unexplained deviations in Y because it does not adjust for the additional
"degree of freedom" used up by estimating the slope coefficient. An *unbiased* estimate of the standard
deviation of the true errors is given by the *standard error of the regression*, denoted by s. In the special case
of a simple regression model, it is:

**Standard error of
regression = STDEV.S(errors) x SQRT((n-1)/(n-2))**

*This is the real
bottom line*,
because the standard deviations of the errors of all the forecasts and
coefficient estimates are directly proportional to it (if the model's
assumptions are correct!)

6. *Adjusted*
R-squared, which is obtained by adjusting R-squared for the degrees of freedom
for error in exactly the same way, is an unbiased estimate of the amount of
variance explained:

**Adjusted R-squared = 1 - ((n-1)/(n-2)) x (1 - R-squared).**

For large values of n, there isn't
much difference.

In a multiple regression model in which k is the number of
independent variables, the n-2 term that appears in
the formulas for the standard error of the regression and adjusted R-squared
merely becomes n-(k+1).

7. The important thing about adjusted
R-squared is that:

**Standard error of the
regression = (SQRT(1 minus adjusted-R-squared)) x
STDEV.S(Y).**

So, for models fitted to the same sample of
the same dependent variable, adjusted R-squared always goes up when the
standard error of the regression goes down.

**A model
does not always improve when more variables are added:** adjusted R-squared
can go down (even go negative) if *irrelevant*
variables are added.

8. The **standard
error of a coefficient estimate** is the estimated standard deviation of the
error in measuring it. Also, the estimated height of the regression line for a
given value of X
has its
own standard error, which is called the **standard
error of the mean at X.** All of these standard errors are proportional to the
standard error of the regression divided by the square root of the sample size.
So a greater amount of "noise" in the data (as measured by s) makes all the
estimates of means and coefficients proportionally *less* accurate, and a larger sample size makes all of them *more* accurate (4 times as much data reduces
all the standard errors by a factor of 2, etc.). However, more data will not
systematically reduce the standard error of the *regression*. Rather, the standard error of the regression will
merely become a more accurate estimate of the true standard deviation of the
noise.

9. The **standard
error of the forecast** for Y at
a given value of
X **is the square root of the sum of squares of
the standard error of the regression and the standard error of the mean at X.**
The
standard error of the mean is usually a lot smaller than the standard error of
the regression except when the sample size is very small and/or you are trying
to predict what will happen under very extreme conditions (which is dangerous),
so the standard error of the forecast is usually only slightly larger than the
standard error of the regression. (Recall that under the mean model, the
standard error of the mean is a constant. In a simple regression model, the
standard error of the mean depends on the value of X, and it is larger
for values of X
that are
farther from its own mean.)

10. **Two-sided
confidence limits for coefficient estimates, means, and forecasts are all equal
to their point estimates plus-or-minus the appropriate critical t-value times
their respective standard errors**. For a simple regression model, in which
two degrees of freedom are used up in estimating the intercept and the
slope coefficient, the appropriate critical t-value is **T.INV.2T(1 - C, n - 2)** in
Excel, where C is the desired level of confidence and n is the sample size.
The usual default value for the confidence level is 95%, for which the critical
t-value is **T.INV.2T(0.05, n - 2).**
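
Point 1 of this summary can be illustrated with a short Python sketch that fits a simple regression from summary statistics alone, without touching the raw data. The function name, argument names, and the input values are hypothetical; the standard deviations passed in are assumed to be sample standard deviations (STDEV.S), which the function converts to a population variance where the formulas above call for VAR.P.

```python
import math

def simple_regression_from_summary(mean_x, mean_y, sd_x, sd_y, r_xy, n):
    """Coefficients and error measures from summary statistics only."""
    b1 = r_xy * sd_y / sd_x          # ratio is the same for STDEV.S or STDEV.P
    b0 = mean_y - b1 * mean_x
    r_squared = r_xy ** 2
    adj_r_squared = 1 - ((n - 1) / (n - 2)) * (1 - r_squared)
    s = math.sqrt(1 - adj_r_squared) * sd_y   # standard error of regression
    var_p_x = sd_x ** 2 * (n - 1) / n         # convert VAR.S(X) to VAR.P(X)
    se_b1 = (s / math.sqrt(n)) / math.sqrt(var_p_x)
    se_b0 = (s / math.sqrt(n)) * math.sqrt(1 + mean_x ** 2 / var_p_x)
    return {
        "b0": b0, "b1": b1,
        "r_squared": r_squared, "adj_r_squared": adj_r_squared,
        "s": s, "se_b0": se_b0, "se_b1": se_b1,
    }

# Hypothetical summary statistics for a sample of 5 observations.
fit = simple_regression_from_summary(
    mean_x=3.0, mean_y=4.0,
    sd_x=math.sqrt(2.5), sd_y=math.sqrt(1.5),
    r_xy=math.sqrt(0.6), n=5,
)
for name, value in fit.items():
    print(f"{name:>14s} = {value:.4f}")
```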

The
accompanying Excel file with
simple regression formulas shows how the calculations described above can
be done on a spreadsheet, including a comparison with output from RegressIt. For the case in which there are two or
more independent variables, a so-called **multiple
regression **model, the calculations are not too much harder if you are
familiar with how to do arithmetic with vectors and matrices. Here is an Excel
file with regression formulas in matrix form that illustrates this process.
