What to look for in
regression output


**Latest news:** If you are at least
a part-time user of Excel, you should check out the new release of RegressIt, a
free add-in developed by the author of this site. See it at http://regressit.com.
The linear regression version runs on both PCs and Macs and has a richer and
easier-to-use interface and much better designed output than other add-ins for
statistical analysis. It may make a good complement to, if not a substitute for, whatever
regression software you are currently using, Excel-based or otherwise. (If you have
been using Excel's Analysis ToolPak for regression, this is the time to stop.) RegressIt
now includes a two-way
interface with R that allows you to run linear and logistic regression
models in R without writing any code whatsoever. It also includes extensive built-in
documentation and pop-up teaching notes. There is a separate logistic
regression version with interactive tables and charts that runs on PCs.

Standard error of the regression and other measures of error
size

Adjusted R-squared (not the bottom line!)

Significance of the estimated coefficients

Values of the estimated coefficients

Plots of forecasts and residuals (important!)

Out-of-sample validation

For a sample of output that illustrates the various
topics discussed here, see the “Regression
Example, part 2” page.

**(i) Standard error of the regression (root-mean-squared error adjusted for degrees
of freedom):** Does the current regression model yield
smaller errors, on average, than the best model previously fitted, and is the
improvement significant in *practical*
terms? In regression modeling, the
best single error statistic to look at is the standard error of the regression,
which is the estimated standard deviation of the unexplainable variations in
the dependent variable. (It is
approximately the standard deviation of the errors, apart from the
degrees-of-freedom adjustment.)
This is what your software is trying to minimize when estimating
coefficients, and it is a sufficient statistic for describing properties of the
errors if the model’s assumptions are all correct.

Furthermore,
the **standard error of the regression is
a lower bound on the standard error of any forecast generated from the model**. In general the forecast standard error
will be a little larger because it also takes into account the errors in
estimating the coefficients and the relative extremeness of the values of the
independent variables for which the forecast is being computed. If the sample size is large and the
values of the independent variables are not extreme, the forecast standard
error will be only slightly larger than the standard error of the
regression. (See page 14 of these
notes for more details.)
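As a minimal sketch of the relationship described above, the following Python code fits a simple (one-variable) regression, computes the standard error of the regression, and then computes the forecast standard error at two values of the independent variable. The function names and the data are hypothetical, chosen only for illustration, and the forecast formula used is the standard one for simple regression with a constant.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1*x. Returns (b0, b1, s), where s is
    the standard error of the regression."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    # Two coefficients were estimated, so the divisor is n - 2, not n.
    s = math.sqrt(sse / (n - 2))
    return b0, b1, s

def forecast_standard_error(x, s, x_new):
    """Standard error of a forecast at x_new in a simple regression."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    # The 1/n and (x_new - x_mean)^2 terms reflect coefficient-estimation
    # error and the extremeness of x_new, respectively.
    return s * math.sqrt(1 + 1 / n + (x_new - x_mean) ** 2 / sxx)

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

b0, b1, s = fit_simple_regression(x, y)
se_center = forecast_standard_error(x, s, 4.5)    # at the mean of x
se_extreme = forecast_standard_error(x, s, 20.0)  # far outside the sample
print(s, se_center, se_extreme)
```

In this sketch the forecast standard error at the center of the data exceeds s only by the factor sqrt(1 + 1/n), and it grows as the forecast point moves away from the sample mean.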

You should
directly compare the standard error of the regression between models *only*
if their units are the same and they are fitted to the same (or almost the
same) sample of the same dependent variable. RegressIt provides a Model Summary Report that
shows side-by-side comparisons of error measures and coefficient estimates for
models fitted to the same dependent variable, in order to make such comparisons
easy, although sample sizes may vary if there are missing values in any
independent variables that are not included in all models.

In time
series forecasting, it is common to look not only at root-mean-squared error
but also at the **mean absolute error (MAE)** and, for positive data, the **mean absolute percentage error (MAPE)** in evaluating and comparing
model performance. The latter
measures are easier for non-specialists to understand and they are less
sensitive to extreme errors, which is appropriate if the occasional big mistake is not a serious
concern. Also, it is sometimes appropriate to compare MAPE between models
fitted to different samples of data, because it is a relative rather than
absolute measure. These two
statistics are not routinely reported by most regression software, however.
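Because these statistics are often not reported, it can help to compute them directly. The following is a minimal sketch with hypothetical function names and made-up numbers; the RMSE here is the plain version, without the degrees-of-freedom adjustment used in the standard error of the regression.

```python
import math

def rmse(actual, forecast):
    """Root-mean-squared error (no degrees-of-freedom adjustment)."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def mae(actual, forecast):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error; assumes all actual values are positive."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Made-up illustrative data with one large error in the last period
actual = [100.0, 110.0, 120.0, 130.0]
forecast = [102.0, 108.0, 121.0, 110.0]

print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

With these numbers the single large error pulls RMSE well above MAE, which illustrates why MAE and MAPE are described as less sensitive to extreme errors.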

Whenever you
are working with time series data, you should also ask: does the current regression model
improve on the best *naive* (random walk or random trend) model, according
to these error measures? The **mean absolute
scaled error** statistic measures improvement in mean absolute error
relative to a random-walk-without-drift model. And how has the model been doing lately?
Are its
most recent errors typical in size and random-looking, or are they getting
bigger or more biased?
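One common way to compute the mean absolute scaled error is to divide the model's MAE by the MAE of one-step-ahead random-walk forecasts (each value predicted by the previous one) over the same data. The sketch below assumes that definition; the function name and the data are hypothetical.

```python
def mean_absolute_scaled_error(actual, forecast):
    """MAE of the model divided by the MAE of a random-walk-without-drift
    forecast (previous value predicts the next) on the same series."""
    model_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    naive_errors = [abs(actual[t] - actual[t - 1]) for t in range(1, len(actual))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return model_mae / naive_mae

# Made-up illustrative data
actual = [10.0, 12.0, 11.0, 14.0, 13.0, 16.0]
forecast = [10.5, 11.5, 11.5, 13.0, 13.5, 15.0]

mase = mean_absolute_scaled_error(actual, forecast)
print(mase)
```

A value below 1 indicates that the model's errors are smaller on average than those of the naive random walk, and a value above 1 indicates that the naive model did better.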

**(ii)
Adjusted R-squared:**
This is R-squared (the fraction by which the variance of the errors is less
than the variance of the dependent variable) adjusted for the number of
coefficients in the model relative to the sample size in order to correct it
for bias (the same adjustment used in computing the standard error of the
regression). That is, adjusted
R-squared is the fraction by which the square of the standard error of the
regression is less than the variance of the dependent variable. **It
is the most over-used and abused of all statistics--don't get obsessed with it.** R-squared is not the bottom line. All it
measures is the percentage reduction in mean-squared-error that the regression
model achieves relative to the naive model "Y=constant", which may or
may not be the appropriate naive model for purposes of comparison. Better to
determine the *best* naive model first, and then compare the various error
measures of your regression model (both in the estimation and validation
periods) against that naive model.
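The two definitions above can be written out directly. The following is a small sketch in Python with a hypothetical function name and made-up values; `num_coefficients` counts every estimated coefficient, including the constant.

```python
def r_squared_stats(y, fitted, num_coefficients):
    """Return (R-squared, adjusted R-squared) for fitted values from a
    regression with the given number of estimated coefficients."""
    n = len(y)
    y_mean = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
    sst = sum((yi - y_mean) ** 2 for yi in y)               # total sum of squares
    r2 = 1 - sse / sst
    # Square of the standard error of the regression vs. sample variance of y
    s_squared = sse / (n - num_coefficients)
    var_y = sst / (n - 1)
    adj_r2 = 1 - s_squared / var_y
    return r2, adj_r2

# Made-up illustrative values for a model with a constant and one slope
y = [3.0, 5.0, 7.0, 10.0, 12.0]
fitted = [3.2, 5.1, 7.0, 9.6, 12.5]

r2, adj_r2 = r_squared_stats(y, fitted, num_coefficients=2)
print(r2, adj_r2)
```

Adjusted R-squared is always slightly smaller than R-squared when more than one coefficient is estimated, and the gap widens as coefficients are added relative to the sample size.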

Despite the
fact that adjusted R-squared is a unitless statistic, there is no absolute
standard for what is a "good" value. A regression model fitted to
non-stationary time series data can have an adjusted R-squared of 99% and yet
be inferior to a simple random walk model. On the other hand, a regression
model fitted to stationarized time series data might have an adjusted R-squared
of 10%-20% and still be considered useful (although out-of-sample validation
would be advisable--see section (vi) below). A designed experiment looking for
small but statistically significant effects in a very large sample might accept
even lower values. See this page for more about
these issues.

**(iii)
Significance of the estimated coefficients:** Are the t-statistics greater than 2
in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try
to *refit the model with the least significant variable excluded*, which
is the "backward stepwise" approach to model refinement.

Remember
that the t-statistic is just the *estimated coefficient divided by its own
standard error*. Thus, it measures "how many standard deviations from
zero" the estimated coefficient is, and it is used to test the hypothesis
that the true value of the coefficient is non-zero, in order to confirm that
the independent variable really belongs in the model.

The p-value
is the probability of observing a t-statistic that large or larger in magnitude
given the null hypothesis that the true coefficient value is zero. If the
p-value is greater than 0.05--which occurs roughly when the t-statistic is less
than 2 in absolute value--this means that the coefficient's apparent difference
from zero may be merely "accidental", i.e., it is not statistically significant.

*There's nothing magical about the 0.05
criterion*,
but in practice it usually turns out that a variable whose estimated
coefficient has a p-value of greater than 0.05 can be dropped from the model
without affecting the error measures very much--try it and see!
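The link between a t-statistic of about 2 and a p-value of about 0.05 can be checked with a short calculation. The sketch below uses a normal approximation to the t distribution, which is an assumption that is reasonable only when the degrees of freedom are large; regression software uses the exact t distribution. The function names and the coefficient values are hypothetical.

```python
import math

def t_statistic(coefficient, standard_error):
    """Estimated coefficient divided by its own standard error."""
    return coefficient / standard_error

def approx_two_sided_p_value(t):
    """Two-sided p-value under a standard normal approximation,
    P(|Z| >= |t|) = erfc(|t| / sqrt(2)). Adequate for large samples only."""
    return math.erfc(abs(t) / math.sqrt(2))

# Made-up illustrative coefficient and standard error
t = t_statistic(0.84, 0.35)
p = approx_two_sided_p_value(t)
print(t, p)

# A t-statistic near 2 sits right around the 0.05 threshold
print(approx_two_sided_p_value(1.96), approx_two_sided_p_value(2.0))
```

In small samples the exact p-value for a given t-statistic is somewhat larger than this approximation suggests, so the "greater than 2" rule of thumb is slightly optimistic there.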

**(iv) Values of the estimated coefficients:**
In general you are interested not only in the *statistical* significance of an independent variable but also
in its *practical*
significance. What does it imply in real terms? What have you learned, and how
should you spend your time or money? In theory, the coefficient of a given
independent variable is the change in the average value of the dependent
variable per unit change in that variable, other things being equal. In business and weapons-making, this is
often called "bang for the buck". Such information can be very useful
for decision-making if some of the independent variables are under your
control, for example, the amount of a drug administered to a patient, the price
of a product, or the amount of money spent on promoting it. (See this page for an example involving the effects of several
prices.) Keep in mind that when
sample sizes are very large, an effect that is really quite tiny (say, the
marginal benefit of an expensive new medical treatment) could appear to be
quite large if all you look at is its t-statistic!

In some
cases the interesting hypothesis is not whether the value of a certain
coefficient is equal to zero, but whether it is equal to some other value. For
example, if one of the independent variables is merely the dependent variable
lagged by one period (i.e., an autoregressive term), then the interesting
question is whether its coefficient is equal to *one*. If so, then the
model is effectively predicting the *difference* in the dependent
variable, rather than predicting its level, in which case you can simplify the
model by differencing the dependent variable and deleting the lagged version of
itself from the list of independent variables.

Sometimes
patterns in the magnitudes and *signs* of lagged variables are of
interest. For example, if both X and LAG(X,1) are
included in the model, and their estimated coefficients turn out to have
similar magnitudes but opposite signs, this suggests that they could both be
replaced by a single DIFF(X) term.
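The reason equal-and-opposite coefficients on X and LAG(X,1) collapse into one DIFF(X) term is simple algebra: b·X(t) − b·X(t−1) = b·(X(t) − X(t−1)). A minimal Python sketch follows, in which `lag` and `diff` are hypothetical helpers standing in for the LAG and DIFF functions mentioned in the text, and the series and coefficient are made up.

```python
def lag(series, periods=1):
    """Shift a series back by `periods` steps (periods >= 1), padding with None."""
    return [None] * periods + series[:len(series) - periods]

def diff(series):
    """First difference: series[t] - series[t-1], with None in the first slot."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]

# Made-up illustrative series
x = [10.0, 12.0, 11.0, 15.0, 14.0]
lag_x = lag(x)
diff_x = diff(x)

# Hypothetical coefficient: +b on X and -b on LAG(X,1)
b = 0.7
two_term = [b * x[t] - b * lag_x[t] for t in range(1, len(x))]
one_term = [b * diff_x[t] for t in range(1, len(x))]
print(two_term)
print(one_term)
```

The two lists agree up to floating-point rounding, so a model containing both terms with coefficients of similar magnitude and opposite sign is contributing nearly the same thing as a model with the single differenced term.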

**(v)
Plots of forecasts and residuals:**
DO NOT FAIL TO LOOK AT PLOTS OF THE FORECASTS AND ERRORS. (Some software
makes this hard: it may be
necessary to execute a separate procedure or write additional code in order to
produce a single plot, and even a small amount of extra work is sometimes a
barrier to careful analysis.) Do
the forecasts "track" the data in a satisfactory way, apart from the
inevitable regression to the mean? (In the case of time series data, you are
especially concerned with how the model fits the data at the "business
end", i.e., the most recent values.
An example of a very bad fit is given here.)
Do the residuals appear random, or do you see some systematic patterns in their
signs or magnitudes? Are they free
from trends, autocorrelation, and heteroscedasticity? Are they normally
distributed? There are a variety of statistical tests for these sorts of
problems, but the best way to determine whether they are present and whether
they are serious is to *look at the
pictures*.
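As a numerical supplement to the plots (not a replacement for them), the lag-1 autocorrelation of the residuals gives a quick indication of whether consecutive errors tend to share the same sign. The sketch below uses a hypothetical function name and made-up residual series.

```python
def lag1_autocorrelation(residuals):
    """Sample autocorrelation of a residual series at lag 1."""
    n = len(residuals)
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    numer = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
                for t in range(1, n))
    return numer / denom

# Made-up residuals that drift steadily upward (a sign of a missed trend)
trending = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
# Made-up residuals that flip sign every period
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]

r_trend = lag1_autocorrelation(trending)
r_alt = lag1_autocorrelation(alternating)
print(r_trend, r_alt)
```

A value far from zero in either direction suggests a systematic pattern in the errors that the model has not captured.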

If
heteroscedasticity and/or non-normality is a problem, you may wish to consider
a nonlinear transformation of the dependent variable, such as logging or deflating,
if such transformations are appropriate for your data. (Remember that logging
converts multiplicative relationships to additive relationships, so that when
you log the dependent variable, you are implicitly assuming that the
relationships among the original variables are multiplicative.)
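To illustrate how logging turns a multiplicative relationship into an additive one, the sketch below generates data from a hypothetical model y = a·x^b with no noise, takes logs of both variables, and fits a straight line to the logged values. The function name, the parameter values, and the data are assumptions made for illustration.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b0 = y_mean - b1 * x_mean
    return b0, b1

# Hypothetical multiplicative relationship: y = a * x**b
a_true, b_true = 3.0, 0.5
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [a_true * xi ** b_true for xi in x]

# After logging, log(y) = log(a) + b*log(x), which is linear and additive
log_x = [math.log(xi) for xi in x]
log_y = [math.log(yi) for yi in y]
intercept, slope = fit_line(log_x, log_y)
a_est = math.exp(intercept)
print(slope, a_est)
```

The fitted slope recovers the exponent b and the exponentiated intercept recovers the multiplier a, which is the sense in which a linear model of logged variables encodes a multiplicative model of the original ones.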

If
autocorrelation is a problem, you should probably consider changing the model
so as to implicitly or explicitly include *lagged variables*--e.g., try
stationarizing the dependent and independent variables via differencing, or add
lags of the dependent and/or independent variables to the regression equation,
or introduce an autoregressive error correction. In Statgraphics, you can just enter
DIFF(X) or LAG(X,1) as the variable name if you want
to use the first difference or 1-period-lagged value of X in the input to a
procedure. In RegressIt, lagging
and differencing are options on the Variable
Transformation menu. Of course,
when working in Excel, it is possible to use formulas to create transformed
variables of any kind, although there are advantages to letting the software do
it for you: it makes the process more user-friendly, it reduces the possibility
for error, and it makes the output self-documenting in terms of how transformed
variables were created.

You do not
usually *rank* (i.e., choose among) models on the basis of their residual
diagnostic tests, but bad residual diagnostics indicate that the model's error
measures may be unreliable and that there are probably better models out there
somewhere.

**(vi) Out-of-sample validation:** If you have enough
data to hold out a sizable portion for validation and if your software offers this
feature, you should compare the performance of the models in the validation as
well as estimation periods. (See this page for an example of
out-of-sample validation.) A good
model should have small error measures in *both* the estimation and
validation periods, compared to other models, and its validation period
statistics should be similar to its own estimation period statistics.
Regression models with many independent variables are especially susceptible to
overfitting the data in the estimation period, so watch out for models that
have suspiciously low error measures in the estimation period and
disappointingly high error measures in the validation period.
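If your software does not offer a hold-out feature, the comparison can be done by hand. The following is a minimal sketch, assuming a simple one-variable model and a hold-out of the last 5 observations; the function names, the data-generating formula, and the split size are all hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    b1 = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
          / sum((xi - x_mean) ** 2 for xi in x))
    b0 = y_mean - b1 * x_mean
    return b0, b1

def rmse(x, y, b0, b1):
    """Root-mean-squared error of the fitted line on the given data."""
    return math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2
                         for xi, yi in zip(x, y)) / len(x))

# Made-up data: a linear trend plus a small repeating disturbance
noise = [0.5, -0.3, 0.2, -0.6, 0.1]
x = [float(t) for t in range(1, 21)]
y = [5.0 + 2.0 * xt + noise[i % 5] for i, xt in enumerate(x)]

# Hold out the most recent observations for validation
holdout = 5
x_est, y_est = x[:-holdout], y[:-holdout]
x_val, y_val = x[-holdout:], y[-holdout:]

# Fit on the estimation period only, then score both periods
b0, b1 = fit_line(x_est, y_est)
rmse_est = rmse(x_est, y_est, b0, b1)
rmse_val = rmse(x_val, y_val, b0, b1)
print(rmse_est, rmse_val)
```

The key design point is that the coefficients are estimated from the estimation period alone, so the validation-period error is a genuine out-of-sample measure; similar values in the two periods are what you hope to see.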

If the
variance of the errors in original, *untransformed* units is growing over
time due to inflation or compound growth, then the best statistic to use for
comparisons between the estimation and validation period is mean absolute *percentage*
error, rather than mean squared error or mean absolute error.

Although the
model's performance in the validation period is *theoretically* the best
indicator of its forecasting accuracy, especially for time series data, you
should be aware that the hold-out sample may not always be highly representative,
especially if it is small--say, less than 20 observations. If your validation
period statistics appear strange or contradictory, you may wish to experiment
by changing the number of observations held out. Sometimes the inclusion or
exclusion of a few unusual observations can make a big difference in the
comparative statistics of different models.

Also, be
aware that if you test a *large number of models* and rigorously rank them
on the basis of their validation period statistics, you may end up with just as
much "data snooping bias" as if you had only looked at
estimation-period statistics--i.e., you may end up picking a model that is more
lucky than good! The best defense against this is to choose the *simplest*
and most *intuitively plausible* model that gives comparatively good
results.
