Notes on linear regression analysis (pdf file)
Introduction to linear regression analysis
Mathematics of simple regression
Regression examples
· Beer sales vs. price, part 1: descriptive analysis
· Beer sales vs. price, part 2: fitting a simple model
· Beer sales vs. price, part 3: transformations of variables
· Beer sales vs. price, part 4: additional predictors
· NC natural gas consumption vs. temperature
· More regression datasets at regressit.com
What to look for in regression output
What's a good value for R-squared?
What's the bottom line? How to compare models
Testing the assumptions of linear regression
Additional notes on regression analysis
Stepwise and all-possible-regressions
Excel file with simple regression formulas
Excel file with regression formulas in matrix form
Notes on logistic regression (new!)
If you use Excel in your work or in your teaching to any extent, you should check out the latest release of RegressIt, a free Excel add-in for linear and logistic regression. See it at regressit.com. The linear regression version runs on both PCs and Macs and has a richer and easier-to-use interface and much better designed output than other add-ins for statistical analysis. It may make a good complement, if not a substitute, for whatever regression software you are currently using, Excel-based or otherwise. RegressIt is an excellent tool for interactive presentations, online teaching of regression, and development of videos of examples of regression modeling. It includes extensive built-in documentation and pop-up teaching notes as well as some novel features to support systematic grading and auditing of student work on a large scale. There is a separate logistic regression version with highly interactive tables and charts that runs on PCs. RegressIt also now includes a two-way interface with R that allows you to run linear and logistic regression models in R without writing any code whatsoever.
If you have been using Excel's own Data Analysis add-in for regression (Analysis Toolpak), now is the time to stop. It has not changed since it was first introduced in 1993, and it was a poor design even then. It's a toy (and a clumsy one at that), not a tool for serious work. Visit this page for a discussion: What's wrong with Excel's Analysis Toolpak for regression.
Stepwise and all-possible-regressions
Stepwise
regression
is a semi-automated process of building a model by successively adding or
removing variables based solely on the t-statistics of their estimated
coefficients. Properly used, the stepwise regression option in Statgraphics (or
other stat packages) puts more power and information at your fingertips than
does the ordinary multiple regression option, and it is especially useful for
sifting through large numbers of potential independent variables and/or
fine-tuning a model by poking variables in or out. Improperly used, it may
converge on a poor model while giving you a false sense of security. It's like
doing carpentry with a chain saw: you can get a lot of work done quickly, but
it leaves rough edges and you may end up cutting off your own foot if you don't
read the instructions, remain sober, engage your brain, and keep a firm grip on
the controls. It is not a tool for beginners or a substitute for education and
experience.
How it
works: Suppose you have some set of potential independent variables from which
you wish to try to extract the best subset for use in your forecasting model.
(These are the variables you will select on the initial input screen.) The
stepwise option lets you either begin with no variables in the model and
proceed forward (adding one variable at a time), or start with all
potential variables in the model and proceed backward (removing one
variable at a time). At each step, the program performs the following
calculations: for each variable currently in the model, it computes the t-statistic
for its estimated coefficient, squares it, and reports this as its
"F-to-remove" statistic; for each variable not in the
model, it computes the t-statistic that its coefficient would
have if it were the next variable added, squares it, and reports this as
its "F-to-enter" statistic. At the next step, the program
automatically enters the variable with the highest F-to-enter statistic,
or removes the variable with the lowest F-to-remove statistic, in
accordance with certain control parameters you have specified. The key relation to remember here is simply F = t-squared.
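To make the F = t-squared relationship concrete, here is a minimal numpy sketch (not Statgraphics or RegressIt code; the data are simulated purely for illustration) that fits one model and reports the squared t-statistics that would serve as F-to-remove values:

```python
import numpy as np

def t_stats(X, y):
    """OLS t-statistics for each coefficient; X already includes a constant column."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)            # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance matrix of the coefficients
    return beta / np.sqrt(np.diag(cov))

# Simulated data: y really depends on the first two columns only.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 3))
y = 2.0 + 1.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])            # constant plus all three candidates
t = t_stats(X, y)
print("t-statistics:", np.round(t, 2))
print("F-to-remove: ", np.round(t ** 2, 2))     # each F is just the square of its t
```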
In the
multiple regression procedure in most statistical software packages, you can
choose the stepwise variable selection option and then specify the method as
"Forward" or "Backward," and also specify threshold
values for F-to-enter and F-to-remove. (You can also specify
"None" for the method--which is the default setting--in which case it
just performs a straight multiple regression using all the variables.) The program
then proceeds automatically. Under the forward method, at each step, it enters
the variable with the largest F-to-enter statistic, provided that this is greater
than the threshold value for F-to-enter. When there are no variables
left to enter whose F-to-enter statistics are above the threshold, it
checks to see whether the F-to-remove statistics of any variables added
previously have fallen below the F-to-remove threshold. If so, it
removes the worst of them, and then tries to continue. It finally stops when no
variables either in or out of the model have F-statistics on the wrong
side of their respective thresholds. The backward method is similar in
spirit, except it starts with all variables in the model and successively removes
the variable with the smallest F-to-remove statistic, provided that this is
less than the threshold value for F-to-remove.
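For concreteness, here is a rough numpy sketch of the forward logic just described. It is a simplified illustration, not the actual algorithm used by Statgraphics or any other package; the function names, the simulated data, and the default thresholds are my own choices.

```python
import numpy as np

def ols_t(X, y):
    """Coefficient t-statistics for a design matrix X that already includes a constant."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta / se

def forward_stepwise(x, y, f_enter=4.0, f_remove=4.0):
    """Forward selection on the columns of x, using F = t**2 against the thresholds."""
    n, m = x.shape
    in_model = []
    while True:
        # F-to-enter: the squared t-statistic each excluded variable would get if added next.
        best_j, best_f = None, f_enter
        for j in range(m):
            if j in in_model:
                continue
            X = np.column_stack([np.ones(n), x[:, in_model + [j]]])
            f = ols_t(X, y)[-1] ** 2
            if f > best_f:
                best_j, best_f = j, f
        if best_j is None:
            break                               # nothing left above the F-to-enter threshold
        in_model.append(best_j)
        # Backward look: drop any variable whose F-to-remove has fallen below the threshold.
        while in_model:
            X = np.column_stack([np.ones(n), x[:, in_model]])
            f_in = ols_t(X, y)[1:] ** 2         # skip the constant's statistic
            if f_in.min() >= f_remove:
                break
            in_model.pop(int(f_in.argmin()))
    return in_model

# Simulated data for illustration: y depends on the first two columns only.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=(n, 5))
y = 2.0 + 1.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=n)
print(forward_stepwise(x, y))                   # typically selects columns 0 and 1
```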
Whenever a
variable is entered, its new F-to-remove statistic is initially the same
as its old F-to-enter statistic, but the F-to-enter and F-to-remove
statistics of the other variables will generally all change. (Similarly,
when a variable is removed, its new F-to-enter statistic is initially
the same as its old F-to-remove statistic.) Until the F-to-enter
and F-to-remove statistics of the other variables are recomputed, it is
impossible to tell what the next variable to enter or remove will be.
Hence, this process is myopic, looking only one step forward or backward
at any point.
There is no
guarantee that the best model that can be constructed from the available
variables (or even a good model) will be found by this one-step-ahead
search procedure. Hence, when the procedure terminates, you should study the
sequence in which variables were added and deleted (which is usually a part of
the output), think about whether the variables that were included or excluded
make sense, and ask yourself if perhaps the addition or removal of a few more
variables might not lead to improvement. For example, the variable with the
lowest F-to-remove or highest F-to-enter may have just missed the
threshold value, in which case you may wish to tweak the F-values and
see what happens. Sometimes adding a variable with a marginal F-to-enter
statistic, or removing one with a marginal F-to-remove statistic, can
cause the F-to-enter statistics of other variables not in the
model to go up and/or the F-to-remove statistics of other
variables in the model to go down, triggering a new chain of
entries or removals leading to a very different model.
While you're
studying the sequence of variables entered or removed, you should also watch
the value of the adjusted R-squared of the model, which is one of the
statistics shown. Usually it should get consistently larger as the stepwise
process works its magic, but sometimes it may start getting smaller
again. In this case you should make a note of which variables were in the model
when adjusted R-squared hit its largest value--you may wish to return to this
model later on by manually entering or removing variables.
Warning
#1:
For all the models traversed in the same stepwise run, the same data sample
is used, namely the set of observations for which all variables
listed on the original input screen have non-missing values, because the
stepwise algorithm uses a correlation matrix calculated in advance from the
list of all candidate variables. (More about this below.) Therefore, be careful
about including variables which have many fewer observations than the other
variables, such as seasonal lags or differences, because they will shorten the
estimation period for all models, whether they appear in them or not, and regardless
of whether "forward" or "backward" mode is used. After
selecting your final model, you may wish to return to the original input panel,
erase the names of all variables that weren't used in the final model, then
re-fit the model to be sure that the longest possible estimation period was
used.
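As a small illustration of how this listwise deletion plays out, here is a hypothetical pandas sketch (the variable names and the 12-period seasonal lag are invented for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'lag12', a seasonal lag, is missing for the first 12 rows.
rng = np.random.default_rng(1)
df = pd.DataFrame({"y": rng.normal(size=60),
                   "x1": rng.normal(size=60),
                   "x2": rng.normal(size=60)})
df["lag12"] = df["x1"].shift(12)

all_candidates = ["y", "x1", "x2", "lag12"]      # everything listed on the input screen
final_model = ["y", "x1", "x2"]                  # what the stepwise run ends up keeping

# Sample available to the stepwise run: complete cases on ALL candidates (48 rows) ...
print(len(df.dropna(subset=all_candidates)))
# ... versus the sample available when the final model is re-fit on its own variables (60 rows).
print(len(df.dropna(subset=final_model)))
```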
Warning
#2: If
the number of variables that you select for testing is large compared to
the number of observations in your data set (say, more than 1 variable for
every 10 observations), or if there is excessive multicollinearity (linear
dependence) among the variables, then the algorithm may go crazy and end up
throwing nearly all the variables into the model, especially if you used a low F-to-enter
or F-to-remove threshold.
Warning #3: Sometimes you have a
subset of variables that ought to be treated as a group (say, dummy variables
for seasons of the year) or which ought to be included for logical
reasons. Stepwise regression may
blindly throw some of them out, in which case you should manually put them back
in later.
Warning
#4:
Remember that the computer is not necessarily right in its choice of a
model during the automatic phase of the search. Don't accept a model just
because the computer gave it its blessing. Use your own judgment and intuition
about your data to try to fine-tune whatever the computer comes up with.
Warning #5: Automated regression model selection
methods only look for the most informative variables from among those you start
with, in the limited context of a linear prediction equation, and they cannot make something out of nothing.
If you have insufficient quantity or
quality of data, or if you omit some important variables or fail to use data
transformations when they are needed, or if the assumption of linear or
linearizable relationships is simply wrong, no amount of searching or ranking
will compensate. The most important
steps in statistical analysis are (a) doing your homework before you begin, and
(b) collecting and organizing the relevant data.
See this
page for more details of the dangers and deficiencies of stepwise
regression.
What
method should you use: forward or backward? If you have a very large set
of potential independent variables from which you wish to extract a
few--i.e., if you're on a fishing expedition--you should generally go forward.
If, on the other hand, you have a modest-sized set of potential
variables from which you wish to eliminate a few--i.e., if you're
fine-tuning some prior selection of variables--you should generally go backward.
(If you're on a fishing expedition, you should still be careful not to cast too
wide a net, lest you dredge up variables that are only accidentally related to
your dependent variable.)
What
values should you use for the F-to-enter and F-to-remove thresholds? As noted above, after
the computer completes a forward run based on the F-to-enter
threshold, it usually takes a backward look based on the F-to-remove
threshold, and vice versa. Hence, both thresholds come into play
regardless of which method you are using, and the F-to-enter threshold must
be greater than or equal to the F-to-remove threshold (to prevent cycling).
Usually the two thresholds are set to the same value. Keeping in mind that the F-statistics
are squares of corresponding t-statistics, an F-statistic
equal to 4 would correspond to a t-statistic equal to 2, which is the
usual rule-of-thumb value for "significance at the 5% level." (4 is
the default value for both thresholds.) I recommend using a somewhat smaller
threshold value than 4 for the automatic phase of the search--for example 3.5
or 3 (but not less than that). Since the automatic stepwise algorithm
is myopic, it is usually OK to let it enter a few too many variables in the
model, and then you can weed out the marginal ones later on by hand. However, beware
of using too low an F threshold if the number of variables is large compared to
the number of observations, or if there is a problem with multicollinearity in
your data (see warning #2 above).
Often this opens the gates to a horde of spurious regressors--and in any
case you should manually apply your usual
standards of relevance and significance to the variables in the model at the
end of the run. Don't just blindly accept the computer's choice.
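In terms of the illustrative forward_stepwise sketch shown earlier (a hypothetical helper, not a packaged routine), relaxing the thresholds for the automatic phase simply means passing smaller values and then reviewing the marginal variables by hand:

```python
# Looser thresholds for the automatic phase (3 rather than the rule-of-thumb 4),
# followed by a manual review of the marginal variables that were let in.
selected = forward_stepwise(x, y, f_enter=3.0, f_remove=3.0)
print("columns entered:", selected)
```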
Just in
case you're curious about how it's done: At each step in the stepwise process,
the program must effectively fit a multiple regression model to the variables in
the model in order to obtain their F-to-remove statistics, and it must
effectively fit a separate regression model for each of the variables not
in the model in order to obtain their F-to-enter statistics. When
watching all this happen almost instantaneously on your computer, you may
wonder how it is done so fast. The secret is that it doesn't have
to fit all these models from scratch, and it doesn't need to reexamine all the
observations of each variable. Instead, the stepwise search process can be
carried out merely by performing a sequence of simple transformations on the
correlation matrix of the variables. The variables are only read in once, and
their correlation matrix is then computed (which takes only a few seconds even if
there are very many variables). After this, the sequence of adding or removing
variables and recomputing the F-statistics requires only a simple
updating operation on the correlation matrix. This operation is called
"sweeping," and it is similar to the "pivoting" operation
that is at the heart of the simplex method of linear programming, if that means
anything to you. The computational simplicity of the stepwise regression
algorithm re-emphasizes the fact that, in fitting a multiple regression model,
the only information extracted from the data is the correlation matrix
of the variables and their individual means and standard deviations. The same
computational trick is used in all-possible-regressions.
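Here is a rough numpy sketch of the sweep operation itself, just to show the flavor. It uses one common textbook definition of the sweep operator applied to the augmented correlation matrix; it is not code from any particular package, the data are simulated, and removing a variable (which uses a closely related reverse sweep) is omitted.

```python
import numpy as np

def sweep(a, k):
    """Sweep the symmetric matrix a on pivot index k, returning a new matrix."""
    d = a[k, k]
    b = a - np.outer(a[:, k], a[k, :]) / d      # update the non-pivot entries
    b[k, :] = a[k, :] / d                       # rescale the pivot row ...
    b[:, k] = a[:, k] / d                       # ... and the pivot column
    b[k, k] = -1.0 / d
    return b

# Simulated data; the augmented correlation matrix has y in the last position.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 3))
y = 1.5 * x[:, 0] - 0.8 * x[:, 1] + rng.normal(size=n)
data = np.column_stack([x, y])
R = np.corrcoef(data, rowvar=False)

# "Entering" the first two variables amounts to two sweeps of the correlation matrix.
S = sweep(sweep(R, 0), 1)
print("standardized coefficients:", S[[0, 1], -1])
print("R-squared:", 1.0 - S[-1, -1])

# Cross-check against a direct least-squares fit on the standardized variables.
z = (data - data.mean(axis=0)) / data.std(axis=0)
print("direct fit:", np.linalg.lstsq(z[:, [0, 1]], z[:, -1], rcond=None)[0])
```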
Stepwise regression often works reasonably well as an automatic variable
selection method, but this is not guaranteed. Sometimes it will take a wrong turn and
get stuck in a suboptimal region of model space, and sometimes the model it
selects will be just one out of a number of almost-equally-good models that
ought to be studied together.
All-possible-regressions goes beyond stepwise regression and
literally tests all possible subsets of the set of potential independent
variables. (This is the "Regression Model Selection" procedure in
Statgraphics.) If there are K potential independent variables (besides the constant), then there are 2^K distinct subsets of them to be tested (including the empty set, which corresponds to the mean model). For example, if you have 10 candidate independent variables, the number of subsets to be tested is 2^10 = 1024, and if you have 20 candidate variables, the number is 2^20, which is more than one million. The former analysis would run almost instantly on your computer; the latter might take a few minutes, and with 30 variables it might take hours.
All-possible-regressions carries all the caveats of stepwise regression, and then some. This kind of data-mining is not guaranteed to yield the
model which is truly best for your data, and it may lead you to get absorbed in
top-10 rankings instead of carefully articulating your assumptions,
cross-validating your results, and comparing the error measures of different
models in real terms.
When using
an all-possible-regressions procedure, you are typically given the choice between
several numerical criteria on which to rank the models. The two most commonly
used are adjusted R-squared and the Mallows "Cp" statistic. The
latter statistic is related to adjusted R-squared, but includes a heavier
penalty for increasing the number of independent variables. Cp is not measured
on a 0-to-1 scale. Rather, its values are typically positive and greater than
1, and lower values are better. The models which yield the best (lowest)
values of Cp will tend to be similar to those that yield the best (highest)
values of adjusted R-squared, but the exact ranking may be slightly different.
Other things being equal, the Cp criterion tends to favor models with fewer
parameters, so it is a bit less likely to overfit the data. Generally
you look at the plots of R-squared and Cp versus the number of variables to see
(a) where the point of diminishing returns is reached in terms of the number of
variables, and (b) whether there are one or two models that stand out above the
crowd, or whether there are many almost-equally-good models. Then you can look
at the actual rankings of models and try to find the optimum place to make
the "cut".
As with any
ranking scheme, it's easy to get lost in the trees and lose sight of the
forest: the differences in performance among the models near the top of the
rankings may not be substantial. (Don't forget that there are dozens, hundreds,
or sometimes thousands of models down below!) An improvement in R-squared from,
say, 75% to 76% is probably not worth increasing the complexity of the model by
adding more independent variables that are not otherwise well-motivated. In fact, it would only reduce the
standard deviation of the errors by 2% (as discussed here), which would not
noticeably shrink the confidence intervals for forecasts. And there's a very
real danger that automated data-mining will lead to the selection of a model
which lacks an intuitive explanation and/or performs poorly out-of-sample.
Among the various
automatic model-selection methods, I find that I generally prefer stepwise to
all-possible regressions. The stepwise approach is much faster, it's less prone
to overfit the data, you often learn something by watching the order in which
variables are removed or added, and it doesn't tend to drown you in details of
rankings data that cause you to lose sight of the big picture.
There are
other methods of parameter estimation and variable selection such as ridge regression and lasso regression that are designed to
deal with situations in which the candidate independent variables are highly
correlated with each other and/or their number is large relative to the sample size,
but those methods are beyond the scope of this discussion.