Notes on linear
regression analysis (pdf file)

Introduction
to linear regression analysis

Mathematics
of simple regression

Regression examples

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs. price, part 2: fitting a simple
model

·
Beer sales vs. price, part 3: transformations
of variables

·
Beer sales vs.
price, part 4: additional predictors

What to look for in
regression output

What’s a good
value for R-squared?

What's the bottom
line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression analysis

Spreadsheet with
regression formulas

Stepwise and
all-possible-regressions

RegressIt: free Excel add-in for
linear regression and multivariate data analysis

**Stepwise and
all-possible-regressions**

**Stepwise
regression**
is a semi-automated process of building a model by successively adding or
removing variables based solely on the *t*-statistics of their estimated
coefficients. Properly used, the stepwise regression option in Statgraphics (or
other stat packages) puts more power and information at your fingertips than
does the ordinary multiple regression option, and it is especially useful for
sifting through large numbers of potential independent variables and/or
fine-tuning a model by poking variables in or out. Improperly used, it may converge
on a poor model while giving you a false sense of security. It's a bit like
doing carpentry with a chain saw: you can get a lot of work done quickly, but
you may end up doing more harm than good if you don't read the instructions,
remain sober, and keep a firm grip on the controls.

Suppose
you have some set of potential independent variables from which you wish to try
to extract the best subset for use in your forecasting model. (These are the
variables whose names you will type on the initial input screen.) The stepwise
option lets you either begin with *no* variables in the model and proceed *forward*
(adding one variable at a time), or start with *all* potential variables
in the model and proceed *backward* (removing one variable at a time). At
each step, the program performs the following calculations: for each variable
currently *in* the model, it computes the *t*-statistic for its
estimated coefficient, *squares* it, and reports this as its "*F*-to-remove"
statistic; for each variable *not* in the model, it computes the *t*-statistic
that its coefficient *would* have *if it were the next variable added*,
squares it, and reports this as its "*F*-to-enter" statistic. At
the next step, the program automatically enters the variable with the highest *F*-to-enter
statistic, or removes the variable with the lowest *F*-to-remove
statistic, in accordance with certain control parameters you have specified. So
the key relation to remember is: **F = t-squared**

In the
multiple regression procedure in Statgraphics (and most other full-featured
software packages), you can choose the stepwise variable selection option and
then specify the method as "Forward" or "Backward," and
also specify *threshold values* for *F*-to-enter and *F*-to-remove.
(You can also specify "None" for the method--which is the default
setting--in which case it just performs a straight multiple regression using
all the variables.) The program then proceeds automatically. Under the *forward*
method, at each step, it *enters the variable with the largest F-to-enter
statistic*, provided that this is *greater *than the threshold value
for *F*-to-enter. When there are no variables left to enter whose *F*-to-enter
statistics are above the threshold, it checks to see whether the *F*-to-remove
statistics of any variables added previously have fallen *below* the *F*-to-remove
threshold. If so, it removes the worst of them, and then tries to continue. It
finally stops when no variables either in or out of the model have *F*-statistics
on the wrong side of their respective thresholds. The *backward* method is
similar in spirit, except it starts with all variables in the model and
successively *removes the variable with the smallest F-to-remove statistic*,
provided that this is *less *than the threshold value for *F*-to-remove.

Whenever a
variable is entered, its new *F*-to-remove statistic is initially the same
as its old *F*-to-enter statistic, but the *F*-to-enter and *F*-to-remove
statistics of the *other* variables will generally all change. (Similarly,
when a variable is removed, its new *F*-to-enter statistic is initially the
same as its old *F*-to-remove statistic.) Until the *F*-to-enter and *F*-to-remove
statistics of the other variables are recomputed, it is impossible to tell what
the *next *variable to enter or remove will be. Hence, this process is *myopic*,
looking only one step forward or backward at any point. (Return
to top of page)

There is
no guarantee that the *best *model that can be constructed from the
available variables (or even a *good* model) will be found by this
one-step-ahead search procedure. Hence, when the procedure terminates, you should
study the sequence in which variables were added and deleted (which is usually
a part of the output), think about whether the variables that were included or
excluded make sense, and ask yourself if perhaps the addition or removal of a
few more variables might not lead to improvement. For example, the variable
with the lowest *F*-to-remove or highest *F*-to-enter may have just
missed the threshold value, in which case you may wish to tweak the *F*-values
and see what happens. Sometimes adding a variable with a marginal *F*-to-enter
statistic, or removing one with a marginal *F*-to-remove statistic, can
cause the *F*-to-enter statistics of other variables *not* in the
model to go *up* and/or or the *F*-to-remove statistics of other
variables *in* the model to go *down*, triggering a new chain of
entries or removals leading to a very different model.

While
you're studying the sequence of variables entered or removed, you should also
watch the value of the *adjusted R-squared *of the model, which is one of
the statistics shown. Usually it should get consistently larger as the stepwise
process works its magic, but *sometimes it may start getting smaller*
again. In this case you should make a note of which variables were in the model
when adjusted R-squared hit its largest value--you may wish to return to this
model later on by manually entering or removing variables. (Return
to top of page)

**Warning
#1:**
For all the models traversed in the same stepwise run, the *same data sample
is used*, namely the set of observations for which *all* variables
listed on the original input screen have non-missing values, because the
stepwise algorithm uses a correlation matrix calculated in advance from the
list of all candidate variables. (More about this below.) Therefore, be careful
about including variables which have many fewer observations than the other
variables, such as seasonal lags or differences, because they will shorten the
test period for all models whether they appear in them or not, and regardless
of whether "forward" or "backward" mode is used. After
selecting your final model, you may wish to return to the original input panel,
erase the names of all variables that weren't used in the final model, then
re-fit the model to be sure that the longest possible estimation period was
used.

**Warning
#2: **If
the number of variables listed on the original input screen is *large*
compared to the number of observations in your data set (say, more than 1
variable for every 10 observations), or if there is excessive multicollinearity
(linear dependence) among the variables, then the stepwise algorithm may go
crazy and end up throwing nearly all the variables into the model, especially
if you used a low *F*-to-enter or *F*-to-remove threshold. Watch out
for this.

**Warning
#3:**
Remember that *the computer is not necessarily right* in its choice of a
model during the automatic phase of the search. Don't accept a model just
because the computer gave it its blessing. Use your own judgment and intuition
about your data to try to fine-tune whatever the computer comes up with. (Return to top of page)

**What
method should you use: forward or backward?** If you have a very *large* set
of potential independent variables from which you wish to *extract* a
few--i.e., if you're on a fishing expedition--you should generally go *forward*.
If, on the other hand, if you have a *modest-sized* set of potential
variables from which you wish to *eliminate* a few--i.e., if you're
fine-tuning some prior selection of variables--you should generally go *backward*.
(If you're on a fishing expedition, you should still be careful not to cast too
wide a net, lest you dredge up variables that are only accidentally related to
your dependent variable.)

**What
values should you use for the F-to-enter and F-to-remove thresholds? **As noted above,
after Statgraphics (and many other programs) completes a *forward run*
based on the *F*-to-enter threshold, it takes a *backward look* based
on the *F*-to-remove threshold, and vice versa. Hence, *both *thresholds
come into play regardless of which method you are using, and *the F-to-enter
threshold must be greater than or equal to the F-to-remove threshold* (to
prevent cycling). Usually the two thresholds are set to the same value. Keeping
in mind that the *F*-statistics are *squares* of corresponding *t*-statistics,
an *F*-statistic equal to 4 would correspond to a *t*-statistic equal
to 2, which is the usual rule-of-thumb value for "significance at the 5%
level." (4 is the default value for both thresholds.) I recommend using a
somewhat smaller threshold value than 4 for the automatic phase of the
search--for example 3.5 or 3. Since the automatic stepwise
algorithm is myopic, it is usually OK to let it enter a few too many variables
in the model, and then you can weed out the marginal ones later on by hand.
However, *beware of using too low an F threshold if the number of variables
is large compared to the number of observations, or if there is a problem with
multicollinearity in your data (see warning #2 above)*. I would urge you to
resist the urge to use an F-threshold smaller than 3--often this opens the
gates to a horde of spurious regressors--and you should never forget to
"manually" apply your usual standards of significance to the
variables in the model at the end of the run. (Return to top of
page)

**Just in
case you're curious about how it's done:*** *At each step in the stepwise
process, the program must effectively fit a multiple regression model to the
variables *in* the model in order to obtain their *F*-to-remove
statistics, and it must effectively fit a separate regression model for *each*
of the variables *not* in the model in order to obtain their *F*-to-enter
statistics. When watching all this happen almost instantaneously on your
computer, you may wonder how it is done so *fast*. The secret is that it *doesn't*
have to fit all these models from scratch, and it doesn't need to reexamine all
the observations of each variable. Instead, the stepwise search process can be
carried out merely by performing a sequence of simple transformations on the
correlation matrix of the variables. The variables are only read in once, and
their correlation matrix is then computed (which takes only few seconds even if
there are very many variables). After this, the sequence of adding or removing
variables and recomputing the *F*-statistics requires only a simple
updating operation on the correlation matrix. This operation is called
"sweeping," and it is similar to the "pivoting" operation
that is at the heart of the simplex method of linear programming, if that means
anything to you. The computational simplicity of the stepwise regression
algorithm re-emphasizes the fact that, in fitting a multiple regression model,
the *only* information extracted from the data is the correlation matrix
of the variables and their individual means and standard deviations. The same
computational trick is **used in all-possible-regressions.** (Return
to top of page)

Stepwise regression often works reasonably well as an automatic variable
selection method, but this is not guaranteed. Sometimes it will take a wrong turn and
get stuck in a suboptimal region of model space, and sometimes the model it
selects will be just one out of a number of almost-equally-good models that
ought to be studied together.
**All-possible-regressions** goes beyond stepwise regression and
literally tests all possible subsets of the set of potential independent
variables. (This is the "Regression Model Selection" procedure in the
Advanced Regression module of Statgraphics.) If there are K potential
independent variables (besides the constant), then there are 2^{K}
distinct subsets of them to be tested (including the empty set which
corresponds to the mean model). For example, if you have 10 candidate
independent variables, the number of subsets to be tested is 2^{10},
which is 1024, and if you have 20 candidate variables, the number is 2^{20},
which is more than one million. The former analysis would run almost instantly
on your computer, the latter might take a few minutes, and with 30 variables it might
take all day.
All-possible-regressions carries all the caveats of stepwise regression,
*and more so.* This kind of data-mining is not guaranteed to yield the
model which is truly best for your data, and it may lead you to get absorbed in
top-10 rankings instead of carefully articulating your assumptions,
cross-validating your results, and comparing the error measures of different
models in real terms.

When using
an all-possible-regressions procedure, you are typically given the choice
between several numerical criteria on which to rank the models. The two most
commonly used are adjusted R-squared and the Mallows "Cp" statistic.
The latter statistic is related to adjusted R-squared, but includes a heavier
penalty for increasing the number of independent variables. Cp is not measured
on a 0-to-1 scale. Rather, its values are typically positive and greater than
1, and *lower values are better*. The models which yield the best (lowest)
values of Cp will tend to be similar to those that yield the best (highest)
values of adjusted R-squared, but the exact ranking may be slightly different.
Other things being equal, the Cp criterion tends to favor models with fewer
parameters, so it is perhaps more robust to over-fitting the data. Generally
you look at the plots of R-squared and Cp versus the number of variables to see
(a) where the point of diminishing returns is reaching in terms of the number
of variables, and (b) are there one or two models that seem to stand out above
the crowd, or are do there appear to be many equally good models. Then you can
look at the the actual rankings of models and try to find the optimum place to
make the "cut".

As with
any ranking scheme, it's easy to get lost in the trees and lose sight of the
forest: the differences in performance among the models near the top of the
rankings are not substantial. (Don't forget that there are dozens, hundreds, or
sometimes thousands of models down below!) An improvement in R-squared from, say,
75% to 76% is probably not worth increasing the complexity of the model by
adding more independent variables that are not otherwise well-motivated. In fact, it would only reduce the
standard deviation of the errors by 2% (as discussed here), which would not
noticeably shrink the confidence intervals for forecasts. And there's a very
real danger that automated data-mining will lead to the selection of a model
which lacks an intuitive explanation and/or performs poorly out-of-sample.

Finally,
remember that **a model selected by automatic methods can only find the
"best" combination from among the set of variables you start with**:
if you omit some important variables or *lags or transformations thereof*,
no amount of searching or ranking will compensate!

Among the
various automatic model-selection methods, I find that I generally prefer
stepwise to all-possible regressions. The stepwise approach is much faster,
it's less prone to overfit the data, you often learn something by watching the
order in which variables are removed or added, and it doesn't tend to drown you
in details of rankings data that cause you to lose sight of the big picture.

There are
other methods of parameter estimation and variable selection such as **ridge regression **and **lasso regression** that are designed to
deal with situations in which the candidate independent variables are highly
correlated with each and/or their number is large relative to the sample size,
but those methods are beyond the scope of this discussion.