RegressIt: free Excel add-in for linear regression and multivariate data analysis

**Stepwise regression** is a
semi-automated process of building
a model by successively adding or removing variables based solely
on the *t*-statistics of their estimated coefficients. Properly
used, the stepwise regression option in Statgraphics (or other
stat packages) puts more power and information at your fingertips
than does the ordinary multiple regression option, and it is especially
useful for sifting through large numbers of potential independent
variables and/or fine-tuning a model by poking variables in or
out. Improperly used, it may converge on a poor model while giving
you a false sense of security. It's a bit like doing carpentry
with a chain saw: you can get a lot work done quickly, but you
may end up doing more harm than good if you don't read the
instructions,
remain sober, and keep a firm grip on the controls.

Suppose you have some set of
potential independent variables from
which you wish to try to extract the best subset for use in your
forecasting model. (These are the variables whose names you will
type on the initial input screen.) The stepwise option lets you
either begin with *no* variables in the model and proceed
*forward* (adding one variable at a time), or start with
*all* potential variables in the model and proceed *backward*
(removing one variable at a time). At each step, the program performs
the following calculations: for each variable currently *in*
the model, it computes the *t*-statistic for its estimated
coefficient, *squares* it, and reports this as its "*F*-to-remove"
statistic; for each variable *not* in the model, it computes
the *t*-statistic that its coefficient *would* have
*if it were the next variable added*, squares it, and reports
this as its "*F*-to-enter" statistic. At the next
step, the program automatically enters the variable with the highest
*F*-to-enter statistic, or removes the variable with the
lowest *F*-to-remove statistic, in accordance with certain
control parameters you have specified. So the key relation to
remember is: **F = t-squared**

On the right-mouse-button
options of the Multiple Regression procedure
in Statgraphics, you can specify the method as "Forward"
or "Backward," and also specify *threshold values*
for *F*-to-enter and *F*-to-remove. (You can also specify
"None" for the method--which is the default setting--in
which case it just performs a straight multiple regression using
all the variables.) The program then proceeds automatically. Under the *forward*
method, at each step, it *enters the
variable with the largest F-to-enter statistic*, provided that
this is *greater *than the threshold value for *F*-to-enter.
When there are no variables left to enter whose *F*-to-enter
statistics are above the threshold, it checks to see whether the
*F*-to-remove statistics of any variables added previously
have fallen *below* the *F*-to-remove threshold. If
so, it removes the worst of them, and then tries to continue.
It finally stops when no variables either in or out of the model
have *F*-statistics on the wrong side of their respective
thresholds. The *backward* method is similar in spirit, except
it starts with all variables in the model and sucessively *removes
the variable with the smallest F-to-remove statistic*, provided
that this is *less *than the threshold value for *F*-to-remove.

Whenever a variable is entered,
its new *F*-to-remove statistic
is initially the same as its old *F*-to-enter statistic,
but the *F*-to-enter and *F*-to-remove statistics of
the *other* variables will generally all change. (Similarly,
when a variable is removed, its new *F*-to-enter statistic
is initially the same as its old *F*-to-remove statistic.)
Until the *F*-to-enter and *F*-to-remove statistics
of the other variables are recomputed, it is impossible to tell
what the *next *variable to enter or remove will be. Hence,
this process is *myopic*, looking only one step forward or
backward at any point.

There is no guarantee that the *best
*model that can be constructed
from the available variables (or even a *good* model) will
be found by this one-step-ahead search procedure. Hence, when
the procedure terminates, you should study the sequence of variables
added and deleted (shown in the Analysis Summary report, below
the estimated coefficients), think about whether the variables
that were included or excluded make sense, and ask yourself if
perhaps the addition or removal of a few more variables might
not lead to improvement. For example, the variable with the lowest
*F*-to-remove or highest *F*-to-enter may have just
missed the threshold value, in which case you may wish to tweak
the *F*-values and see what happens. Sometimes adding a variable
with a marginal *F*-to-enter statistic, or removing one with
a marginal *F*-to-remove statistic, can cause the *F*-to-enter
statistics of other variables *not* in the model to go *up*
and/or or the *F*-to-remove statistics of other variables
*in* the model to go *down*, triggering a new chain
of entries or removals leading to a very different model.

While you're studying the
sequence of variables entered or removed,
you should also watch the value of the *adjusted R-squared *of
the model, which is one of the statistics shown. Usually it should
get consistently larger as the stepwise process works its magic,
but *sometimes it may start getting smaller* again. In this
case you should make a note of which variables were in the model
when adjusted R-squared hit its largest value--you may wish to
return to this model later on by manually entering or removing
variables.

**Warning #1:** For all the
models traversed in the same stepwise
run, Statgraphics uses the *same estimation period*, namely
the set of observations for which *all* variables listed
on the original input screen are "present" (non-missing).
Therefore, be careful about including variables which have many
fewer observations than the other variables, such as seasonal
lags or differences, because they will shorten the test period
for all models whether they appear in them or not, and regardless
of whether "forward" or "backward" mode is
used. After selecting your final model, you may wish to return
to the original input panel, erase the names of all variables
that weren't used in the final model, then re-fit the model to
be sure that the longest possible estimation period was used.

**Warning #2: **If the
number of variables listed on the original
input screen is *large* compared to the number of observations
in your data set (say, more than 1 variable for every 10 observations),
or if there is excessive multicollinearity (linear dependence)
among the variables, then the stepwise algorithm may go crazy
and end up throwing nearly all the variables into the model, especially
if you used a low *F*-to-enter or *F*-to-remove threshold.
Watch out for this.

**Warning #3:** Remember
that *the computer is not necessarily
right* in its choice of a model during the automatic phase
of the search. Don't accept a model just because the computer
gave it its blessing. Use your own judgment and intuition about
your data to try to fine-tune whatever the computer comes up with.

**What method should you use:
forward or backward?** If you
have a very *large* set of potential independent variables
from which you wish to *extract* a few--i.e., if you're on
a fishing expedition--you should generally go *forward*.
If, on the other hand, if you have a *modest-sized* set of
potential variables from which you wish to *eliminate* a
few--i.e., if you're fine-tuning some prior selection of variables--you
should generally go *backward*. (If you're on a fishing
expedition,
you should still be careful not to cast too wide a net, lest you
dredge up variables that are only accidentally related to your
dependent variable.)

**What values should you use
for the F-to-enter and F-to-remove
thresholds? **As noted above, after Statgraphics completes a
*forward run* based on the *F*-to-enter threshold, it
takes a *backward look* based on the *F*-to-remove
threshold,
and vice versa. Hence, *both *thresholds come into play
regardless
of which method you are using, and *the F-to-enter threshold
must be greater than or equal to the F-to-remove threshold*
(to prevent cycling). Usually the two thresholds are set to the
same value. Keeping in mind that the *F*-statistics are *squares*
of corresponding *t*-statistics, an *F*-statistic equal
to 4 would correspond to a *t*-statistic equal to 2, which
is the usual rule-of-thumb value for "significance at the
5% level." (4 is the default value for both thresholds.)
I recommend using a somewhat smaller threshold value than 4 for
the automatic phase of the search--for example 3.5 or 3.
Since
the automatic stepwise algorithm is myopic, it is usually OK to let it
enter a few too many variables in the model, and then you can
weed out the marginal ones later on by hand. However, *beware
of using too low an F threshold if the number of variables is
large compared to the number of observations, or if there is a
problem with multicollinearity in your data (see warning #2 above)*.
I would urge you to resist the urge to use an F-threshold smaller
than 3--often this opens the gates to a horde of spurious
regressors--and you should never forget to "manually"
apply your usual standards of significance to the variables in
the model at the end of the run.

**Just in case you're curious
about how it's done:*** *At
each step in the stepwise process, the program must effectively
fit a multiple regression model to the variables *in* the
model in order to obtain their *F*-to-remove statistics,
and it must effectively fit a separate regression model for *each*
of the variables *not* in the model in order to obtain their
*F*-to-enter statistics. When watching all this happen almost
instantaneously in Statgraphics, you may wonder how it is done
so *fast*. The secret is that it *doesn't* have to fit
all these models from scratch, and it doesn't need to reexamine
all the observations of each variable. Instead, the stepwise search
process can be carried out merely by performing a sequence of
simple transformations on the correlation matrix of the variables.
The variables are only read in once (when you first hit OK after
filling in the input panel), and their correlation matrix is then
computed (which may take a few seconds if there are very many
variables). After this--i.e., once you get to the right-mouse-button
options where the* *stepwise process can be enabled--the
sequence of adding or removing variables and recomputing the *F*-statistics
requires only a simple updating operation on the correlation matrix.
(This operation is called "sweeping," and it is similar
to the "pivoting" operation that is at the heart of
the simplex method of linear programming, if that means anything
to you.) The computational simplicity of the stepwise regression
algorithm re-emphasizes the fact that, in fitting a multiple regression
model, the *only* information extracted from the data is
the correlation matrix of the variables and their individual means
and standard deviations. The same computational trick is **used
in all-possible-regressions.**

**All-possible-regressions**
goes beyond stepwise regression
and literally tests all possible subsets of the set of potential
independent variables. (This is the "Regression Model Selection"
procedure in the Advanced Regression module of
Statgraphics.) If there are K potential independent variables
(besides the constant), then there are 2^K-1 distinct subsets
of them to be tested (counting the full set but excluding the
empty set which corresponds to the mean model). For example, if
you have 10 potential independent variables, the number of subsets
to be tested is 2^10 - 1 = 1023, and if you have 20 potential
independent variables, the number is 2^20, or more than one million.
The former might take a few seconds to run on a desktop computer, the
latter might take hours--and 30 variables would probably be out
of the question! All-possible-regressions carries all the caveats
of stepwise regression, *and more so.* This kind of data-mining
is not guaranteed to yield the model which is truly best for your
data, and it may lead you to get absorbed in top-10 rankings instead
of carefully articulating and validating the assumptions underlying
your model. If you insist on using this technique--particularly
with large numbers of independent variables--it behooves you to
take the rankings with a grain of salt and do some out-of-sample
testing later.

When using an
all-possible-regressions procedure, you are typically
given the choice between several numerical criteria on which to
rank the models. The two most commonly used are adjusted R-squared
and the Mallows "Cp" statistic. The latter statistic
is related to adjusted R-squared, but includes a heavier penalty
for increasing the number of independent variables. Cp is not
measured on a 0-to-1 scale. Rather, its values are typically positive
and greater than 1, and *lower values are better*. The models
which yield the best (lowest) values of Cp will tend to be similar
to those that yield the best (highest) values of adjusted R-squared,
but the exact ranking may be slightly different. Other things
being equal, the Cp criterion tends to favor models with fewer
parameters, so it is perhaps more robust to over-fitting the data.
Generally you look at the plots of R-squared and Cp versus the
number of variables to see (a) where the point of diminishing
returns is reaching in terms of the number of variables, and (b)
are there one or two models that seem to stand out above the crowd,
or are do there appear to be many equally good models. Then you
can look at the the actual rankings of models and try to find
the optimum place to make the "cut".

As with any ranking scheme, it's easy to get lost in the trees and lose sight of the forest: the differences in performance among the models near the top of the rankings are not substantial. (Don't forget that there are dozens, hundreds, or sometimes thousands of models down below!) An improvement in R-squared from, say, 65% to 70% is not necessarily worth the trouble of increasing the complexity of the model by adding several more variables. And there's a very real danger that automated data-mining will lead to the selection of a model which lacks an intuitive explanation and/or performs poorly out-of-sample.

Finally, remember that **a
model selected by automatic methods
can only find the "best" combination from among the
set of variables you start with**: if you omit some important
variables or *lags or transformations thereof*, no amount
of searching or ranking will compensate!

Among the various automatic model-selection methods, I find that I generally prefer stepwise to all-possible regressions. The stepwise approach is much faster, it's somewhat less prone to overfit the data, and it doesn't tend to drown you in details of rankings data that cause you to lose sight of the big picture.