RegressIt: free Excel add-in for linear regression and multivariate data analysis

**Stepwise
regression**
is a semi-automated process of building a model by successively adding or
removing variables based solely on the *t*-statistics of their estimated
coefficients. Properly used, the stepwise regression option in Statgraphics (or
other stat packages) puts more power and information at your fingertips than
does the ordinary multiple regression option, and it is especially useful for
sifting through large numbers of potential independent variables and/or
fine-tuning a model by poking variables in or out. Improperly used, it may
converge on a poor model while giving you a false sense of security. It's a bit
like doing carpentry with a chain saw: you can get a lot of work done quickly, but
you may end up doing more harm than good if you don't read the instructions,
remain sober, and keep a firm grip on the controls.

Suppose
you have some set of potential independent variables from which you wish to try
to extract the best subset for use in your forecasting model. (These are the
variables whose names you will type on the initial input screen.) The stepwise
option lets you either begin with *no* variables in the
model and proceed *forward* (adding one variable at a time), or
start with *all* potential variables in the model and proceed *backward*
(removing one variable at a time). At each step, the program performs the
following calculations: for each variable currently *in* the model, it
computes the *t*-statistic for its estimated coefficient, *squares*
it, and reports this as its "*F*-to-remove" statistic; for each
variable *not* in the model, it computes the *t*-statistic that its
coefficient *would* have *if it were the next variable added*,
squares it, and reports this as its "*F*-to-enter" statistic. At
the next step, the program automatically enters the variable with the highest *F*-to-enter
statistic, or removes the variable with the lowest *F*-to-remove
statistic, in accordance with certain control parameters you have specified. So
the key relation to remember is: **F = t-squared**
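
To make the F = t-squared relation concrete, here is a small sketch in Python with simulated data (not Statgraphics' own code; the variable names and helper function are invented for illustration). The F-to-remove of a variable in the model is the square of its current *t*-statistic, and the F-to-enter of a variable outside the model is the square of the *t*-statistic it would have if added next:

```python
import numpy as np

def t_statistics(X, y):
    """t-statistics of OLS coefficients (intercept first)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    s2 = resid @ resid / (n - X1.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X1.T @ X1)))
    return beta / se

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

# Current model contains only x1: its F-to-remove is its squared t-statistic.
f_to_remove_x1 = t_statistics(x1.reshape(-1, 1), y)[1] ** 2

# F-to-enter for x2: the squared t-statistic its coefficient *would* have
# if x2 were the next variable added to the model.
f_to_enter_x2 = t_statistics(np.column_stack([x1, x2]), y)[2] ** 2
```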

On the
right-mouse-button options of the Multiple Regression procedure in
Statgraphics, you can specify the method as "Forward" or
"Backward," and also specify *threshold values* for *F*-to-enter
and *F*-to-remove. (You can also specify "None" for the
method--which is the default setting--in which case it just performs a straight
multiple regression using all the variables.) The program then proceeds
automatically. Under the *forward* method, at each step, it *enters the
variable with the largest F-to-enter statistic*, provided that this is *greater* than the threshold value for *F*-to-enter. When there are no variables
left to enter whose *F*-to-enter statistics are above the threshold, it
checks to see whether the *F*-to-remove statistics of any variables added
previously have fallen *below* the *F*-to-remove threshold. If so, it
removes the worst of them, and then tries to continue. It finally stops when no
variables either in or out of the model have *F*-statistics on the wrong
side of their respective thresholds. The *backward* method is similar in
spirit, except it starts with all variables in the model and successively *removes the variable with the smallest F-to-remove statistic*, provided that this is *less* than the threshold
value for *F*-to-remove.
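
As a sketch of the forward procedure just described, here is a toy Python implementation on simulated data. It naively refits each candidate model from scratch rather than using Statgraphics' fast updating scheme, and the function name and thresholds are illustrative assumptions:

```python
import numpy as np

def t_statistics(X, y):
    """t-statistics of OLS coefficients (intercept first)."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    s2 = resid @ resid / (n - X1.shape[1])
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X1.T @ X1)))
    return beta / se

def forward_stepwise(X, y, f_enter=4.0, f_remove=4.0):
    """Forward stepwise selection driven by F-to-enter / F-to-remove."""
    included, excluded = [], list(range(X.shape[1]))
    while True:
        # Enter the excluded variable with the largest F-to-enter,
        # provided that it exceeds the threshold.
        best_j, best_f = None, f_enter
        for j in excluded:
            f = t_statistics(X[:, included + [j]], y)[-1] ** 2
            if f > best_f:
                best_j, best_f = j, f
        if best_j is not None:
            included.append(best_j)
            excluded.remove(best_j)
            continue
        # No entries possible: remove the included variable with the
        # smallest F-to-remove if it has fallen below the threshold.
        if included:
            f = t_statistics(X[:, included], y)[1:] ** 2
            worst = int(np.argmin(f))
            if f[worst] < f_remove:
                excluded.append(included.pop(worst))
                continue
        break
    return included

rng = np.random.default_rng(1)
n = 80
X = rng.normal(size=(n, 5))                 # 5 candidate predictors
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)
selected = forward_stepwise(X, y)           # should pick up columns 0 and 1
```

Note that keeping `f_enter >= f_remove`, as discussed below, is what prevents a just-removed variable from immediately re-entering.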

Whenever a
variable is entered, its new *F*-to-remove statistic is initially the same
as its old *F*-to-enter statistic, but the *F*-to-enter and *F*-to-remove
statistics of the *other* variables will generally all change. (Similarly,
when a variable is removed, its new *F*-to-enter statistic is initially
the same as its old *F*-to-remove statistic.) Until the *F*-to-enter
and *F*-to-remove statistics of the other variables are recomputed, it is
impossible to tell what the *next* variable to enter or remove will be.
Hence, this process is *myopic*, looking only one step forward or backward
at any point.

There is
no guarantee that the *best* model that can be constructed from the
available variables (or even a *good* model) will be found by this
one-step-ahead search procedure. Hence, when the procedure terminates, you
should study the sequence of variables added and deleted (shown in the Analysis
Summary report, below the estimated coefficients), think about whether the
variables that were included or excluded make sense, and ask yourself whether the addition or removal of a few more variables might lead to
improvement. For example, the variable with the lowest *F*-to-remove or
highest *F*-to-enter may have just missed the threshold value, in which
case you may wish to tweak the *F*-values and see what happens. Sometimes
adding a variable with a marginal *F*-to-enter statistic, or removing one
with a marginal *F*-to-remove statistic, can cause the *F*-to-enter
statistics of other variables *not* in the model to go *up* and/or
the *F*-to-remove statistics of other variables *in* the model to go *down*,
triggering a new chain of entries or removals leading to a very different
model.

While
you're studying the sequence of variables entered or removed, you should also
watch the value of the *adjusted R-squared* of the model, which is one of
the statistics shown. Usually it should get consistently larger as the stepwise
process works its magic, but *sometimes it may start getting smaller*
again. In this case you should make a note of which variables were in the model
when adjusted R-squared hit its largest value--you may wish to return to this
model later on by manually entering or removing variables.

**Warning
#1:**
For all the models traversed in the same stepwise run, Statgraphics uses the *same
estimation period*, namely the set of observations for which *all*
variables listed on the original input screen are "present"
(non-missing). Therefore, be careful about including variables which have many
fewer observations than the other variables, such as seasonal lags or
differences, because they will shorten the test period for all models whether
they appear in them or not, and regardless of whether "forward" or
"backward" mode is used. After selecting your final model, you may
wish to return to the original input panel, erase the names of all variables
that weren't used in the final model, then re-fit the model to be sure that the
longest possible estimation period was used.
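
A quick illustration of the issue (in Python/NumPy rather than Statgraphics; the series and its length are hypothetical): a lag-12 candidate variable is undefined for the first 12 observations, and listwise deletion then discards those rows for *every* model in the run, whether or not it uses the lagged variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 60                               # five years of monthly data
x = rng.normal(size=n)
y = rng.normal(size=n)

# A seasonal (lag-12) candidate: missing for the first 12 months.
x_lag12 = np.full(n, np.nan)
x_lag12[12:] = x[:-12]

# Listwise deletion: only rows where *every* listed variable is present
# are used -- even for sub-models that never include x_lag12.
data = np.column_stack([y, x, x_lag12])
complete = ~np.isnan(data).any(axis=1)
usable = int(complete.sum())         # 48 usable observations instead of 60
```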

**Warning
#2:** If
the number of variables listed on the original input screen is *large*
compared to the number of observations in your data set (say, more than 1
variable for every 10 observations), or if there is excessive multicollinearity
(linear dependence) among the variables, then the stepwise algorithm may go
crazy and end up throwing nearly all the variables into the model, especially
if you used a low *F*-to-enter or *F*-to-remove threshold. Watch out
for this.

**Warning
#3:**
Remember that *the computer is not necessarily right* in its choice of a
model during the automatic phase of the search. Don't accept a model just
because the computer gave it its blessing. Use your own judgment and intuition
about your data to try to fine-tune whatever the computer comes up with.

**What
method should you use: forward or backward?** If you have a very *large* set
of potential independent variables from which you wish to *extract* a
few--i.e., if you're on a fishing expedition--you should generally go *forward*.
If, on the other hand, you have a *modest-sized* set of potential
variables from which you wish to *eliminate* a few--i.e., if you're
fine-tuning some prior selection of variables--you should generally go *backward*.
(If you're on a fishing expedition, you should still be careful not to cast too
wide a net, lest you dredge up variables that are only accidentally related to
your dependent variable.)

**What
values should you use for the F-to-enter and F-to-remove thresholds?** As noted above,
after Statgraphics completes a *forward run* based on the *F*-to-enter
threshold, it takes a *backward look* based on the *F*-to-remove
threshold, and vice versa. Hence, *both* thresholds come into play
regardless of which method you are using, and *the F-to-enter threshold must
be greater than or equal to the F-to-remove threshold* (to prevent cycling).
Usually the two thresholds are set to the same value. Keeping in mind that the *F*-statistics
are *squares* of corresponding *t*-statistics, an *F*-statistic
equal to 4 would correspond to a *t*-statistic equal to 2, which is the
usual rule-of-thumb value for "significance at the 5% level." (4 is
the default value for both thresholds.) I recommend using a somewhat smaller
threshold value than 4 for the automatic phase of the search--for example 3.5
or 3. Since the automatic stepwise algorithm is myopic, it is
usually OK to let it enter a few too many variables in the model, and then you
can weed out the marginal ones later on by hand. However, *beware of using
too low an F threshold if the number of variables is large compared to the number
of observations, or if there is a problem with multicollinearity in your data
(see warning #2 above)*. I would urge you to resist the temptation to use an *F*-threshold smaller than 3--this often opens the gates to a horde of spurious regressors--and you should never forget to
"manually" apply your usual standards of significance to the
variables in the model at the end of the run.

**Just in
case you're curious about how it's done:** At each step in the stepwise
process, the program must effectively fit a multiple regression model to the
variables *in* the model in order to obtain their *F*-to-remove
statistics, and it must effectively fit a separate regression model for *each*
of the variables *not* in the model in order to obtain their *F*-to-enter
statistics. When watching all this happen almost instantaneously in
Statgraphics, you may wonder how it is done so *fast*. The secret is that
it *doesn't* have to fit all these models from scratch, and it doesn't
need to reexamine all the observations of each variable. Instead, the stepwise
search process can be carried out merely by performing a sequence of simple
transformations on the correlation matrix of the variables. The variables are
only read in once (when you first hit OK after filling in the input panel), and
their correlation matrix is then computed (which may take a few seconds if
there are very many variables). After this--i.e., once you get to the
right-mouse-button options where the stepwise process can be
enabled--the sequence of adding or removing variables and recomputing
the *F*-statistics requires only a simple updating operation on the
correlation matrix. (This operation is called "sweeping," and it is
similar to the "pivoting" operation that is at the heart of the
simplex method of linear programming, if that means anything to you.) The
computational simplicity of the stepwise regression algorithm re-emphasizes the
fact that, in fitting a multiple regression model, the *only* information
extracted from the data is the correlation matrix of the variables and their
individual means and standard deviations. The same computational trick is **used
in all-possible-regressions.**

**All-possible-regressions** goes beyond
stepwise regression and literally tests all possible subsets of the set of
potential independent variables. (This is the "Regression Model
Selection" procedure in the Advanced Regression module of Statgraphics.)
If there are K potential independent variables (besides the constant), then
there are 2^K - 1 distinct subsets of them to be tested (counting the full set
but excluding the empty set which corresponds to the mean model). For example,
if you have 10 potential independent variables, the number of subsets to be
tested is 2^10 - 1 = 1023, and if you have 20 potential independent variables,
the number is 2^20 - 1, or more than one million. The former might take a few seconds to run on a desktop computer; the latter might take hours--and 30
variables would probably be out of the question! All-possible-regressions
carries all the caveats of stepwise regression, *and more so.* This kind
of data-mining is not guaranteed to yield the model which is truly best for
your data, and it may lead you to get absorbed in top-10 rankings instead of
carefully articulating and validating the assumptions underlying your model. If
you insist on using this technique--particularly with large numbers of
independent variables--it behooves you to take the rankings with a grain of
salt and do some out-of-sample testing later.
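
The brute-force enumeration can be sketched in a few lines of Python on simulated data (a naive illustration that refits every subset from scratch, unlike the sweep-based shortcut used by real packages; the function name and ranking criterion here are illustrative):

```python
import numpy as np
from itertools import combinations

def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit with intercept."""
    n, k = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    rss = resid @ resid
    tss = ((y - y.mean()) ** 2).sum()
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))

rng = np.random.default_rng(3)
n, K = 60, 4
X = rng.normal(size=(n, K))
y = 1.5 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

# Test all 2^K - 1 non-empty subsets and rank them by adjusted R-squared.
results = []
for r in range(1, K + 1):
    for subset in combinations(range(K), r):
        results.append((adjusted_r2(X[:, subset], y), subset))
results.sort(reverse=True)
best_score, best_subset = results[0]
```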

When using
an all-possible-regressions procedure, you are typically given the choice
between several numerical criteria on which to rank the models. The two most
commonly used are adjusted R-squared and the Mallows "Cp"
statistic. The latter statistic is related to adjusted
R-squared, but includes a heavier penalty for increasing the number of
independent variables. Cp is not measured on a 0-to-1
scale. Rather, its values are typically positive and greater than 1, and *lower
values are better*. The models which yield the best (lowest) values of Cp will tend to be similar to those that yield the best
(highest) values of adjusted R-squared, but the exact ranking may be slightly
different. Other things being equal, the Cp criterion
tends to favor models with fewer parameters, so it is perhaps more robust to
over-fitting the data. Generally you look at the plots of R-squared and Cp versus the number of variables to see (a) where the
point of diminishing returns is reached in terms of the number of variables,
and (b) whether there are one or two models that seem to stand out above the crowd, or
whether there appear to be many equally good models. Then you can look at the actual rankings of models and try to find the optimum
place to make the "cut".
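
As a sketch of how Cp is computed (in Python, on simulated data): Cp = RSS_p / s^2 - (n - 2p), where s^2 is the error variance estimated from the full model and p is the number of parameters including the intercept. A well-specified subset has Cp close to p, and the full model has Cp exactly equal to its own p by construction:

```python
import numpy as np
from itertools import combinations

def rss(X1, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    r = y - X1 @ beta
    return r @ r

rng = np.random.default_rng(4)
n, K = 60, 4
X = rng.normal(size=(n, K))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=n)

# Error variance is estimated from the *full* model (all K variables).
s2 = rss(np.column_stack([np.ones(n), X]), y) / (n - K - 1)

# Cp = RSS_p / s2 - (n - 2p); lower is better.
cp_by_subset = {}
for r in range(1, K + 1):
    for subset in combinations(range(K), r):
        X1 = np.column_stack([np.ones(n), X[:, subset]])
        p = r + 1                       # parameters incl. the intercept
        cp_by_subset[subset] = rss(X1, y) / s2 - (n - 2 * p)

best = min(cp_by_subset, key=cp_by_subset.get)
```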

As with
any ranking scheme, it's easy to get lost in the trees and lose sight of the
forest: the differences in performance among the models near the top of the
rankings are not substantial. (Don't forget that there are dozens, hundreds, or
sometimes thousands of models down below!) An improvement in R-squared from,
say, 65% to 70% is not necessarily worth the trouble of increasing the
complexity of the model by adding several more variables. And there's a very
real danger that automated data-mining will lead to the selection of a model
which lacks an intuitive explanation and/or performs poorly out-of-sample.

Finally,
remember that **a model selected by automatic methods can only find the
"best" combination from among the set of variables you start with**:
if you omit some important variables or *lags or transformations thereof*,
no amount of searching or ranking will compensate!

Among the various automatic model-selection methods, I find that I generally prefer stepwise to all-possible regressions. The stepwise approach is much faster, it's somewhat less prone to overfit the data, and it doesn't tend to drown you in details of rankings data that cause you to lose sight of the big picture.

**Related topics:**

· Introduction to linear regression

· What to look for in regression output

· Testing the assumptions of linear regression

· What’s the bottom line: how to compare models

· Additional notes on regression analysis