What's the bottom line? How to compare models

Standard error of the regression and other measures of error size

Adjusted R-squared (not the bottom line!)

Significance of the estimated coefficients

Values of the estimated coefficients

Plots of forecasts and residuals (important!)

Out-of-sample validation

**(i) Standard error of the regression (root-mean-squared error adjusted for degrees of freedom):** Does the current regression model yield
smaller errors, on average, than the best model previously fitted, and is the
improvement significant in *practical*
terms? In regression modeling, the
best single error statistic to look at is the standard error of the regression (which
is approximately the standard deviation of the errors), insofar as this is what
your software is trying to minimize when estimating coefficients, and it is a
sufficient statistic for describing properties of the errors if the
model’s assumptions are all correct.
In the special case of time series data, does the current model improve
on the best *naive* (random walk or random trend) model? And how has it been doing lately? Are its most recent errors typical in
size and random-looking, or are they getting bigger or more biased?
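
As a rough illustration of this comparison, the following Python sketch computes the standard error of the regression for a simple trend-line model and for a random walk with drift, used here as one possible naive benchmark. The data values and the helper function names are made up for illustration and do not come from any particular software package.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def standard_error_of_regression(errors, num_coefficients):
    """Root-mean-squared error adjusted for degrees of freedom."""
    degrees_of_freedom = len(errors) - num_coefficients
    return math.sqrt(sum(e ** 2 for e in errors) / degrees_of_freedom)

# Made-up time series with a steady upward trend
y = [10.0, 12.1, 13.9, 16.2, 17.8, 20.1, 22.0, 23.9]
t = list(range(len(y)))

# Model 1: regression of Y on the time index (2 estimated coefficients)
b0, b1 = fit_line(t, y)
trend_errors = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
s_trend = standard_error_of_regression(trend_errors, num_coefficients=2)

# Model 2: random walk with drift; forecast = previous value + mean change
# (the mean change is the one estimated coefficient)
changes = [y[i] - y[i - 1] for i in range(1, len(y))]
drift = sum(changes) / len(changes)
rw_errors = [c - drift for c in changes]
s_rw = standard_error_of_regression(rw_errors, num_coefficients=1)

print(f"trend model: {s_trend:.3f}   random walk with drift: {s_rw:.3f}")
```

Both error measures are in the units of Y, and the two samples differ by only one observation, so the two values can be compared directly.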

You should
directly compare the standard error of the regression between models *only*
if their units are the same and they are fitted to the same (or almost the
same) set of observations of the same dependent variable. RegressIt provides a Model Summary Report that
shows side-by-side comparisons of error measures and coefficient estimates for
models fitted to the same dependent variable, in order to make such comparisons
easy, although sample sizes may vary if there are missing values in any
independent variables that are not included in all models. The Forecasting
procedure in Statgraphics also provides a Model Comparison Report that shows
side-by-side error statistics, and they are measured in the units of the
original data, even if the models used different transformations of the
dependent variable (e.g., logging, differencing, deflation, etc.). That is, it reverses whatever data
transformations were used in order to convert forecasts back into original
units before computing comparative error statistics. The residual plots show the errors in
transformed units, though, because that is where visual tests for non-normality
and heteroscedasticity should be applied.

In time series
forecasting, it is common to look not only at root-mean-squared error but also
the **mean absolute error** **(MAE)** and, for positive data, the **mean absolute percentage error** **(MAPE)** in evaluating and comparing
model performance. The latter
measures are easier for non-specialists to understand, and they are less
sensitive to extreme errors, which is an advantage if the occasional big mistake is not a serious
concern. Also, it is sometimes appropriate to compare MAPE between models
fitted to different samples of data, because it is a relative rather than
absolute measure. These two
statistics are not routinely reported by most regression software, however.
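
Since these statistics often have to be computed by hand, here is a minimal Python sketch of all three error measures. The function names and the numbers are hypothetical, chosen so that one forecast error is much larger than the others.

```python
import math

def root_mean_squared_error(actual, forecast):
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mean_absolute_error(actual, forecast):
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mean_absolute_percentage_error(actual, forecast):
    """MAPE in percent; only meaningful when all actual values are positive."""
    if any(a <= 0 for a in actual):
        raise ValueError("MAPE requires strictly positive actual values")
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Made-up data: three small errors and one large one (the last)
actual = [100.0, 200.0, 400.0, 500.0]
forecast = [110.0, 190.0, 400.0, 450.0]

rmse = root_mean_squared_error(actual, forecast)
mae = mean_absolute_error(actual, forecast)
mape = mean_absolute_percentage_error(actual, forecast)
print(f"RMSE = {rmse:.2f}, MAE = {mae:.2f}, MAPE = {mape:.2f}%")
```

In this example the single large error pulls RMSE well above MAE, which is the sensitivity to extreme errors mentioned above.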

**(ii) Adjusted R-squared:**
This is R-squared (the fraction by which the variance of the errors is less than
the variance of the dependent variable) adjusted for the number of coefficients
in the model relative to the sample size in order to correct it for bias (the
same adjustment used in computing the standard error of the regression). That is, adjusted R-squared is the
fraction by which the square of the standard error of the regression is less
than the variance of the dependent variable. **It is the most over-used and abused of all statistics--don't get obsessed with it.**
It is not the bottom line. All it measures is the percentage reduction in
mean-squared-error that the regression model achieves relative to the naive
model "Y=constant", which may or may not be the appropriate naive
model for purposes of comparison. Better to determine the *best* naive
model first, and then compare the various error measures of your regression
model (both in the estimation and validation periods) against that naive model.
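
The two definitions above can be sketched in Python as follows. The function names and the numbers are hypothetical, and the count of coefficients is assumed to include the constant.

```python
import statistics

def r_squared(y, errors):
    """Fraction by which the variance of the errors is less than the variance of y."""
    mean_y = statistics.fmean(y)
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    error_ss = sum(e ** 2 for e in errors)
    return 1.0 - error_ss / total_ss

def adjusted_r_squared(y, errors, num_coefficients):
    """Fraction by which the squared standard error of the regression
    is less than the sample variance of y."""
    error_ss = sum(e ** 2 for e in errors)
    s_squared = error_ss / (len(y) - num_coefficients)
    return 1.0 - s_squared / statistics.variance(y)

# Made-up values of the dependent variable and of the model's errors
y = [2.0, 4.0, 6.0, 8.0, 10.0]
errors = [1.0, -1.0, 0.0, -1.0, 1.0]

r2 = r_squared(y, errors)
adj_r2 = adjusted_r_squared(y, errors, num_coefficients=2)
print(f"R-squared = {r2:.3f}, adjusted R-squared = {adj_r2:.3f}")
```

Adjusted R-squared is always slightly below R-squared, and the gap widens as the number of coefficients grows relative to the sample size.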

Despite
the fact that adjusted R-squared is a unitless statistic, there is no absolute
standard for what is a "good" value. A regression model fitted to
non-stationary time series data can have an adjusted R-squared of 99% and yet
be inferior to a simple random walk model. On the other hand, a regression
model fitted to stationarized time series data might have an adjusted R-squared
of 10%-20% and be considered quite good. When working with stationary stock
return data, adjusted R-squared values as low as 5% might even be considered
significant if they hold up in out-of-sample validation (which they usually don't!)

**(iii) Significance of the estimated coefficients:** Are the t-statistics greater than 2
in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try
to *refit the model with the least significant variable excluded*, which
is the "backward stepwise" approach to model refinement.

Remember
that the t-statistic is just the *estimated coefficient divided by its own
standard error*. Thus, it measures "how many standard deviations from
zero" the estimated coefficient is, and it is used to test the hypothesis
that the true value of the coefficient is non-zero, in order to confirm that
the independent variable really belongs in the model.

The
p-value is the probability of observing a t-statistic that large or larger in
magnitude given the null hypothesis that the true coefficient value is zero. If
the p-value is greater than 0.05--which occurs roughly when the t-statistic is
less than 2 in absolute value--this means that the estimated coefficient may differ
from zero only by accident, i.e., it is not significantly different from zero at the 0.05 level.

*There's nothing magical about the 0.05
criterion*,
but in practice it usually turns out that a variable whose estimated
coefficient has a p-value of greater than 0.05 can be dropped from the model
without affecting the error measures very much--try it and see!
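
For reference, the t-statistic and an approximate p-value can be computed as in the Python sketch below. It uses the normal distribution as a large-sample stand-in for Student's t distribution, which is an assumption made here to keep the example within the standard library. Regression software uses the t distribution with the appropriate degrees of freedom, so its p-values will be somewhat larger in small samples. The coefficient and standard error are made-up numbers.

```python
from statistics import NormalDist

def t_statistic(coefficient, standard_error):
    """Estimated coefficient divided by its own standard error."""
    return coefficient / standard_error

def approx_two_sided_p_value(t):
    """Large-sample approximation: treats the t-statistic as standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# Hypothetical regression output for one independent variable
coefficient = 0.50
standard_error = 0.20

t = t_statistic(coefficient, standard_error)
p = approx_two_sided_p_value(t)
print(f"t = {t:.2f}, approximate p-value = {p:.4f}")

# The rule of thumb: |t| of about 2 corresponds to p of about 0.05
p_at_threshold = approx_two_sided_p_value(1.96)
```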

**(iv) Values of the estimated coefficients:** In general you are interested not only in
the *statistical* significance of an
independent variable but also in its *practical* significance. What does it imply in real terms? What have
you learned, and how should you spend your time or money? In theory, the
coefficient of a given independent variable is its proportional effect on the
average value of the dependent variable, other things being equal, i.e., the change in the
expected value of the dependent variable per unit change in that variable. In business and weapons-making, this is
often called "bang for the buck". Such information can be very useful
for decision-making if some of the independent variables are under your
control, for example, the amount of a drug administered to a patient, or the
amount of money spent on promoting a product. Keep in mind that when sample
sizes are very large, an effect that is really quite tiny (say, the marginal
benefit of an expensive new medical treatment) could appear to be quite large
if all you look at is its t-statistic!

In some
cases the interesting hypothesis is not whether the value of a certain coefficient
is equal to zero, but whether it is equal to some other value. For example, if
one of the independent variables is merely the dependent variable lagged by one
period (i.e., an autoregressive term), then the interesting question is whether
its coefficient is equal to *one*. If so, then the model is effectively
predicting the *difference* in the dependent variable, rather than
predicting its level, in which case you can simplify the model by differencing
the dependent variable and deleting the lagged version of itself from the list
of independent variables.
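
To spell out this reasoning in symbols (the names $\alpha$, $\phi$, $\beta$, and $\varepsilon_t$ are notation introduced here for illustration), suppose the fitted model is

$$Y_t = \alpha + \phi\,Y_{t-1} + \beta X_t + \varepsilon_t .$$

The statistic of interest compares the estimate to one rather than to zero:

$$t = \frac{\hat{\phi} - 1}{\mathrm{SE}(\hat{\phi})} .$$

If $\phi = 1$, the lagged term can be moved to the left-hand side, which gives a model for the difference:

$$Y_t - Y_{t-1} = \alpha + \beta X_t + \varepsilon_t .$$

This is only a sketch: when the true coefficient is one, the statistic above does not exactly follow the usual t distribution, so the conventional critical values should be taken as a rough guide.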

Sometimes
patterns in the magnitudes and *signs* of lagged variables are of
interest. For example, if both X and LAG(X,1) are included in the model, and
their estimated coefficients turn out to have similar magnitudes but opposite
signs, this suggests that they could both be replaced by a single DIFF(X) term.
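
In symbols, with $\beta_1$ and $\beta_2$ as illustrative names for the coefficients of X and LAG(X,1): if $\beta_2 \approx -\beta_1$, then

$$\beta_1 X_t + \beta_2 X_{t-1} \;\approx\; \beta_1\,(X_t - X_{t-1}) \;=\; \beta_1\,\mathrm{DIFF}(X)_t .$$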

**(v) Plots of forecasts and residuals:**
DO NOT FAIL TO GENERATE AND LOOK AT PLOTS OF THE FORECASTS AND ERRORS.
(Some software makes this hard: it
may be necessary to execute a separate procedure or write additional code in
order to produce a single plot, and even a small amount of extra work is
sometimes a barrier to careful analysis.)
Do the forecasts "track" the data in a satisfactory way, apart
from the inevitable regression-to-the-mean? (In the case of time series data,
you are especially concerned with how the model fits the data at the
"business end", i.e., the most recent values.) Do the residuals
appear random, or do you see some systematic patterns in their signs or
magnitudes? Are they free from
trends, autocorrelation, and heteroscedasticity? Are they normally distributed?
There are a variety of statistical tests for these sorts of problems, but the
best way to determine whether they are present and whether they are serious is
to *look at the pictures*.
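
A simple numerical check can complement the pictures, though it does not replace them. The Python sketch below computes the lag-1 autocorrelation of a list of residuals. The function name and both residual series are made up for illustration: one has long runs of the same sign, the other alternates in sign.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one immediately before it."""
    n = len(residuals)
    mean = sum(residuals) / n
    deviations = [r - mean for r in residuals]
    numerator = sum(deviations[i] * deviations[i - 1] for i in range(1, n))
    denominator = sum(d * d for d in deviations)
    return numerator / denominator

# Made-up residuals with runs of the same sign (positive autocorrelation)
runs = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

# Made-up residuals that flip sign every period (negative autocorrelation)
alternating = [1.0, -1.0] * 6

r_runs = lag1_autocorrelation(runs)
r_alternating = lag1_autocorrelation(alternating)
print(f"runs: {r_runs:.3f}   alternating: {r_alternating:.3f}")
```

Values near zero are what random-looking residuals would produce. Values far from zero in either direction suggest a systematic pattern worth examining in the plots.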

If
heteroscedasticity and/or non-normality is a problem, you may wish to consider
a nonlinear transformation of the dependent variable, such as logging or
deflating, if such transformations are appropriate for your data. (Remember
that logging converts multiplicative relationships to additive relationships,
so that when you log the dependent variable, you are implicitly assuming that
the relationships among the original variables are multiplicative.)
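
To see why, consider a regression of the logged dependent variable on two independent variables (the symbols $\beta_0$, $\beta_1$, $\beta_2$, and $\varepsilon$ are notation introduced here for illustration). Exponentiating both sides turns the sum into a product:

$$\log Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \quad\Longleftrightarrow\quad Y = e^{\beta_0}\, e^{\beta_1 X_1}\, e^{\beta_2 X_2}\, e^{\varepsilon} .$$

Each independent variable, and the error term as well, therefore has a multiplicative rather than additive effect on Y in its original units.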

If
autocorrelation is a problem, you should probably consider changing the model
so as to implicitly or explicitly include *lagged variables*--e.g., try stationarizing
the dependent and independent variables via differencing, or add lags of the
dependent and/or independent variables to the regression equation, or introduce
an autoregressive error correction.
In Statgraphics, you can just enter DIFF(X) or LAG(X,1) as the variable
name if you want to use the first difference or 1-period-lagged value of X in
the input to a procedure. In
RegressIt, lagging and differencing are options on the Variable
Transformation menu. Of course,
when working in Excel, it is possible to use formulas to create transformed
variables of any kind, although there are advantages to letting the software do
it for you: it makes the process more user-friendly, it reduces the possibility
for error, and it makes the output self-documenting in terms of how transformed
variables were created.
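
If you are working outside those packages, lagging and differencing are simple to code yourself. The Python functions below are a minimal sketch written for this illustration. The names `lag` and `diff` mimic the Statgraphics notation but are not part of any library, and missing values at the start of the series are represented by `None`.

```python
def lag(series, periods=1):
    """Value of the series `periods` steps earlier; the first `periods` entries are missing."""
    if not 0 <= periods <= len(series):
        raise ValueError("periods must be between 0 and the length of the series")
    return [None] * periods + list(series[:len(series) - periods])

def diff(series, periods=1):
    """Change in the series over `periods` steps; the first `periods` entries are missing."""
    return [
        None if previous is None else current - previous
        for current, previous in zip(series, lag(series, periods))
    ]

# Made-up series
x = [5, 7, 10, 14]

lagged_x = lag(x, 1)    # [None, 5, 7, 10]
diffed_x = diff(x, 1)   # [None, 2, 3, 4]
print(lagged_x, diffed_x)
```

Rows with missing values have to be dropped before fitting, which is why models with lagged or differenced variables are fitted to slightly fewer observations.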

Note: You
do not usually *rank* (i.e., choose among) models on the basis of their
residual diagnostic tests, but bad residual diagnostics indicate that the
model's error measures may be unreliable and that there are probably better
models out there somewhere.

**(vi) Out-of-sample validation:** If you have enough data to hold out a sizable portion
for validation (and if your software offers this feature), compare the
performance of the models in the validation as well as estimation periods. A
good model should have small error measures in *both* the estimation and
validation periods, compared to other models, and its validation period
statistics should be similar to its own estimation period statistics.
Regression models with many independent variables are especially susceptible to
overfitting the data in the estimation period, so watch out for models that
have suspiciously low error measures in the estimation period and
disappointingly high error measures in the validation period.
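
If your software does not offer a hold-out feature, the procedure can be carried out by hand along the following lines. This Python sketch uses made-up data, a simple one-variable regression, and plain root-mean-squared error in both periods. The function names and the choice to hold out the last 4 of 12 observations are assumptions made for illustration.

```python
import math

def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def rmse(errors):
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up data: 12 observations, the most recent 4 held out for validation
x = list(range(12))
y = [3.1, 4.9, 7.2, 8.8, 11.1, 13.0, 14.8, 17.1, 19.2, 20.9, 23.1, 24.8]
holdout = 4

x_est, y_est = x[:-holdout], y[:-holdout]
x_val, y_val = x[-holdout:], y[-holdout:]

# Coefficients are estimated from the estimation period only
b0, b1 = fit_line(x_est, y_est)

est_errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x_est, y_est)]
val_errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x_val, y_val)]

rmse_est = rmse(est_errors)
rmse_val = rmse(val_errors)
print(f"estimation RMSE = {rmse_est:.3f}, validation RMSE = {rmse_val:.3f}")
```

Here the two error measures are of similar size, which is the pattern a good model should show. A validation error several times larger than the estimation error would be a warning sign of overfitting.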

(Note: If
the variance of the errors in original, *untransformed* units is growing
over time due to inflation or compound growth, then the best statistic to use
for comparisons between the estimation and validation period is mean absolute *percentage*
error, rather than mean squared error or mean absolute error.)

Although
the model's performance in the validation period is *theoretically* the
best indicator of its forecasting accuracy, especially for time series data,
you should be aware that the hold-out sample may not always be highly
representative, especially if it is small--say, less than 20 observations. If
your validation period statistics appear strange or contradictory, you may wish
to experiment by changing the number of observations held out. Sometimes the
inclusion or exclusion of a few unusual observations can make a big
difference in the comparative statistics of different models.

Also, be
aware that if you test a *large number of models* and rigorously rank them
on the basis of their validation period statistics, you may end up with just as
much "data snooping bias" as if you had only looked at
estimation-period statistics--i.e., you may end up picking a model that is more
lucky than good! The best defense against this is to choose the *simplest*
and most *intuitively plausible* model that gives comparatively good
results.