As an example of the use of regression analysis for forecasting, let's consider the possibility of using another macroeconomic variable such as personal income to help us forecast auto sales. Personal income is chosen here as a predictor variable for two reasons: (i) it was advocated as a predictor of auto sales in a textbook previously used in this course, and (ii) it has been popular as a predictor variable for all kinds of things in student projects in this course in the past. The intuition behind using income as a predictor variable is obvious: the more income that consumers have to spend, the more money they will spend on automobiles and everything else--right? So let's see how well we can do with it.

As we have already seen, auto sales is a strongly seasonal variable, whereas personal income is not. We are by now familiar with the use of seasonal adjustment to account for seasonality in a forecasting model, so we will work with seasonally adjusted auto sales. We can use the Seasonal Decomposition procedure in Statgraphics to compute and store the seasonally adjusted values of AUTOSALE under another name--say, AUTOADJ. Meanwhile, personal income has been stored under the name INCOME. The first step in our analysis of the effect of INCOME on AUTOADJ should be to draw some plots. Here is a time series plot of both variables, as drawn by the Multiple XY Plot procedure:
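Multiplicative seasonal adjustment of the kind the Seasonal Decomposition procedure performs amounts to dividing each observation by the seasonal index for its month. Here is a minimal sketch in Python; the indices and data values are invented for illustration, not actual AUTOSALE numbers:

```python
# Hypothetical multiplicative seasonal indices (averaging 1.0) and data;
# real indices would come from a ratio-to-moving-average decomposition.
seasonal_index = {1: 0.90, 2: 0.95, 3: 1.15}   # month -> index
autosale = [(1, 9.0), (2, 9.5), (3, 11.5)]     # (month, observed sales)

# Seasonally adjusted value = observed value / seasonal index for that month
autoadj = [value / seasonal_index[month] for month, value in autosale]
print(autoadj)   # each adjusted value is approximately 10.0 here
```

A month whose index is 1.15 is typically 15% above the deseasonalized level, so dividing by the index removes that effect.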

The trend in INCOME is seen to match the trend in AUTOADJ very closely, although AUTOADJ seems to have a more pronounced cyclical pattern. But wait--both these variables are measured in nominal dollars. Perhaps inflation was responsible for much of the common trend. Let's deflate both series by the Consumer Price Index (1983=1.0) to see what happened in real terms:
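The deflation step itself is just elementwise division by the price index. A sketch with invented numbers (only the CPI base-year convention, 1983 = 1.0, comes from the text):

```python
# Deflating nominal series by the CPI (base year 1983 = 1.0).
# All data values below are made up for illustration.
autoadj = [10.2, 11.5, 13.0]   # seasonally adjusted sales, nominal dollars
income  = [51.0, 57.5, 65.0]   # personal income, nominal dollars
cpi     = [0.85, 1.00, 1.30]   # CPI for the same periods

real_sales  = [a / p for a, p in zip(autoadj, cpi)]
real_income = [i / p for i, p in zip(income, cpi)]
print(real_sales)   # first value: 10.2 / 0.85 = 12.0 in 1983 dollars
```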

We now see evidence of cyclical behavior in both series, although
it is still somewhat stronger in the AUTOADJ series. Next let's
ask: is there a significant *linear *relationship between
these two variables? A scatter plot (i.e., a plot of AUTOADJ/CPI
versus INCOME/CPI, drawn with the X-Y Plot procedure) will shed
light on this question:

Clearly there is strong evidence of a linear relationship. Let's proceed, then, to fit a linear regression model...

There are a number of different procedures which we could use
in Statgraphics to fit a simple regression model: Simple Regression,
Multiple Regression, Advanced Regression (in version 2.1), or
the Forecasting procedure. Let's try the usual all-purpose workhorse,
namely the Multiple Regression procedure. Now, the available data
extends from January 1970 through February 1996. In all of our
subsequent analysis, we will hold out the last 26 observations--i.e.,
everything from January 1994 onward--for purposes of validation.
(We have already used *all* the data to estimate seasonal
indices, but never mind that complication.) In the Multiple Regression
procedure, we can hold out the 1994-96 data by entering YEAR<1994
in the "Select" field on the Data Input panel.

Upon specifying AUTOADJ/CPI as the dependent variable and INCOME/CPI
as the independent variable, the Analysis Summary report
gives us the standard summary statistics for a regression model.
(In Statgraphics version 2, it also includes comments from the
StatAdvisor.) Note that the **R-squared** value appears quite
satisfactory: 72.7254%. In other words, by using INCOME/CPI as
a predictor, we have explained nearly 73% of the variance in AUTOADJ/CPI.
And as we would expect with such a high R-squared, the **estimated
coefficient** of INCOME/CPI is *very* significantly different
from zero: its **t-statistic** is greater than 27, whereas
anything greater than 2 in magnitude is normally considered significant.
The estimated coefficient is 0.0782192, whereas the standard error
of the coefficient is only 0.0028257. **The t-statistic is equal
to the estimated coefficient divided by its standard error**,
and hence represents the "number of standard errors from
zero."

The "Interval Plot" option gives us a plot of the fitted regression line superimposed on the scatter plot, which visually confirms the strong linear relationship:

So, is this a good model for forecasting? There are a few more
things we should look at before concluding that it is, if we are
careful. For example, the **Durbin-Watson statistic** in the
Analysis Summary report is 0.449251. The DW statistic tests for
the presence of significant autocorrelation--also known as **serial
correlation**--at lag 1, and a "good" value for the
DW statistic is something close to 2.0. I have no idea why this
statistic is ubiquitous in regression software: the program could
just as well report the lag-1 autocorrelation coefficient, or
even better, show you a graph of the residual autocorrelation
function! The DW stat is roughly equal to 2(1-a), where a is the
lag-1 autocorrelation coefficient. As a very rough rule of thumb,
you should be suspicious of a DW stat that is less than 1.4 (corresponding
to a lag-1 autocorrelation greater than 0.3) or a DW stat that
is greater than 2.6 (corresponding to a lag-1 autocorrelation
less than -0.3.) There is nothing magical about these values--in
fact, smaller tolerances should be used for sample sizes larger
than 50, as we have here. The StatAdvisor has already sounded
an alarm, commenting that "The Durbin-Watson (DW) statistic
tests the residuals to determine if there is any significant correlation
based on the order in which they occur in your data file. Since
the DW value is less than 1.4, there may be some indication of
serial correlation. Plot the residuals versus row order to see
if there is any pattern which can be seen." Plotting residuals
versus row number (i.e., versus time) is *always* a good
idea when you are dealing with time series data, and here is what
the plot looks like in this case:

Yikes! There is a rather serious problem here: the residuals clearly
have a *very* strong pattern of positive autocorrelation--notice
the long runs of errors with the same sign--which is perhaps a
result of the INCOME variable not fully explaining the cyclical
variations in AUTOADJ that we commented upon at the outset.

We might have noticed this problem earlier, and been in a better
position to deal with it, if we had used the Forecasting procedure
instead. The forecasting procedure includes many more tools for
manipulating and analyzing time series data. To fit a regression
model in the Forecasting procedure, set the Model Type to "Mean"
and then hit the "Regression" button. You then have
an opportunity to specify independent variables to be added to
the forecasting equation (in addition to a constant term). One
thing to watch out for: *if you are going to use regressor (independent)
variables, you cannot request any forecasts for the future to
be generated unless values for the regressors are available for
those periods.* In this case, we do not have any future data
on INCOME/CPI (a minor problem to which we shall return later),
so we will not generate any forecasts into the future. However,
we *will* hold out 26 values for validation, so that only
data prior to 1994 is used to fit the model, as in the Multiple
Regression procedure. The Analysis Summary report
shows many of the same summary statistics that we saw before,
with the notable exception of R-squared--which is no great loss!
Of course, the estimated coefficients and error statistics in
the estimation period are exactly the same as before.

One thing that this report includes which we did *not* see
before is a comparison of the model's performance in the estimation
and validation periods: the Mean Absolute Error in the estimation
period is 1.64644, whereas it rises to 3.05438 in the validation
period. The truth about the model's performance in the validation
period is even worse, as we see when we look at a plot of the
actual values and forecasts:

Not only are the errors bigger, on average, in the validation
period than in the estimation period, but in fact every single forecast
in the validation period (1994 and beyond) is significantly *below*
the actual value, and getting farther away as time goes on. This
is serial correlation with a vengeance! (By the way, you may notice
that this plot looks an awful lot like the Multiple XY plot of
the two input variables that we drew earlier. In fact, it is *exactly*
the same as the earlier plot except that the INCOME/CPI variable
has merely been rescaled and labeled as the "forecast"
for AUTOADJ/CPI: this is precisely what goes on in a simple regression
model.)

The Forecasting procedure of course includes an autocorrelation plot of the residuals, so we can see the full dimension of the problem:

This is as bad a residual autocorrelation plot as you would ever hope to see!
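The quantity that plot displays is just the sample autocorrelation of the residuals at a range of lags. A minimal sketch, with an invented residual series:

```python
def autocorrelations(e, max_lag=10):
    """Sample autocorrelations of a series at lags 1 through max_lag."""
    n = len(e)
    m = sum(e) / n
    d = [x - m for x in e]
    den = sum(x * x for x in d)
    return [sum(d[t] * d[t - k] for t in range(k, n)) / den
            for k in range(1, max_lag + 1)]

# Residuals that rise and fall in slow waves show positive low-lag values:
acf = autocorrelations([1, 2, 3, 4, 3, 2, 1, 2, 3, 4, 3, 2], max_lag=3)
print(acf)
```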

As if we haven't heaped enough abuse on this poor model, there
is one more unflattering comparison we can make: let's compare
it to some of the "boring" time-series models we have
considered previously: the **random walk (with and without growth),
simple exponential smoothing, and linear exponential smoothing
models**. The Model Comparison Report
shows something remarkable: all of the simple time series models
dramatically outperform the regression model, despite the latter's
impressive R-squared! The two best models appear to be the exponential
smoothing models: the simple exponential smoothing model does
slightly better in the estimation period, while the linear exponential
smoothing model does slightly better in the validation period,
perhaps because of the consistent upward trend in the latter period.
(The growth term does not appear to add much to the random walk
model, although presumably it would be better for longer-horizon
forecasts.) For example, the MAE for the linear exponential smoothing
model in the validation period is only 0.84, versus 3.05 for the regression
model. Here is a plot of the forecasts for the linear exponential
smoothing model:

...and here are the residual autocorrelations--rather more satisfactory than those of the regression model!
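One-step-ahead forecasts for the two smoothing models can be sketched in a few lines. The smoothing constant and data below are invented; in practice the constant would be estimated by minimizing in-sample error:

```python
def ses_forecasts(y, alpha=0.5):
    """Simple exponential smoothing: forecast = current smoothed level."""
    level = y[0]
    preds = []
    for obs in y[1:]:
        preds.append(level)              # forecast made before seeing obs
        level += alpha * (obs - level)   # update the level afterwards
    return preds                         # preds[i] forecasts y[i + 1]

def les_forecasts(y, alpha=0.5):
    """Brown's linear (double) exponential smoothing, one step ahead."""
    s1 = s2 = y[0]
    preds = []
    for obs in y[1:]:
        level = 2 * s1 - s2
        trend = (alpha / (1 - alpha)) * (s1 - s2)
        preds.append(level + trend)
        s1 += alpha * (obs - s1)
        s2 += alpha * (s1 - s2)
    return preds

print(ses_forecasts([10, 12, 14]))   # SES lags behind a steady trend...
print(les_forecasts([10, 12, 14]))   # ...while LES extrapolates it
```

This illustrates why the linear model pulls ahead in the validation period: when the series trends steadily upward, SES forecasts sit below the data while LES projects the trend forward.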

The morals of this story (so far) are:

- **A high R-squared does not necessarily signify a "good" regression model for forecasting purposes.**
- **A simple time series model may strongly outperform a more complicated regression model.**

What went wrong with the regression model? A variety of excuses might be found to explain its poor performance. For example, we noticed at the very beginning that the INCOME variable did not seem to have the same kind of cyclical behavior as the AUTOSALE variable--perhaps there are other economic indicator variables that could be added to the regression model to better capture the cyclicality. This is the "omitted variable" explanation which is frequently invoked to explain poor regression model performance.

But there are some deeper lessons here. One is that **it is
often perilous to regress one nonstationary time series on another
nonstationary time series, particularly if both have significant
trends**. No doubt you will obtain a high R-squared, but this
does not necessarily mean anything in such a case. Recall that
R-squared represents the "percent of variance explained"
in the dependent variable. Now, a variable which is nonstationary--e.g.,
a variable which is a true random walk and/or has a persistent
trend--does not have a "true" variance. The *sample*
variance merely grows as the sample size grows, and if the sample
size went to infinity (e.g., if we considered the "asymptotic"
properties of the model), the variance would also go to infinity.
Since the whole concept of a well-defined variance is questionable
for such a series, the concept of "percent of variance explained"
is questionable as well.

For example, take *any* two series with strong upward trends,
say, U.S. retail sales of automobiles (in nominal dollars) and
the population of Pakistan. If you compute their coefficient
of correlation (i.e., "r"), it may be greater than 0.95.
And if you regress one on the other, you may get an R-squared
greater than 90%. Does this mean one is a good predictor of the
other? Obviously not--the high R-squared merely means that one
series with a trend is much better predicted by another series
with a trend than by a "constant" model. (Remember
that R-squared essentially measures the reduction in variance
compared with the constant model.) But, **in such a case, you
could probably do even better--perhaps very much better--by
using a model that predicted the series from its own history,
such as a random walk, exponential smoothing, or ARIMA model**.

Does this mean that regression is not a useful forecasting technique?
Not at all! It just means that when you are working with *time
series* data, you need to be aware that **a regression model
may fail to exploit the "time dimension" unless the
variables are carefully chosen. In particular, you may wish to
consider using lagged and/or differenced variables in the forecasting
equation**, so that some of the *history* of the dependent
and/or independent variables, as well as their current values,
is used in the forecast.