Example of regression analysis: predicting auto sales from personal income

[Warning:  this is an example of a bad regression model.  For examples of both bad and good models, and how transformations of variables are sometimes needed to get from one to the other, see the beer sales analysis on the RegressIt web site.]

As an example of the use of regression analysis for forecasting, let's consider the possibility of using another macroeconomic variable such as personal income to help us forecast auto sales. Personal income is chosen here as a predictor variable for two reasons: (i) it was advocated as a predictor of auto sales in a textbook previously used in this course, and (ii) it has been popular as a predictor variable for all kinds of things in student projects in this course in the past. The intuition behind using income as a predictor variable is obvious: the more income that consumers have to spend, the more money they will spend on automobiles and everything else--right? So let's see how well we can do with it.

As we have already seen, auto sales is a strongly seasonal variable, whereas personal income is not. We are by now familiar with the use of seasonal adjustment to account for seasonality in a forecasting model, so we will work with seasonally adjusted auto sales. We can use the Seasonal Decomposition procedure in Statgraphics to compute and store the seasonally adjusted values of AUTOSALE under another name--say, AUTOADJ. Meanwhile, personal income has been stored under the name INCOME. The first step in our analysis of the effect of INCOME on AUTOADJ should be to draw some plots. Here is a time series plot of both variables, as drawn by the Multiple XY Plot procedure:

The trend in INCOME is seen to match the trend in AUTOADJ very closely, although AUTOADJ seems to have a more pronounced cyclical pattern. But wait--both these variables are measured in nominal dollars. Perhaps inflation was responsible for much of the common trend. Let's deflate both series by the Consumer Price Index (1983=1.0) to see what happened in real terms:
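The deflation step is simple arithmetic: each nominal value is divided by the price index for the same month. Here is a minimal Python sketch, assuming the series are held in plain lists. The function name and the numbers are made up for illustration and are not the course data:

```python
def deflate(nominal, price_index):
    """Convert nominal dollars to real dollars by dividing by a price index.

    The index is assumed to equal 1.0 in the base period (1983 in the text).
    """
    if len(nominal) != len(price_index):
        raise ValueError("series must have the same length")
    return [value / index for value, index in zip(nominal, price_index)]


# Hypothetical values for illustration only (not the course data).
income_nominal = [1000.0, 1100.0, 1210.0]
cpi = [1.00, 1.05, 1.10]

income_real = deflate(income_nominal, cpi)
print([round(v, 1) for v in income_real])
```

In this toy example nominal income grows by 21% over the three periods, but about half of that growth disappears once the price index is divided out.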

We now see evidence of cyclical behavior in both series, although it is still somewhat stronger in the AUTOADJ series. Next let's ask: is there a significant linear relationship between these two variables? A scatter plot (i.e., a plot of AUTOADJ/CPI versus INCOME/CPI, drawn with the X-Y Plot procedure) will shed light on this question:

Clearly there is strong evidence of a linear relationship. Let's proceed, then, to fit a linear regression model...

There are a number of different procedures which we could use in Statgraphics to fit a simple regression model: Simple Regression, Multiple Regression, Advanced Regression (in version 2.1), or the Forecasting procedure. Let's try the usual all-purpose workhorse, namely the Multiple Regression procedure. Now, the available data extends from January 1970 through February 1996. In all of our subsequent analysis, we will hold out the last 26 observations--i.e., everything from January 1994 onward--for purposes of validation. (We have already used all the data to estimate seasonal indices, but never mind that complication.) In the Multiple Regression procedure, we can hold out the 1994-96 data by entering YEAR<1994 in the "Select" field on the Data Input panel.

Upon specifying AUTOADJ/CPI as the dependent variable and INCOME/CPI as the independent variable, the Analysis Summary report gives us the standard summary statistics for a regression model. (In Statgraphics version 2, it also includes comments from the StatAdvisor.) Note that the R-squared value appears quite satisfactory: 72.7254%. In other words, by using INCOME/CPI as a predictor, we have explained nearly 73% of the variance in AUTOADJ/CPI. And as we would expect with such a high R-squared, the estimated coefficient of INCOME/CPI is very significantly different from zero: its t-statistic is greater than 27, whereas anything greater than 2 in magnitude is normally considered significant. The estimated coefficient is 0.0782192, whereas the standard error of the coefficient is only 0.0028257. The t-statistic is equal to the estimated coefficient divided by its standard error, and hence represents the "number of standard errors from zero."
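To make the arithmetic concrete, here is a short Python sketch. It first recomputes the t-statistic from the two figures quoted in the report, and then shows where a slope, its standard error, and R-squared come from in a simple regression. The helper simple_ols and the five data points are hypothetical, written only for this illustration; they are not what Statgraphics runs internally.

```python
import math


def simple_ols(x, y):
    """Fit y = a + b*x by least squares; return (a, b, se_b, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    sse = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - mean_y) ** 2 for yi in y)                  # total
    se_b = math.sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return a, b, se_b, 1.0 - sse / sst


# The t-statistic implied by the figures quoted in the report above.
coef, std_err = 0.0782192, 0.0028257
print(f"t = {coef / std_err:.2f}")

# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, se_b, r2 = simple_ols(x, y)
print(f"slope = {b:.3f}, std. error = {se_b:.3f}, "
      f"t = {b / se_b:.1f}, R-squared = {r2:.3f}")
```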

The "Interval Plot" option gives us a plot of the fitted regression line superimposed on the scatter plot, which visually confirms the strong linear relationship:

So, is this a good model for forecasting? There are a few more things we should look at before concluding that it is, if we are careful. For example, the Durbin-Watson statistic in the Analysis Summary report is 0.449251. The DW statistic tests for the presence of significant autocorrelation--also known as serial correlation--at lag 1, and a "good" value for the DW statistic is something close to 2.0. I have no idea why this statistic is ubiquitous in regression software: the program could just as well report the lag-1 autocorrelation coefficient, or even better, show you a graph of the residual autocorrelation function! The DW stat is roughly equal to 2(1-a), where a is the lag-1 autocorrelation coefficient. As a very rough rule of thumb, you should be suspicious of a DW stat that is less than 1.4 (corresponding to a lag-1 autocorrelation greater than 0.3) or a DW stat that is greater than 2.6 (corresponding to a lag-1 autocorrelation less than -0.3). There is nothing magical about these values--in fact, smaller tolerances should be used for sample sizes larger than 50, as we have here. The StatAdvisor has already sounded an alarm, commenting that "The Durbin-Watson (DW) statistic tests the residuals to determine if there is any significant correlation based on the order in which they occur in your data file. Since the DW value is less than 1.4, there may be some indication of serial correlation. Plot the residuals versus row order to see if there is any pattern which can be seen." Plotting residuals versus row number (i.e., versus time) is always a good idea when you are dealing with time series data, and here is what the plot looks like in this case:
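The relationship between the DW statistic and the lag-1 autocorrelation is easy to check numerically. The Python sketch below inverts the approximation DW = 2(1-a) for the value reported above, which gives an implied lag-1 autocorrelation of about 0.78, and then compares the two quantities on a simulated series. The simulated "residuals" are an assumption made for illustration (an autocorrelated series with coefficient 0.8), not the residuals of the auto sales model.

```python
import random


def durbin_watson(resid):
    """Sum of squared successive differences divided by sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def lag1_autocorr(resid):
    """Sample lag-1 autocorrelation of a series."""
    mean = sum(resid) / len(resid)
    dev = [e - mean for e in resid]
    num = sum(dev[t] * dev[t - 1] for t in range(1, len(dev)))
    return num / sum(d * d for d in dev)


# Lag-1 autocorrelation implied by the DW value reported above.
dw_reported = 0.449251
implied_a = 1.0 - dw_reported / 2.0
print(f"implied lag-1 autocorrelation = {implied_a:.3f}")

# Simulated residuals (hypothetical, not the course data): an AR(1) series
# with coefficient 0.8, centered so that it has mean zero like residuals
# from a regression with a constant term.
random.seed(1)
e, series = 0.0, []
for _ in range(288):
    e = 0.8 * e + random.gauss(0.0, 1.0)
    series.append(e)
mean = sum(series) / len(series)
resid = [v - mean for v in series]

dw = durbin_watson(resid)
a = lag1_autocorr(resid)
print(f"DW = {dw:.3f}, 2(1 - a) = {2 * (1 - a):.3f}")
```

The two printed quantities differ only by a small end-point term, which is why the DW stat carries essentially the same information as the lag-1 autocorrelation.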

Yikes! There is a rather serious problem here: the residuals clearly have a very strong pattern of positive autocorrelation--notice the long runs of errors with the same sign--which is perhaps a result of the INCOME variable not fully explaining the cyclical variations in AUTOADJ that we commented upon at the outset.

We might have noticed this problem earlier, and been in a better position to deal with it, if we had used the Forecasting procedure instead. The Forecasting procedure includes many more tools for manipulating and analyzing time series data. To fit a regression model in the Forecasting procedure, set the Model Type to "Mean" and then hit the "Regression" button. You then have an opportunity to specify independent variables to be added to the forecasting equation (in addition to a constant term). One thing to watch out for: if you are going to use regressor (independent) variables, you cannot request any forecasts for the future to be generated unless values for the regressors are available for those periods. In this case, we do not have any future data on INCOME/CPI (a minor problem to which we shall return later), so we will not generate any forecasts into the future. However, we will hold out 26 values for validation, so that only data prior to 1994 is used to fit the model, as in the Multiple Regression procedure. The Analysis Summary report shows many of the same summary statistics that we saw before, with the notable exception of R-squared--which is no great loss! Of course, the estimated coefficients and error statistics in the estimation period are exactly the same as before.

One thing that this report includes which we did not see before is a comparison of the model's performance in the estimation and validation periods: the Mean Absolute Error in the estimation period is 1.64644, whereas it rises to 3.05438 in the validation period. The truth about the model's performance in the validation period is even worse, as we see when we look at a plot of the actual values and forecasts:

Not only are the errors bigger, on average, in the validation period than in the estimation period, but in fact every single forecast in the validation period (1994 and beyond) is significantly below the actual value, and getting farther away as time goes on. This is serial correlation with a vengeance! (By the way, you may notice that this plot looks an awful lot like the Multiple XY plot of the two input variables that we drew earlier. In fact, it is exactly the same as the earlier plot except that the INCOME/CPI variable has merely been rescaled and labeled as the "forecast" for AUTOADJ/CPI: this is precisely what goes on in a simple regression model.)

The Forecasting procedure of course includes an autocorrelation plot of the residuals, so we can see the full dimension of the problem:

This is as bad a residual autocorrelation plot as you would ever hope to see!

As if we haven't heaped enough abuse on this poor model, there is one more unflattering comparison we can make: let's compare it to some of the "boring" time-series models we have considered previously: the random walk (with and without growth), simple exponential smoothing, and linear exponential smoothing models. The Model Comparison Report shows something remarkable: all of the simple time series models dramatically outperform the regression model, despite the latter's impressive R-squared! The two best models appear to be the exponential smoothing models: the simple exponential smoothing model does slightly better in the estimation period, while the linear exponential smoothing model does slightly better in the validation period, perhaps because of the consistent upward trend in the latter period. (The growth term does not appear to add much to the random walk model, although presumably it would be better for longer-horizon forecasts.) For example, the MAE for the linear exponential smoothing model in the validation period is only 0.84, versus 3.05 for the regression model. Here is a plot of the forecasts for the linear exponential smoothing model:
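For readers who want to try this kind of comparison outside Statgraphics, here is a rough Python sketch of the mechanics: fit each model on the early data, hold out the last 26 points, and compare the MAE in the two periods. Everything in it is an assumption made for illustration: the data are simulated (a random walk with drift, plus an unrelated predictor that merely shares an upward trend), the smoothing constant is picked arbitrarily rather than optimized, and the function names are our own. It will not reproduce the numbers in the report above.

```python
import random


def mae(actual, forecast):
    """Mean absolute error of paired actual and forecast values."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


def random_walk_forecasts(y):
    """One-step-ahead forecasts for y[1:]: each is the previous actual value."""
    return y[:-1]


def ses_forecasts(y, alpha):
    """One-step-ahead simple exponential smoothing forecasts for y[1:]."""
    level, out = y[0], []
    for value in y[:-1]:
        level = alpha * value + (1 - alpha) * level
        out.append(level)
    return out


# Simulated data, for illustration only: y is a random walk with drift and
# x is an unrelated series that merely shares an upward trend.
random.seed(0)
n, holdout = 314, 26
y, x, level_y, level_x = [], [], 100.0, 50.0
for _ in range(n):
    level_y += 0.5 + random.gauss(0.0, 1.0)
    level_x += 1.0 + random.gauss(0.0, 1.0)
    y.append(level_y)
    x.append(level_x)

split = n - holdout
a, b = fit_line(x[:split], y[:split])        # estimated on early data only
reg = [a + b * xi for xi in x]               # regression "forecasts"
rw = [y[0]] + random_walk_forecasts(y)       # pad so indices line up with y
ses = [y[0]] + ses_forecasts(y, alpha=0.7)   # alpha chosen arbitrarily

for name, f in [("regression", reg), ("random walk", rw),
                ("simple exp. smoothing", ses)]:
    print(f"{name:22s} estimation MAE {mae(y[1:split], f[1:split]):6.2f}"
          f"   validation MAE {mae(y[split:], f[split:]):6.2f}")
```

Note that the time series forecasts here are one-step-ahead forecasts that use each period's actual value once it is known, while the regression uses the actual value of the predictor in every period, including the hold-out periods.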

...and here are the residual autocorrelations--rather more satisfactory than those of the regression model!

The morals of this story (so far) are:

  1. A high R-squared does not necessarily signify a "good" regression model for forecasting purposes.
  2. A simple time series model may strongly outperform a more complicated regression model.

What went wrong with the regression model? A variety of excuses might be found to explain its poor performance. For example, we noticed at the very beginning that the INCOME variable did not seem to have the same kind of cyclical behavior as the AUTOSALE variable--perhaps there are other economic indicator variables that could be added to the regression model to better capture the cyclicality. This is the "omitted variable" explanation which is frequently invoked to explain poor regression model performance.

But there are some deeper lessons here. One is that it is often perilous to regress one nonstationary time series on another nonstationary time series, particularly if both have significant trends. No doubt you will obtain a high R-squared, but this does not necessarily mean anything in such a case. Recall that R-squared represents the "percent of variance explained" in the dependent variable. Now, a variable which is nonstationary--e.g., a variable which is a true random walk and/or has a persistent trend--does not have a "true" variance. The sample variance merely grows as the sample size grows, and if the sample size went to infinity (e.g., if we considered the "asymptotic" properties of the model), the variance would also go to infinity. Since the whole concept of a well-defined variance is questionable for such a series, the concept of "percent of variance explained" is questionable as well.
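The claim that the sample variance of a random walk keeps growing with the sample size can be checked with a small simulation. The Python sketch below is only an illustration under simple assumptions (a pure random walk with standard normal steps, averaged over 200 simulated paths); the function names are our own.

```python
import random


def sample_variance(values):
    """Ordinary sample variance with an n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def mean_sample_variance(n, paths=200, seed=0):
    """Average sample variance over simulated random walks of length n."""
    rng = random.Random(seed)
    total = 0.0
    for _path in range(paths):
        level, walk = 0.0, []
        for _step in range(n):
            level += rng.gauss(0.0, 1.0)   # each step is an independent shock
            walk.append(level)
        total += sample_variance(walk)
    return total / paths


for n in (50, 200, 800):
    avg = mean_sample_variance(n)
    print(f"n = {n:4d}: average sample variance = {avg:8.1f}")
```

For a stationary series the printed values would settle down around a fixed number as n increases. For the random walk they keep climbing, so there is no fixed "total variance" for a model to explain a percentage of.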

For example, take any two series with strong upward trends, say, U.S. retail sales of automobiles (in nominal dollars) and the population of Pakistan. If you compute their coefficient of correlation (i.e., "r"), it may be greater than 0.95. And if you regress one on the other, you may get an R-squared greater than 90%. Does this mean one is a good predictor of the other? Obviously not--the high R-squared merely means that one series with a trend is much better predicted by another series with a trend than by a "constant" model. (Remember that R-squared essentially measures the reduction in variance compared with the constant model.) But, in such a case, you could probably do even better--perhaps very much better--by using a model that predicted the series from its own history, such as a random walk, exponential smoothing, or ARIMA model.
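The same point can be seen with simulated data. In the Python sketch below, two series are generated independently of each other, so neither can be of any real use in predicting the other, yet both drift upward. The series, drift rates, and names are assumptions chosen for illustration; they stand in for any pair of trending variables.

```python
import random


def pearson_r(x, y):
    """Coefficient of correlation between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


# Two simulated series that are independent by construction but both drift
# upward (hypothetical data, not real sales or population figures).
rng = random.Random(42)
a_level, b_level, series_a, series_b = 0.0, 0.0, [], []
for _ in range(300):
    a_level += 1.0 + rng.gauss(0.0, 1.0)
    b_level += 0.3 + rng.gauss(0.0, 0.5)
    series_a.append(a_level)
    series_b.append(b_level)

r_levels = pearson_r(series_a, series_b)

# Period-to-period changes remove the shared trend.
diff_a = [series_a[t] - series_a[t - 1] for t in range(1, len(series_a))]
diff_b = [series_b[t] - series_b[t - 1] for t in range(1, len(series_b))]
r_changes = pearson_r(diff_a, diff_b)

# In a simple regression, R-squared is the square of r.
print(f"correlation of levels:  {r_levels:.3f} "
      f"(R-squared {r_levels ** 2:.3f})")
print(f"correlation of changes: {r_changes:.3f}")
```

The correlation between the levels comes out very high purely because of the shared trend, while the correlation between the period-to-period changes is close to zero, which is the honest answer for two unrelated series.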

Does this mean that regression is not a useful forecasting technique? Not at all! It just means that when you are working with time series data, you need to be aware that a regression model may fail to exploit the "time dimension" unless the variables are carefully chosen. In particular, you may wish to consider using lagged and/or differenced variables in the forecasting equation, so that some of the history of the dependent and/or independent variables, as well as their current values, is used in the forecast.
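As a rough illustration of what using lagged and differenced variables looks like in practice, the Python sketch below regresses this period's change in y on last period's change in x. The data are simulated under the assumption that such a lagged relationship really exists (with a true coefficient of 0.6), and the helper names are hypothetical; this is a sketch of the mechanics, not a model of auto sales.

```python
import random


def difference(series):
    """First difference: the change from each period to the next."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


def align_with_lag(target, predictor, lag):
    """Pair target[t] with predictor[t - lag] wherever both exist."""
    return predictor[:len(predictor) - lag], target[lag:]


def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b


# Simulated data (an assumption for illustration): the change in y depends
# on the PREVIOUS period's change in x, with a true coefficient of 0.6.
rng = random.Random(7)
x, y, prev_dx = [50.0], [100.0], 0.0
for _ in range(300):
    dx = 0.2 + rng.gauss(0.0, 1.0)
    dy = 0.1 + 0.6 * prev_dx + rng.gauss(0.0, 0.5)
    x.append(x[-1] + dx)
    y.append(y[-1] + dy)
    prev_dx = dx

dx_lagged, dy_now = align_with_lag(difference(y), difference(x), lag=1)
a, b = fit_line(dx_lagged, dy_now)
print(f"estimated coefficient on lagged change in x: {b:.3f}")

# A forecast of next period's y needs only values that are already known.
next_y = y[-1] + a + b * difference(x)[-1]
print(f"last y = {y[-1]:.2f}, forecast of next y = {next_y:.2f}")
```

One practical attraction of a lagged predictor is visible in the last step: the forecast for the next period uses only values of x that have already been observed, so it does not require future values of the regressor.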