What to look for in regression model output


Plots of forecasts and residuals
Error measures (RMSE, MAE, MAPE, etc.)
Out-of-sample validation
R-squared (not!)
Significance of the estimated coefficients
Other patterns in the estimated coefficients


(i) Plots of forecasts and residuals: Do the forecasts "track" the data in a satisfactory way, especially toward the end of the time series? Do the residuals appear random with a mean value of zero, especially toward the end of the time series? Are they free from trends, autocorrelation, and heteroscedasticity? Are they more-or-less normally distributed?

If autocorrelation is a problem, you should probably consider changing the model so as to implicitly or explicitly include lagged variables--e.g., try stationarizing the dependent and independent variables via differencing, or add lags of the dependent and/or independent variables to the regression equation, or introduce an autoregressive error correction.
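If you are working outside of SGWIN, the same diagnose-and-remedy loop can be sketched in Python with the statsmodels package. The data and variable names below are invented purely for illustration; the point is simply to show one way of checking residual autocorrelation (here via the Durbin-Watson statistic) and then refitting with a lagged dependent variable added:

```python
# A rough sketch, not SGWIN: simulate a regression with autocorrelated errors,
# check the residuals, and refit with LAG(Y,1) added as a regressor.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
n = 120
x = rng.normal(size=n)
e = np.zeros(n)
for t in range(1, n):                 # AR(1) errors to induce autocorrelation
    e[t] = 0.8 * e[t - 1] + rng.normal()
df = pd.DataFrame({"y": 2 + 0.5 * x + e, "x": x})

fit = sm.OLS(df["y"], sm.add_constant(df[["x"]])).fit()
print("Durbin-Watson:", durbin_watson(fit.resid))   # well below 2 => positive autocorrelation

# One possible remedy: add the lagged dependent variable and refit.
df["y_lag1"] = df["y"].shift(1)
refit = sm.OLS(df["y"].iloc[1:], sm.add_constant(df[["x", "y_lag1"]].iloc[1:])).fit()
print("Durbin-Watson with LAG(Y,1):", durbin_watson(refit.resid))
```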

In SGWIN's Forecasting procedure, you can difference the dependent variable by selecting ARIMA as the model type and specifying one order of non-seasonal differencing. This affects only the dependent variable: you still have to "manually" difference the independent variables, if desired. Alas, if you do difference the independent variables, you will have to drop any rows with missing observations to get around the missing-value bug that affects time series regression models. I suggest that you first use the Edit/Generate_Data command and the DIFF( ) function to create new columns on the spreadsheet containing the differenced variables, and use the right-mouse-button option to assign them descriptive names such as DIFFX1, DIFFX2, or whatever. Then delete the entire first row from the spreadsheet to get rid of the missing values, return to the Forecasting procedure, and select the differenced variables by name.
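If you are working in a spreadsheet-like tool such as pandas instead, the same workflow--create descriptively named differenced columns, then drop the row of missing values--looks roughly like this (the column names are placeholders):

```python
# A minimal pandas sketch of the DIFF-and-drop-the-first-row workflow.
import pandas as pd

data = pd.DataFrame({"Y":  [10, 12, 13, 15, 18],
                     "X1": [1.0, 1.5, 1.4, 2.0, 2.6],
                     "X2": [5, 6, 8, 7, 9]})

# Analogous to DIFF(X1) and DIFF(X2), stored under descriptive names.
data["DIFFX1"] = data["X1"].diff()
data["DIFFX2"] = data["X2"].diff()

# Differencing leaves a missing value in the first row; drop it before fitting.
data = data.dropna().reset_index(drop=True)
print(data)
```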

If heteroscedasticity and/or non-normality is a problem, you may wish to consider a nonlinear transformation of the dependent variable, such as logging or deflating, if such transformations are appropriate for your data. (Remember that logging converts multiplicative relationships to additive relationships, so that when you log the dependent variable, you are implicitly assuming that the relationships among the original variables are multiplicative.)
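Here is a rough illustration, again in Python rather than SGWIN, of what fitting the model to LOG(Y) involves; the simulated data and names are hypothetical, and the main point is that the forecasts come out in log units and must be exponentiated afterward:

```python
# Sketch: log the dependent variable when the errors look multiplicative,
# then untransform the fitted values to return to the original units.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = np.linspace(1, 10, 100)
y = np.exp(0.3 * x + rng.normal(scale=0.2, size=x.size))   # multiplicative noise
df = pd.DataFrame({"y": y, "x": x})

# Logging Y implicitly assumes the relationships are multiplicative
# in the original units.
logfit = sm.OLS(np.log(df["y"]), sm.add_constant(df[["x"]])).fit()

# Forecasts are in log units; exponentiate to get back to the original scale.
fitted_original_units = np.exp(logfit.fittedvalues)
```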

Note: You do not usually rank (i.e., choose among) models on the basis of their residual diagnostic tests, but bad residual diagnostics indicate that the model's error measures may be unreliable and that there are probably better models out there somewhere. (Return to top of page.)


(ii) Error measures (RMSE, MAE, MAPE, etc.): Does the current regression model improve on the best naive (random walk or random trend) model for the same time series? Does it improve on the best model previously fitted? In SGWIN's Forecasting procedure, the Model Comparison Report makes it easy to compare error measures between different regression models, even models that use different transformations of the dependent variable (e.g., logging, differencing, deflation, etc.).

If instead you use generic regression software (e.g., the simple or multiple regression procedures on the Relate menu in SGWIN) you can compare error measures between models only if their units are comparable and they are fitted to the same set of observations of the dependent variable. Thus, for example, you cannot directly compare the standard error of the estimate (root-mean-square error) between a model fitted to Y and another model fitted to LOG(Y) in the generic regression procedure.
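If you are computing error measures by hand (say, in Python), the comparison against a naive random-walk model looks roughly like the sketch below; the data and variable names are made up, and the only requirement is that both sets of errors be measured in the same untransformed units over the same observations:

```python
# Sketch: compare RMSE, MAE, and MAPE of a regression model against the
# naive random-walk forecast for the same series.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def error_measures(actual, forecast):
    e = actual - forecast
    return {"RMSE": np.sqrt(np.mean(e ** 2)),
            "MAE":  np.mean(np.abs(e)),
            "MAPE": 100 * np.mean(np.abs(e / actual))}

rng = np.random.default_rng(2)
n = 100
x = np.cumsum(rng.normal(size=n))
df = pd.DataFrame({"x": x, "y": 5 + 0.8 * x + rng.normal(size=n)})

reg = sm.OLS(df["y"], sm.add_constant(df[["x"]])).fit()
reg_forecast = reg.fittedvalues

rw_forecast = df["y"].shift(1)          # random walk: next value = current value
valid = rw_forecast.notna()             # same set of observations for both models

print("Regression: ", error_measures(df["y"][valid], reg_forecast[valid]))
print("Random walk:", error_measures(df["y"][valid], rw_forecast[valid]))
```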

A note on terminology: The "standard error of the estimate" in a regression model is the root-mean-squared error of the residuals, adjusted for the number of coefficients estimated. In an ARIMA model, the same statistic is called the "white noise standard deviation." If the dependent variable is nonlinearly transformed (e.g., logged or deflated) before estimating the model, then the standard error of the estimate or white noise standard deviation is computed from the transformed residuals. This standard error or standard deviation is used to calculate confidence intervals for the transformed forecasts, which are later untransformed to get confidence intervals in the original units of the data.
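As a concrete (and purely illustrative) sketch, here is how the standard error of the estimate can be computed from logged residuals and then used to build rough confidence limits that are untransformed back to the original units; the data are simulated and the two-standard-error limits are only an approximation:

```python
# Sketch: standard error of the estimate from transformed (logged) residuals,
# and rough 95% limits untransformed back to the original units.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 80)
y = np.exp(1.0 + 0.2 * x + rng.normal(scale=0.1, size=x.size))

fit = sm.OLS(np.log(y), sm.add_constant(x)).fit()

n, k = int(fit.nobs), len(fit.params)            # k includes the constant
s = np.sqrt(np.sum(fit.resid ** 2) / (n - k))    # root-mean-squared residual,
                                                 # adjusted for k estimated coefficients

# Approximate limits for the last fitted value, computed in log units and then
# exponentiated to return to the original units of the data.
log_fc = fit.fittedvalues[-1]
lower, upper = np.exp(log_fc - 2 * s), np.exp(log_fc + 2 * s)
print(s, lower, upper)
```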

In SGWIN's Forecasting procedure, the residual plots are plots of the transformed residuals, which are supposed to be free of heteroscedasticity. However, the MSE, RMSE, and MAE values shown on the Analysis Summary and Model Comparison reports are the statistics of the untransformed residuals and forecast errors, which can be compared between models that may have used different transformations. The Forecasting procedure does not display the standard error of the estimate or R-squared for a pure regression model, which is no great loss in light of the comparative error statistics which are displayed. (Return to top of page.)


(iii) Out-of-sample validation: If you have enough data to hold out a sizable portion for validation, compare the performance of the models in the validation as well as estimation periods. A good model should have small error measures in both the estimation and validation periods, compared to other models, and its validation period statistics should be similar to its own estimation period statistics. Regression models are especially susceptible to overfitting the data in the estimation period, so watch out for models that have suspiciously low error measures in the estimation period and suspiciously high error measures in the validation period.

(Note: If the variance of the errors in original, untransformed units is growing over time due to inflation or compound growth, then the best statistic to use for comparisons between the estimation and validation period is mean absolute percentage error, rather than mean squared error or mean absolute error.)
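A bare-bones version of this estimation-versus-validation comparison, sketched in Python with invented data and names, might look like this (here the last 20 observations are held out and MAPE is used as the yardstick):

```python
# Sketch: fit in the estimation period, score in the held-out validation period,
# and compare MAPE across the two periods.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def mape(actual, forecast):
    return 100 * np.mean(np.abs((actual - forecast) / actual))

rng = np.random.default_rng(4)
n = 120
x = np.linspace(1, 12, n)
df = pd.DataFrame({"x": x, "y": 50 + 4 * x + rng.normal(scale=3, size=n)})

est, val = df.iloc[:100], df.iloc[100:]          # hold out the last 20 observations
fit = sm.OLS(est["y"], sm.add_constant(est[["x"]])).fit()

est_fc = fit.predict(sm.add_constant(est[["x"]]))
val_fc = fit.predict(sm.add_constant(val[["x"]]))

print("Estimation-period MAPE:", mape(est["y"], est_fc))
print("Validation-period MAPE:", mape(val["y"], val_fc))
```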

Although the model's performance in the validation period is theoretically the best indicator of its forecasting accuracy, you should be aware that the hold-out sample may not always be highly representative, especially if it is small--say, less than 20 observations. If your validation period statistics appear strange or contradictory, you may wish to experiment by changing the number of observations held out. Sometimes the inclusion or exclusion of a few unusual observations can make a big difference in the comparative statistics of different models.

Also, be aware that if you test a large number of models and rigorously rank them on the basis of their validation period statistics, you may end up with just as much "data snooping bias" as if you had only looked at estimation-period statistics--i.e., you may end up picking a model that is more lucky than good! The best defense against this is to choose the simplest and most intuitively plausible model that gives comparatively good results. (Return to top of page.)


(iv) R-squared: This is the most over-used and abused of all statistics--don't get obsessed with it. All it measures is the percentage reduction in mean-squared-error that the regression model achieves relative to the naive model "Y=constant", which may or may not be the appropriate naive model for purposes of comparison. Better to determine the best naive model first, and then compare the various error measures of your regression model (both in the estimation and validation periods) against that naive model.
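To make this concrete, here is a small illustrative computation in Python: it shows that R-squared is just one minus the ratio of the model's mean squared error to that of the constant-only model, and that the naive random walk can be a much tougher benchmark when the data are non-stationary. The simulated series and names are hypothetical:

```python
# Sketch: R-squared measures improvement over "Y = constant", which may be the
# wrong benchmark; compare the model's MSE against a random walk as well.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
y = 100 + np.cumsum(rng.normal(size=200))     # a non-stationary (random-walk) series
x = np.cumsum(rng.normal(size=200))           # an unrelated trending regressor

fit = sm.OLS(y, sm.add_constant(x)).fit()

mse_model = np.mean(fit.resid ** 2)
mse_const = np.mean((y - y.mean()) ** 2)      # naive "Y = constant" model
mse_rw    = np.mean(np.diff(y) ** 2)          # naive random-walk model

print("R-squared (vs. Y = constant):", 1 - mse_model / mse_const)   # = fit.rsquared
print("Model MSE:", mse_model, " Random-walk MSE:", mse_rw)
```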

Even though R-squared is a unitless statistic, there is no absolute standard for what counts as a "good" value. A regression model fitted to non-stationary time series data can have an R-squared of 99% and yet be inferior to a simple random walk model. On the other hand, a regression model fitted to stationarized time series data might have an R-squared of 10%-20% and be considered quite good. When working with stationary stock return data, R-squared values as low as 5% might even be considered significant--if they hold up out-of-sample! (Return to top of page.)


(v) Significance of estimated coefficients: Are the t-statistics greater than 2 in magnitude, corresponding to p-values less than 0.05? (These statistics appear at the bottom of the Analysis Summary report in SGWIN.) If they are not, you should probably try to refit the model with the least significant variable excluded, which is the "backward stepwise" approach to model refinement.

Remember that the t-statistic is just the estimated coefficient divided by its own standard error. Thus, it measures "how many standard deviations from zero" the estimated coefficient is, and it is used to test the null hypothesis that the true value of the coefficient is zero, in order to determine whether the independent variable really belongs in the model.

The p-value is the probability of observing a t-statistic that large or larger in magnitude given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05--which occurs roughly when the t-statistic is less than 2 in absolute value--this means that an estimated coefficient of that size could plausibly have arisen by chance even if the true coefficient is zero, i.e., its apparent significance may be merely "accidental."
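The mechanics are easy to reproduce in any regression package; here is a rough Python sketch with made-up data in which one variable (x3) is irrelevant by construction, showing the t-statistics and p-values and one step of backward elimination:

```python
# Sketch: read t-statistics and p-values, then drop the least significant
# variable if its p-value exceeds 0.05 ("backward stepwise" refinement).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 100
df = pd.DataFrame({"x1": rng.normal(size=n),
                   "x2": rng.normal(size=n),
                   "x3": rng.normal(size=n)})
df["y"] = 3 + 2 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)   # x3 is irrelevant

X = sm.add_constant(df[["x1", "x2", "x3"]])
fit = sm.OLS(df["y"], X).fit()
print(fit.summary())                      # t = coefficient / its standard error

pvals = fit.pvalues.drop("const")
worst = pvals.idxmax()                    # least significant variable
if pvals[worst] > 0.05:
    reduced = sm.OLS(df["y"], X.drop(columns=worst)).fit()
    print("Dropped", worst, "- now compare the two models' error measures.")
```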

There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value of greater than 0.05 can be dropped from the model without affecting the error measures very much--try it and see! (Return to top of page.)


(vi) Other patterns in the estimated coefficients: Sometimes the interesting hypothesis is not whether the value of a certain coefficient is equal to zero, but whether it is equal to some other value. For example, if one of the independent variables is merely the dependent variable lagged by one period (i.e., an autoregressive term), then the interesting question is whether its coefficient is equal to one. If so, then the model is effectively predicting the difference in the dependent variable, rather than predicting its level, in which case you can simplify the model by differencing the dependent variable and deleting the lagged dependent variable from the list of independent variables.
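Many regression packages let you test a coefficient against a value other than zero. Here is a rough sketch using statsmodels' t_test with simulated data and hypothetical names, where the question is whether the coefficient on the lagged dependent variable equals one:

```python
# Sketch: test H0: coefficient on LAG(Y,1) equals 1.  If 1 cannot be rejected,
# the model is effectively predicting the change in Y, and differencing Y
# (and dropping the lag term) gives a simpler, nearly equivalent model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
df = pd.DataFrame({"y": 100 + np.cumsum(rng.normal(size=150))})  # random-walk-like
df["y_lag1"] = df["y"].shift(1)
df = df.dropna()

fit = sm.OLS(df["y"], sm.add_constant(df[["y_lag1"]])).fit()
print(fit.t_test("y_lag1 = 1"))     # t-statistic for (coefficient - 1) / std. error
```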

Sometimes patterns in the magnitudes and signs of lagged variables are of interest. For example, if both X and LAG(X,1) are included in the model, and their estimated coefficients turn out to have similar magnitudes but opposite signs, this suggests that they could both be replaced by a single DIFF(X) term.
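A quick way to check this, sketched below with invented data and names, is to fit the model with both X and LAG(X,1), inspect the two coefficients, and then refit with DIFF(X) alone to see whether the error measures are essentially unchanged:

```python
# Sketch: if X and LAG(X,1) enter with roughly equal and opposite coefficients,
# a single DIFF(X) term may do the same job with one less parameter.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(8)
n = 150
x = np.cumsum(rng.normal(size=n))
y = 2 + 1.5 * np.diff(x, prepend=x[0]) + rng.normal(scale=0.5, size=n)
df = pd.DataFrame({"y": y, "x": x})
df["x_lag1"] = df["x"].shift(1)
df["diff_x"] = df["x"].diff()
df = df.dropna()

both = sm.OLS(df["y"], sm.add_constant(df[["x", "x_lag1"]])).fit()
print(both.params)           # expect roughly +b on x and -b on x_lag1

single = sm.OLS(df["y"], sm.add_constant(df[["diff_x"]])).fit()
print(single.params)         # the simpler, nearly equivalent model
```

(Return to top of page.)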