Plots of forecasts and residuals

Error measures (RMSE, MAE, MAPE, etc.)

Out-of-sample validation

R-squared (not!)

Significance of the estimated coefficients

Other patterns in the estimated coefficients

**(i) Plots of forecasts and residuals:**
Do the forecasts "track" the data in a satisfactory
way, especially toward the end of the time series? Do the residuals
appear random with a mean value of zero, especially toward the
end of the time series? Are they free from trends, autocorrelation,
and heteroscedasticity? Are they more-or-less normally distributed?
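
For readers working outside of SGWIN, a minimal Python sketch of these residual checks might look like the following; the residuals here are simulated placeholders, so substitute your own model's residuals:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Illustrative residuals -- replace with your own model's residuals.
resid = np.random.default_rng(0).normal(0.0, 1.0, 100)

# Plot in time order: look for trends, drift away from a zero mean
# (especially near the end), and a fanning-out spread (heteroscedasticity).
plt.plot(resid, marker=".")
plt.axhline(0.0, color="gray")
plt.title("Residuals vs. time")
plt.show()

print("mean:", resid.mean())                   # should be near zero
print("Durbin-Watson:", durbin_watson(resid))  # near 2.0 if no lag-1 autocorrelation
jb_stat, jb_p = stats.jarque_bera(resid)
print("Jarque-Bera p-value:", jb_p)            # small p-value suggests non-normality
```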

If autocorrelation is a
problem, you should probably consider
changing the model so as to implicitly or explicitly include *lagged
variables*--e.g., try stationarizing the dependent and independent
variables via differencing, or add lags of the dependent and/or
independent variables to the regression equation, or introduce
an autoregressive error correction.
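
As a minimal sketch of the first two remedies in Python (the data, variable names, and coefficients below are all made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Illustrative data: a trending regressor, and a dependent variable
# that tracks it with strongly autocorrelated (random walk) errors.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": np.cumsum(rng.normal(1.0, 1.0, 100))})
df["y"] = 2.0 * df["x"] + np.cumsum(rng.normal(0.0, 1.0, 100))

# Remedy 1: stationarize both variables by first differencing.
diffed = df.diff().dropna()
model1 = sm.OLS(diffed["y"], sm.add_constant(diffed["x"])).fit()

# Remedy 2: add lags of the dependent and/or independent variables.
lagged = sm.add_constant(pd.DataFrame({
    "x":      df["x"],
    "y_lag1": df["y"].shift(1),
    "x_lag1": df["x"].shift(1),
}).dropna())
model2 = sm.OLS(df["y"].iloc[1:], lagged).fit()

# Durbin-Watson near 2.0 indicates the lag-1 autocorrelation is gone.
print(durbin_watson(model1.resid), durbin_watson(model2.resid))
```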

In SGWIN's Forecasting
procedure, you can difference the *dependent*
variable by selecting ARIMA as the model type and specifying one
order of non-seasonal differencing. This affects *only* the
dependent variable: you still have to "manually" difference
the independent variables, if desired. Alas, if you *do*
difference the independent variables, you will have to drop any
rows with missing observations to get around the missing-value
bug that affects time series regression models. I suggest that
you first use the Edit/Generate_Data command and the DIFF( ) function
to create new columns on the spreadsheet containing the differenced
variables, and use the right-mouse-button option to assign them
descriptive names such as DIFFX1, DIFFX2, or whatever. Then delete
the entire first row from the spreadsheet to get rid of the missing
values, return to the Forecasting procedure, and select the differenced
variables by name.
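
For readers doing this outside of SGWIN, a hypothetical pandas equivalent of the same generate-then-delete workflow looks like this (column names and values are made up):

```python
import pandas as pd

# Hypothetical spreadsheet data.
df = pd.DataFrame({
    "Y":  [112.0, 118.0, 132.0, 129.0, 121.0, 135.0],
    "X1": [10.1, 10.4, 11.0, 10.8, 10.5, 11.2],
    "X2": [5.3, 5.5, 5.9, 5.8, 5.6, 6.0],
})

# Equivalent of Edit/Generate_Data with the DIFF( ) function:
df["DIFFY"]  = df["Y"].diff()
df["DIFFX1"] = df["X1"].diff()
df["DIFFX2"] = df["X2"].diff()

# Equivalent of deleting the first spreadsheet row: drop the row of
# missing values that differencing creates.
df = df.dropna().reset_index(drop=True)
print(df)
```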

If heteroscedasticity and/or non-normality is a problem, you may wish to consider a nonlinear transformation of the dependent variable, such as logging or deflating, if such transformations are appropriate for your data. (Remember that logging converts multiplicative relationships to additive relationships, so that when you log the dependent variable, you are implicitly assuming that the relationships among the original variables are multiplicative.)
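
Here is a small numerical illustration of that point, with made-up parameter values: a multiplicative relationship with multiplicative (percentage) errors becomes a straight-line relationship with roughly constant error variance once both variables are logged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=200)

# Multiplicative relationship with multiplicative (percentage) noise:
y = 3.0 * x ** 1.5 * np.exp(rng.normal(0.0, 0.1, size=200))

# After logging, the relationship is additive and linear:
#   log(y) = log(3.0) + 1.5 * log(x) + noise
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, np.exp(intercept))   # recovers ~1.5 and ~3.0
```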

Note: You do not usually *rank* (i.e., choose among) models
on the basis of their residual diagnostic tests, but bad residual
diagnostics indicate that the model's error measures may be unreliable
and that there are probably better models out there somewhere.

**(ii) Error measures (RMSE, MAE, MAPE, etc.):** Does the current regression model improve on the best
*naive* (random walk or random trend) model for the same
time series? Does it improve on the best model previously fitted?
In SGWIN's Forecasting procedure, the Model Comparison Report
makes it easy to compare error measures between different regression
models, even models that use different transformations of the
dependent variable (e.g., logging, differencing, deflation, etc.).
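
If you are computing the error measures yourself rather than reading them off the report, a sketch of the arithmetic (with made-up data, and `error_measures` as an illustrative helper rather than anything in SGWIN) is:

```python
import numpy as np

def error_measures(actual, forecast):
    """RMSE, MAE, and MAPE in the original units of the data."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    err = actual - forecast
    return {"RMSE": np.sqrt(np.mean(err ** 2)),
            "MAE":  np.mean(np.abs(err)),
            "MAPE": 100.0 * np.mean(np.abs(err / actual))}

# One-step-ahead naive (random walk) forecast: next value = current value.
y = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
print(error_measures(y[1:], y[:-1]))
# Compare these numbers against the same measures for your regression
# model's forecasts of periods 2..n; the regression should do better.
```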

If instead you use generic
regression software (e.g., the simple
or multiple regression procedures on the Relate menu in SGWIN)
you can compare error measures between models *only* if their
units are comparable and they are fitted to the same set of
observations
of the dependent variable. Thus, for example, you cannot directly
compare the standard error of the estimate (root-mean-square error)
between a model fitted to Y and another model fitted to LOG(Y)
in the generic regression procedure.

A note on terminology: The
"standard error of the estimate"
in a regression model is the *root-mean-squared error of the
residuals, adjusted for the number of coefficients estimated*.
In an ARIMA model, the same statistic is called the "white
noise standard deviation." If the dependent variable is nonlinearly
transformed (e.g., logged or deflated) before estimating the
model, then the standard error of the estimate or white noise
standard deviation is computed from the *transformed* residuals.
This standard error or standard deviation is used to calculate
confidence intervals for the transformed forecasts, which are
later untransformed to get confidence intervals in the original
units of the data.
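
As a sketch of the arithmetic just described (not SGWIN's own code, and with illustrative numbers):

```python
import numpy as np

def standard_error_of_estimate(residuals, n_coefficients):
    """Root-mean-squared error of the residuals, adjusted for the
    number of coefficients estimated:  s = sqrt(SSE / (n - k))."""
    residuals = np.asarray(residuals)
    return np.sqrt(np.sum(residuals ** 2) / (residuals.size - n_coefficients))

# If the model was fitted to LOG(Y), the interval is first computed in
# log units and then untransformed back to the original units:
s = 0.05               # illustrative standard error in log units
log_forecast = 4.9     # illustrative point forecast of LOG(Y)
lower, upper = np.exp(log_forecast - 2 * s), np.exp(log_forecast + 2 * s)
print(np.exp(log_forecast), lower, upper)   # note the asymmetric interval
```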

In SGWIN's Forecasting
procedure, the residual *plots* are
plots of the *transformed* residuals, which are supposed
to be free of heteroscedasticity. However, the MSE, RMSE, and
MAE values shown on the Analysis Summary and Model Comparison
reports are the statistics of the *untransformed* residuals
and forecast errors, which can be compared between models that
may have used different transformations. The Forecasting procedure
does not display the standard error of the estimate or R-squared
for a pure regression model, which is no great loss in light of
the comparative error statistics which *are* displayed.

**(iii) Out-of-sample validation:**
If you have enough data to hold out a sizable portion for validation,
compare the performance of the models in the validation as well
as estimation periods. A good model should have small error measures
in *both* the estimation and validation periods, compared
to other models, and its validation period statistics should be
similar to its own estimation period statistics. Regression models
are especially susceptible to overfitting the data in the estimation
period, so watch out for models that have suspiciously low error
measures in the estimation period and suspiciously high error
measures in the validation period.
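
A minimal sketch of such a comparison (`holdout_comparison` is an illustrative helper, not an SGWIN feature):

```python
import numpy as np
import statsmodels.api as sm

def holdout_comparison(y, X, n_holdout):
    """Fit on the estimation period only, then compare RMSE between
    the estimation and validation periods."""
    y = np.asarray(y, dtype=float)
    X = sm.add_constant(np.asarray(X, dtype=float))
    fit = sm.OLS(y[:-n_holdout], X[:-n_holdout]).fit()
    rmse = lambda e: np.sqrt(np.mean(e ** 2))
    return {"estimation RMSE": rmse(y[:-n_holdout] - fit.predict(X[:-n_holdout])),
            "validation RMSE": rmse(y[-n_holdout:] - fit.predict(X[-n_holdout:]))}

# Illustrative usage; a validation RMSE much larger than the estimation
# RMSE is the overfitting signature described above.
rng = np.random.default_rng(0)
x = rng.normal(size=60)
y = 3.0 + 2.0 * x + rng.normal(size=60)
print(holdout_comparison(y, x, n_holdout=20))
```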

(Note: If the variance of the
errors in original, *untransformed*
units is growing over time due to inflation or compound growth,
then the best statistic to use for comparisons between the estimation
and validation period is mean absolute *percentage* error,
rather than mean squared error or mean absolute error.)

Although the model's
performance in the validation period is *theoretically*
the best indicator of its forecasting accuracy, you should be
aware that the hold-out sample may not always be highly representative,
especially if it is small--say, less than 20 observations. If
your validation period statistics appear strange or contradictory,
you may wish to experiment by changing the number of observations
held out. Sometimes the inclusion or exclusion of a few unusual
observations can make a big difference in the comparative statistics
of different models.

Also, be aware that if you test
a *large number of models*
and rigorously rank them on the basis of their validation period
statistics, you may end up with just as much "data snooping
bias" as if you had only looked at estimation-period statistics--i.e.,
you may end up picking a model that is more lucky than good! The
best defense against this is to choose the *simplest* and
most *intuitively plausible* model that gives comparatively
good results.

**(iv) R-squared:** This is the most
over-used and abused of all statistics--don't get obsessed with
it. All it measures is the percentage reduction in mean-squared-error
that the regression model achieves relative to the naive model
"Y=constant", which may or may not be the appropriate
naive model for purposes of comparison. Better to determine the
*best* naive model first, and then compare the various error
measures of your regression model (both in the estimation and
validation periods) against that naive model.
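
To make the definition concrete, here is a sketch of R-squared computed as exactly that reduction in mean-squared-error, together with the same reduction measured against a random walk instead (both helper functions are illustrative):

```python
import numpy as np

def r_squared(actual, forecast):
    """Fractional reduction in MSE relative to the naive model
    Y = constant (the mean of Y)."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    mse_model    = np.mean((actual - forecast) ** 2)
    mse_constant = np.mean((actual - actual.mean()) ** 2)
    return 1.0 - mse_model / mse_constant

def reduction_vs_random_walk(actual, forecast):
    """The same comparison against a random walk: each period's naive
    forecast is simply the previous period's actual value."""
    actual, forecast = np.asarray(actual), np.asarray(forecast)
    mse_model = np.mean((actual[1:] - forecast[1:]) ** 2)
    mse_rw    = np.mean((actual[1:] - actual[:-1]) ** 2)
    return 1.0 - mse_model / mse_rw
```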

Even though R-squared is a unitless statistic, there is no absolute standard for what counts as a "good" value. A regression model fitted to non-stationary time series data can have an R-squared of 99% and yet be inferior to a simple random walk model. On the other hand, a regression model fitted to stationarized time series data might have an R-squared of 10%-20% and be considered quite good. When working with stationary stock return data, R-squared values as low as 5% might even be considered significant--if they hold up out-of-sample!

**(v) Significance of estimated coefficients:**
Are the t-statistics greater than 2 in magnitude, corresponding
to p-values less than 0.05? (These statistics appear at the bottom
of the Analysis Summary report in SGWIN.) If they are not, you
should probably try to *refit the model with the least significant
variable excluded*, which is the "backward stepwise"
approach to model refinement.

Remember that the t-statistic
is just the *estimated coefficient
divided by its own standard error*. Thus, it measures "how
many standard deviations from zero" the estimated coefficient
is, and it is used to test the null hypothesis that the true value of the coefficient is zero--rejecting that hypothesis confirms that the independent variable really belongs in the model.

The p-value is the probability of observing a t-statistic that large or larger in magnitude given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05--which occurs roughly when the t-statistic is less than 2 in absolute value--this means that the coefficient may be only "accidentally" significant.

There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value of greater than 0.05 can be dropped from the model without affecting the error measures very much--try it and see!
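
A sketch of this backward-stepwise idea in Python (an illustrative helper, not SGWIN's own algorithm), assuming `y` is the dependent variable and `X` is a DataFrame of candidate regressors:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_stepwise(y, X, p_to_remove=0.05):
    """Repeatedly drop the least significant regressor until all
    remaining p-values are at or below the threshold (or only one
    regressor remains)."""
    X = X.copy()
    while True:
        fit = sm.OLS(y, sm.add_constant(X)).fit()
        pvals = fit.pvalues.drop("const")   # t-stat = coefficient / its std error
        worst = pvals.idxmax()              # least significant variable
        if pvals[worst] <= p_to_remove or X.shape[1] == 1:
            return fit
        X = X.drop(columns=worst)

# Illustrative usage with one real and one irrelevant regressor:
rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80)})
y = 1.0 + 2.0 * X["x1"] + rng.normal(size=80)   # x2 does not belong
print(backward_stepwise(y, X).params)
```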

**(vi) Other patterns in the estimated coefficients:** Sometimes the interesting hypothesis is not whether the value of a certain coefficient is equal to zero, but whether it is equal to some other value. For example, if one of the
independent variables is merely the dependent variable lagged
by one period (i.e., an autoregressive term), then the interesting
question is whether its coefficient is equal to *one*. If
so, then the model is effectively predicting the *difference*
in the dependent variable, rather than predicting its level, in
which case you can simplify the model by differencing the dependent
variable and deleting the lagged version of itself from the list
of independent variables.
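
As a sketch of that test (with simulated data; all names are illustrative), the relevant statistic is the estimated coefficient minus one, divided by its standard error:

```python
import numpy as np
import statsmodels.api as sm

# Simulated random walk with drift, regressed on its own lag:
y = np.cumsum(np.random.default_rng(1).normal(0.5, 1.0, 100))
fit = sm.OLS(y[1:], sm.add_constant(y[:-1])).fit()

b, se = fit.params[1], fit.bse[1]
t_vs_one = (b - 1.0) / se
print(b, t_vs_one)   # |t| < 2 suggests the coefficient is indistinguishable from 1

# Equivalently, as a Wald test of the restriction "coefficient = 1":
print(fit.t_test("x1 = 1"))
```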

Sometimes patterns in the
magnitudes and *signs* of lagged
variables are of interest. For example, if both X and LAG(X,1)
are included in the model, and their estimated coefficients turn
out to have similar magnitudes but opposite signs, this suggests
that they could both be replaced by a single DIFF(X) term.
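
A small simulated illustration of this pattern (all names and parameter values are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = np.cumsum(rng.normal(size=200))
y = 2.0 * np.diff(x) + rng.normal(0.0, 0.5, size=199)   # truth: y depends on DIFF(X)

# Fit with both X and LAG(X,1): coefficients come out near +2 and -2.
X_both = sm.add_constant(np.column_stack([x[1:], x[:-1]]))
print(sm.OLS(y, X_both).fit().params)    # approximately [0, +2, -2]

# Equal-and-opposite coefficients suggest the simpler DIFF(X) model:
X_diff = sm.add_constant(np.diff(x))
print(sm.OLS(y, X_diff).fit().params)    # approximately [0, +2]
```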