What's the bottom line? How to compare models

After fitting a number of different statistical forecasting models to a given data set, you usually have a wealth of criteria by which they can be compared.

With so many plots and statistics and considerations to worry about, it's sometimes hard to know which comparisons are most important. What's the real bottom line?

If there is any one statistic that normally takes precedence over the others, it is the mean squared error (MSE) within the estimation period, or equivalently its square root, the root mean squared error (RMSE). The latter quantity is also known as the standard error of the estimate in regression analysis or the estimated white noise standard deviation in ARIMA analysis. This is the statistic whose value is minimized during the parameter estimation process, and it is the statistic that determines the width of the confidence intervals for predictions. A 95% confidence interval for a one-step-ahead forecast is approximately equal to the point forecast "plus or minus 2 standard errors"--i.e., plus or minus 2 times the RMSE.
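The "plus or minus 2 standard errors" rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values and the function names rmse and approx_95_interval are made up for the example, and the RMSE here divides by the number of observations rather than by the degrees of freedom, so it will be slightly smaller than the standard error that regression software reports.

```python
import math

def rmse(actual, fitted):
    """Root mean squared error, dividing by n (no degrees-of-freedom adjustment)."""
    errors = [a - f for a, f in zip(actual, fitted)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

def approx_95_interval(point_forecast, rmse_value):
    """Rough 95% interval: point forecast plus or minus 2 times the RMSE."""
    half_width = 2.0 * rmse_value
    return point_forecast - half_width, point_forecast + half_width

# Hypothetical estimation-period data and one-step-ahead fitted values.
actual = [102.0, 98.0, 105.0, 110.0]
fitted = [100.0, 100.0, 104.0, 108.0]

model_rmse = rmse(actual, fitted)
low, high = approx_95_interval(112.0, model_rmse)  # 112.0 is a made-up point forecast
print(f"RMSE = {model_rmse:.3f}, approximate 95% interval = ({low:.2f}, {high:.2f})")
```

The interval is symmetric around the point forecast, and halving the RMSE halves its width, which is why this one statistic carries so much weight.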

Having planted this stake in the ground, there are several observations and qualifications that need to be made:

So... the bottom line is that you should put the most weight on the error measures in the estimation period--most often the RMSE, but sometimes MAE or MAPE--when comparing among models. (If your software is capable of computing them, you may also want to look at Cp, AIC or BIC.) But you should keep an eye on the validation-period results, residual diagnostic tests, and qualitative considerations such as the intuitive reasonableness and simplicity of your model. The residual diagnostic tests are not the bottom line--you should never choose Model A over Model B merely because Model A got more "OK's" on its residual tests. (What would you rather have: smaller errors or more random-looking errors?) A model which fails some of the residual tests or reality checks in only a minor way is probably subject to further improvement, whereas it is the model which flunks such tests in a major way that cannot be trusted.
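To make the comparison concrete, here is a minimal sketch that computes RMSE, MAE and MAPE for two candidate models over the same estimation period. The data, the model names and the helper error_measures are hypothetical, and the formulas are the usual textbook definitions rather than the output of any particular package. The numbers were chosen so that the measures disagree: RMSE penalizes the single large error of Model B heavily, while MAE and MAPE do not.

```python
import math

def error_measures(actual, fitted):
    """Return RMSE, MAE and MAPE (in percent) for one model's fitted values."""
    errors = [a - f for a, f in zip(actual, fitted)]
    n = len(errors)
    return {
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        # MAPE is only meaningful when every actual value is nonzero.
        "MAPE": 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n,
    }

# Hypothetical estimation-period data and fitted values for two models.
actual = [100.0, 200.0, 400.0, 500.0]
fits = {
    "Model A": [110.0, 190.0, 410.0, 490.0],  # moderate errors everywhere
    "Model B": [100.0, 200.0, 400.0, 470.0],  # perfect except one large miss
}

summary = {name: error_measures(actual, fitted) for name, fitted in fits.items()}
for name, m in summary.items():
    print(f"{name}: RMSE={m['RMSE']:.2f}  MAE={m['MAE']:.2f}  MAPE={m['MAPE']:.2f}%")
```

In this made-up case Model A wins on RMSE while Model B wins on MAE and MAPE, so which measure you weight most depends on how much an occasional large error matters to you.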

The validation-period results are not necessarily the last word either, because of the issue of sample size: if Model A is slightly better in a validation period of size 10 while Model B is much better over an estimation period of size 40, I would study the data closely to try to ascertain whether Model A merely "got lucky" in the validation period.
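A back-of-the-envelope calculation shows why a validation period of size 10 deserves caution. Assuming the errors are independent, normally distributed with zero mean and constant variance (an idealization, and not something the data will satisfy exactly), the relative standard error of an RMSE computed from n errors is approximately 1/sqrt(2n). The function name below is hypothetical.

```python
import math

def rmse_relative_std_error(n):
    """Approximate relative standard error of an RMSE based on n errors.

    Large-sample approximation for independent, zero-mean normal errors
    with constant variance; real forecast errors may be noisier than this.
    """
    return 1.0 / math.sqrt(2.0 * n)

for label, n in [("validation", 10), ("estimation", 40)]:
    pct = 100.0 * rmse_relative_std_error(n)
    print(f"{label} period (n={n}): RMSE uncertain by roughly +/-{pct:.0f}%")
```

Under these assumptions the RMSE from 10 validation points is uncertain by about 22%, twice the roughly 11% for 40 estimation points, so a slight edge in the validation period can easily be luck.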

Finally, remember to K.I.S.S. (keep it simple...) If two models are generally similar in terms of their error statistics and other diagnostics, you should prefer the one that is simpler and/or easier to understand. The simpler model is likely to be closer to the truth, and it will usually be more easily accepted by others.