What's the bottom line? How to compare models
After fitting a number of
different statistical forecasting models
to a given data set, you usually have a wealth of criteria by
which they can be compared:
- Error measures in the estimation period:
mean squared error, mean absolute error, mean absolute percentage
error, mean error, mean percentage
error
- Error measures in the validation period: Ditto
- Residual diagnostics and goodness-of-fit tests:
plots of residuals
versus time, versus predicted values, and versus other variables;
residual autocorrelation plots, crosscorrelation plots, and normal
probability plots; the Durbin-Watson statistic (another indicator
of serial correlation); coefficients of skewness and kurtosis
(other indicators of non-normality); measures of extreme or influential
observations; tests for excessive runs, changes in mean, or changes
in variance (lots of things that can be "OK" or "not
OK")
- Qualitative considerations: appearance of forecast
plots, intuitive
reasonableness of the model, simplicity of the model
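To make the error measures in the first bullet concrete, here is a minimal
sketch of how they are typically computed from actual values and one-step-ahead
forecasts. Python with numpy is used purely for illustration and the data are
made up; some packages also apply a small degrees-of-freedom adjustment to MSE
and RMSE, whereas plain averages are shown here.
```python
import numpy as np

# hypothetical estimation-period data: actual values and one-step-ahead forecasts
actual   = np.array([112.0, 118.0, 132.0, 129.0, 121.0, 135.0])
forecast = np.array([110.0, 115.0, 128.0, 131.0, 123.0, 131.0])

err = actual - forecast                      # forecast errors (residuals)

me   = err.mean()                            # mean error (signed: indicates bias)
mae  = np.abs(err).mean()                    # mean absolute error
mse  = (err ** 2).mean()                     # mean squared error
rmse = np.sqrt(mse)                          # root mean squared error

# the percentage measures require strictly positive actual values
mpe  = 100 * (err / actual).mean()           # mean percentage error
mape = 100 * np.abs(err / actual).mean()     # mean absolute percentage error
```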
With so many plots and statistics and considerations to worry about,
it's sometimes hard
to know which comparisons are most important. What's the real
bottom line?
If there is any one statistic
that normally takes precedence over
the others, it is the mean squared error within the estimation
period, or equivalently its square root, the root mean squared
error. The latter quantity
is also known as the standard error of the estimate in
regression analysis or the estimated white noise standard deviation
in ARIMA analysis. This is the statistic whose value is minimized
during the parameter estimation process, and it is the statistic
that determines the width of the confidence intervals for predictions.
The 95% confidence intervals for one-step-ahead forecasts are
approximately equal to the point forecast "plus or minus
2 standard errors"--i.e., plus or minus 2 times the root-mean-squared
error.
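For example (a back-of-the-envelope sketch with made-up numbers): if the
one-step-ahead point forecast is 150 and the estimation-period RMSE is 10,
the approximate 95% limits are 130 and 170.
```python
point_forecast = 150.0   # hypothetical one-step-ahead point forecast
rmse = 10.0              # root mean squared error from the estimation period

lower = point_forecast - 2 * rmse
upper = point_forecast + 2 * rmse
print(lower, upper)      # roughly 130.0 to 170.0
```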
Having planted this stake in the ground, I should add several
observations and qualifications:
- For purposes of communicating your results
to others, it is
usually best to report the root mean squared error (RMSE)
rather than mean squared error (MSE), because the RMSE is measured
in the same units as the data, rather than in squared units,
and is representative of the size of a "typical" error.
- The mean absolute error (MAE) is also measured in the same
units as the original data, and is usually similar in magnitude
to, but slightly smaller than, the root mean squared error. The
mathematically challenged usually find this an easier statistic
to understand than the RMSE.
- The mean absolute percentage error (MAPE) is also
often
useful for purposes of reporting, because it is expressed in generic
percentage terms which will make some kind of sense even to someone
who has no idea what constitutes a "big" error in terms
of dollars spent or widgets sold. The MAPE can only be computed
with respect to data that are guaranteed to be strictly positive.
(Note: if the Statgraphics Forecasting procedure does not display
MAPE in its model-fitting results, this usually means that the
input variable contains zeroes or negative numbers, which can happen
if it was differenced outside the forecasting procedure or if
it represents a quantity that can honestly be zero in some periods. The
latter situation may arise when dealing with highly disaggregated
data--e.g., sales of a particular color of a particular model
by a particular store.)
- The mean error (ME) and mean percentage error
(MPE)
that are reported in some statistical procedures are signed
measures of error which indicate whether the forecasts are biased--i.e.,
whether they tend to be disproportionately positive or negative.
Bias is normally considered a bad thing, but it is not the bottom
line. Bias is one component of the mean squared error--in fact
mean squared error equals the variance of the errors plus the
square of the mean error. That is: MSE = VAR(E) + (ME)^2.
(This formula is useful when you need to compute MSE in a spreadsheet
model, as we did in the Outboard Marine
spreadsheet.) Hence, if you try to minimize mean squared error, you
are implicitly
minimizing the bias as well as the variance of the errors. In
a model that includes a constant term, the mean squared
error will be minimized when the mean error is exactly zero,
so you should expect the mean error to always be zero within the
estimation period. (Note: as
reported in the Statgraphics Forecasting procedure, the mean error in
the estimation period may be slightly different from zero if the model
included a log transformation as an option, because the forecasts and
errors are automatically unlogged before the statistics are
computed--see below.)
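Here is a quick numerical check of the MSE = VAR(E) + (ME)^2 identity, as a
minimal sketch with made-up error values; the population form of the variance
(numpy's default) is the one that makes the identity exact.
```python
import numpy as np

err = np.array([2.0, -1.0, 3.0, -2.0, 4.0, 0.0])   # hypothetical forecast errors

me  = err.mean()            # mean error (the bias component)
var = err.var()             # variance of the errors (population form, ddof=0)
mse = (err ** 2).mean()     # mean squared error

print(mse, var + me ** 2)   # both print 5.666..., i.e. MSE = VAR(E) + (ME)^2
```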
- The root mean squared error is more sensitive than other
measures
to the occasional large error: the squaring process gives
disproportionate weight to very large errors. If an occasional
large error is not a problem in your decision situation (e.g.,
if the true cost of an error is roughly proportional to the size
of the error, not the square of the error), then the MAE or MAPE
may be a more relevant criterion. In many cases these statistics
will vary in unison--the model that is best on one of them will
also be best on the others--but this may not be the case when
the error distribution has "outliers." (Actually, if
one model is best on one measure and another is best on another
measure, they are probably pretty similar in terms of their average
errors. In such cases you probably should give more weight to
some of the other criteria for comparing models--e.g., simplicity,
intuitive reasonableness, etc.)
- The root mean squared error (and mean absolute error) can
only
be compared between models whose errors are measured in the same
units (e.g., dollars, or constant dollars, or widgets sold,
or whatever). If one model's errors are adjusted for inflation
while those of another are not, or if one model's errors are in
absolute units while another's are in logged units, their error
measures cannot be directly compared. In such cases, you have
to convert the errors of both models into comparable units before
computing the various measures. This means converting the forecasts of
one model to the same units as those of the other by unlogging or
undeflating (or whatever), then subtracting those forecasts from actual
values to obtain errors in comparable units, then computing statistics
of those errors, as sketched below. You cannot get the same effect by merely
unlogging or undeflating the error statistics themselves! The
Forecasting procedure in Statgraphics
is designed to take care of these calculations for you: the errors are
automatically converted back into the original units of the input
variable (i.e., all transformations performed as options within
the Forecasting procedure are reversed) before computing the statistics
shown in the Analysis Summary report and Model Comparison report.
However, other procedures
in Statgraphics (and most other stat programs) do not make life this
easy for you.
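Here is a minimal sketch of the correct order of operations for a
log-transformed model, together with the shortcut that does not work. Python
with numpy is used for illustration; the data and the logged-model forecasts
are made up.
```python
import numpy as np

actual = np.array([100.0, 110.0, 125.0, 140.0, 160.0])

# hypothetical one-step-ahead forecasts from a model fitted to the natural log of the data
log_forecast = np.array([4.58, 4.72, 4.80, 4.95, 5.06])

# correct: unlog the forecasts first, then compute errors in the original units
forecast = np.exp(log_forecast)
rmse_original_units = np.sqrt(((actual - forecast) ** 2).mean())

# NOT equivalent: computing the RMSE on the logged scale and then unlogging that statistic
rmse_logged_scale = np.sqrt(((np.log(actual) - log_forecast) ** 2).mean())
not_comparable = np.exp(rmse_logged_scale)

print(rmse_original_units)   # an error measured in the original units (about 2.5 here)
print(not_comparable)        # a factor near 1, not an error in the original units
```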
- There is no absolute criterion for a "good" value
of RMSE or MAE: it depends on the units in which the
variable is measured and on the degree of forecasting accuracy,
as measured in those units, which is sought in a particular
application.
Depending on the choice of units, the RMSE or MAE of your
best model could be measured in zillions or one-zillionths. It
makes no sense to say "the model is good (bad) because the
root mean squared error is less (greater) than x", unless you
are referring to a specific degree of accuracy that is relevant
to your forecasting application.
- When comparing regression models that use the same
dependent
variable and the same estimation period, the root-mean-squared-error
goes down as adjusted R-squared goes up. Hence, the model
with the highest adjusted R-squared will have the lowest root mean
squared error, and you can just as well use adjusted R-squared
as a guide. However, when comparing regression models in which
the dependent variables were transformed in different ways (e.g.,
differenced in one case and undifferenced in another, or logged
in one case and unlogged in another), or which used different
sets of observations as the estimation period, R-squared is not
a reliable guide to model quality. (See the notes on "What's a good value for R-squared?")
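The equivalence between ranking by adjusted R-squared and ranking by the
standard error of the estimate (the adjusted RMSE of a regression) can be seen
in a small sketch; the data, the two candidate models, and the helper function
are all made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data set: one dependent variable and two candidate regressors
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y  = 5 + 2 * x1 + rng.normal(scale=3, size=n)

def fit_stats(regressors, y):
    """Standard error of the estimate and adjusted R-squared for an OLS fit with a constant."""
    X = np.column_stack([np.ones(len(y))] + list(regressors))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    nobs, ncoef = X.shape
    s2 = resid @ resid / (nobs - ncoef)     # estimated error variance
    adj_r2 = 1 - s2 / y.var(ddof=1)         # adjusted R-squared
    return np.sqrt(s2), adj_r2

se_a, r2_a = fit_stats([x1], y)        # model A: x1 only
se_b, r2_b = fit_stats([x1, x2], y)    # model B: x1 and x2

# same y and same estimation period: higher adjusted R-squared <=> lower standard error
print(se_a, r2_a)
print(se_b, r2_b)
```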
- Don't split hairs: a model with a root-mean-squared error
of 3.25 is not really much better than one with an RMSE of 3.32.
Remember that the width of the confidence intervals is proportional
to the RMSE, and ask yourself how much of a relative decrease in
the width of the confidence intervals would be noticeable on a
plot. It may be useful to think of this in percentage terms: if
one model's RMSE is 30% lower than another's, that is probably
very significant. If it is 10% lower, that is probably somewhat
significant. If it is only 2% better, that is probably not significant.
These distinctions are especially important when you are trading off
model
complexity against the error measures: it is probably not worth
adding another independent variable to a regression model to decrease
the RMSE by only a few more percent. (Note: the RMSE and adjusted
R-squared statistics already include a minor adjustment for the number
of coefficients estimated in order to make them "unbiased estimators",
but a heavier penalty on model complexity really ought to be imposed
for purposes of selecting among models. Sophisticated software for
automatic model selection generally seeks to minimize error measures
which impose such a heavier penalty, such as the Mallows Cp statistic,
the Akaike Information Criterion (AIC) or Schwarz' Bayesian Information
Criterion (BIC). How these are computed is somewhat beyond the scope of
the current discussion, but suffice it to say that when you--rather
than the computer--are selecting among models, you should show some
preference for the model with fewer parameters, other things being
approximately equal.)
- The root mean squared error is a valid indicator of
relative model
quality only if it can be trusted. If there is evidence
that the model is badly mis-specified (i.e., if it grossly
fails the diagnostic tests of its underlying assumptions) or that
the data in the estimation period has been over-fitted
(i.e., if the model has a relatively large number of parameters
for the number of observations fitted and its comparative performance
deteriorates badly in the validation period), then the root mean
squared
error and all other error measures in the estimation period
may need to be heavily discounted. (If there is evidence only
of minor mis-specification of the model--e.g., modest amounts
of autocorrelation in the residuals--this does not completely
invalidate the model or its error statistics. Rather, it only
suggests that some fine-tuning of the model is still possible.
For example, it may indicate that another lagged variable could
be profitably added to a regression or ARIMA model.)
- The error measures in the validation period are
also
very important--indeed, in theory the model's performance in the
validation period is the best guide to its ability to predict
the future. The caveat here is that the validation period is usually
a much smaller sample of data than the estimation period.
Hence, it is possible that a model may do unusually well or badly
in the validation period merely by virtue of getting lucky or
unlucky--e.g., by making the right guess about an unforeseeable
upturn or downturn in the near future, or by being less sensitive
than other models to an unusual event that happens at the start
of the validation period. Unless you have enough data to hold
out a large and representative sample for validation, it is probably
better to interpret the validation period statistics in a more
qualitative way: do they wave a "red flag" concerning
the possible unreliability of statistics in the estimation period,
or not? (Remember that the comparative error statistics that
Statgraphics reports for the estimation and validation periods
are in original, untransformed units. If you used a log
transformation as a model option in order to reduce heteroscedasticity
in the residuals, you should expect the unlogged errors in the
validation period to be much larger than those in the estimation
period. Of course, you can still compare validation-period statistics
across models in this case.)
- In trying to ascertain whether the error measures in the
estimation
period are reliable, you should consider whether the model under
consideration is likely to have overfitted the data. If
the model has only one or two parameters (such as a random walk,
exponential smoothing, or simple regression model) and was fitted
to a moderate or large sample of data (e.g., 30 observations or
more), then it is probably unlikely to have overfitted the data.
But if it has many parameters relative to the number of observations
in the estimation period (e.g., a model that uses seasonal adjustment
and/or a large number of regressors), then overfitting is a distinct
possibility. As a rough guide here, calculate the number of data
points in the estimation period per coefficient estimated, including
seasonal indices if any. If you have considerably fewer than 10 data points
per coefficient estimated, you should be alert to the possibility
of overfitting, and with fewer than 5 data points per coefficient there
is a very real danger. (Think of it this way: how large a
sample of data would you want in order to estimate a single
coefficient, namely the mean? Although there are efficiencies to
be gained when estimating several coefficients simultaneously from the
same sample, this is still a useful guide.) For example, I would
hesitate to fit a model with
as many as 4 regressors to a sample of only 20 data points, and
I would be cautious about estimating seasonal indices with fewer
than 4 full seasons of data. Also, regression models which are
chosen by applying automatic model-selection techniques
(e.g., stepwise or all-possible regressions) to large numbers
of potential variables are prone to overfit the data, even
if the number of regressors in the final model is small. Of course,
sometimes you have little choice about the number of parameters
the model ought to include: for example, if the data are strongly
seasonal, then you must estimate the seasonal pattern in some
fashion, no matter how small the sample. But in such cases, you
should expect the errors made in predicting the future to be larger
than those that were made in fitting the past. (Note: ARIMA
models appear at first glance to require relatively few parameters
to fit seasonal patterns, but this is somewhat misleading. In
order to initialize a seasonal ARIMA model, it is necessary
to estimate the seasonal pattern that occurred in "year 0,"
which is comparable to the problem of estimating a full set of
seasonal indices. Indeed, it is usually claimed that more seasons
of data are required to fit a seasonal ARIMA model than to fit
a seasonal decomposition model.)
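That rough data-points-per-coefficient guide is easy to mechanize. In the tiny
sketch below, the thresholds of 10 and 5 are the rules of thumb from the
preceding paragraph, not hard statistical cutoffs, and the helper function is
purely illustrative.
```python
def overfitting_check(n_observations: int, n_coefficients: int) -> str:
    """Rough data-points-per-coefficient guide; count seasonal indices among the coefficients."""
    ratio = n_observations / n_coefficients
    if ratio < 5:
        return f"{ratio:.1f} points per coefficient: very real danger of overfitting"
    if ratio < 10:
        return f"{ratio:.1f} points per coefficient: be alert to overfitting"
    return f"{ratio:.1f} points per coefficient: overfitting is less of a concern"

# e.g., 4 regressors plus a constant fitted to only 20 observations
print(overfitting_check(20, 5))   # 4.0 points per coefficient: very real danger of overfitting
```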
- By the same token, in trying to judge whether the error
statistics
are reliable, you should ask whether it is likely that
the model is mis-specified. Are its assumptions intuitively
reasonable? Would it be easy or hard to explain this model to
someone else? Do the forecast
plots look like a reasonable extrapolation of the past data? If the
assumptions seem reasonable, then it is more likely that
the error statistics can be trusted than if the assumptions were
questionable.
- Although the confidence intervals for one-step-ahead
forecasts
are based almost entirely on RMSE, the confidence intervals for
the longer-horizon forecasts that can be produced by time-series
models depend heavily on the underlying modeling assumptions,
particularly assumptions about the variability of the trend.
The confidence intervals for some models widen relatively slowly
as the forecast horizon is lengthened (e.g., simple exponential
smoothing models with small values of "alpha", simple
moving averages, seasonal random walk models, and linear trend
models). The confidence intervals widen much faster for other
kinds of models (e.g., nonseasonal random walk models, seasonal
random trend models, or linear exponential smoothing models).
The rate at which the confidence intervals widen is not a reliable
guide to model quality: what is important is that the model makes
the correct assumptions about how uncertain the
future is. It is very important that the model should pass the
various residual diagnostic tests and "eyeball" tests
in order for the confidence intervals for longer-horizon forecasts
to be taken seriously.
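To illustrate how differently confidence intervals can widen with the forecast
horizon, here is a minimal sketch using two standard textbook formulas, not any
particular software's computations: for a random walk the h-step-ahead forecast
standard error grows like sigma*sqrt(h), whereas for simple exponential
smoothing it is approximately sigma*sqrt(1 + (h-1)*alpha^2), which widens very
slowly when alpha is small.
```python
import numpy as np

sigma = 10.0              # one-step-ahead RMSE, assumed the same for both models
alpha = 0.2               # small smoothing constant for simple exponential smoothing
horizons = np.arange(1, 13)

# approximate half-widths of 95% intervals (2 x forecast standard error)
rw_half  = 2 * sigma * np.sqrt(horizons)                         # nonseasonal random walk
ses_half = 2 * sigma * np.sqrt(1 + (horizons - 1) * alpha ** 2)  # simple exponential smoothing

for h, rw, ses in zip(horizons, rw_half, ses_half):
    print(f"h={h:2d}   random walk: +/-{rw:5.1f}   SES(alpha=0.2): +/-{ses:5.1f}")
```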
So... the bottom line is that
you should put the most weight on
the error measures in the estimation period--most often
the RMSE, but sometimes MAE or MAPE--when comparing among
models. (If your software is capable of computing them, you may also
want to look at Cp, AIC or BIC.) But you should keep an eye on the
validation-period results,
residual diagnostic tests, and qualitative considerations such
as the intuitive reasonableness and simplicity of your model. The
residual diagnostic tests are not the bottom line--you should
never choose Model A over Model B merely because Model A got more
"OK's" on its residual tests. (What would you rather
have: smaller errors or more random-looking errors?) A model which
fails some of the residual tests or reality checks in only a minor
way is probably subject to further improvement,
whereas it is the model which flunks such tests in a major way
that cannot be trusted.
The validation-period
results are not necessarily the last word either, because of the issue
of sample size: if Model A
is slightly better in a validation period of size 10 while Model
B is much better over an estimation period of size 40,
I would study the data closely to try to ascertain whether Model A
merely "got lucky"
in the validation period.
Finally, remember to K.I.S.S.
(keep it simple...) If two models are generally similar in terms of
their error statistics and other diagnostics, you should prefer the one
that is simpler and/or easier to understand. The simpler model is
likely to be closer to the truth, and it will usually be more easily
accepted by others.