ARIMA models for time series forecasting

ARIMA models with regressors

An ARIMA model can be considered as a special type of regression model--in which the dependent variable has been stationarized and the independent variables are all lags of the dependent variable and/or lags of the errors--so it is straightforward in principle to extend an ARIMA model to incorporate information provided by leading indicators and other exogenous variables: you simply add one or more regressors to the forecasting equation.

Alternatively, you can think of a hybrid ARIMA/regression model as a regression model which includes a correction for autocorrelated errors. If you have fitted a multiple regression model and find that its residual ACF and PACF plots display an identifiable autoregressive or moving-average "signature" (e.g., some significant pattern of autocorrelations and/or partial autocorrelations at the first few lags and/or the seasonal lag), then you might wish to consider adding ARIMA terms (lags of the dependent variable and/or the errors) to the regression model to eliminate the autocorrelation and further reduce the mean squared error. To do this, you would just re-fit the regression model as an ARIMA model with regressors, and you would specify the appropriate AR and/or MA terms to fit the pattern of autocorrelation you observed in the original residuals.
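The diagnostic step described above can be sketched in a few lines of Python. This is only an illustration on simulated data, not the Statgraphics procedure: the helper names (ols_simple, acf), the simulated series, and the AR(1) coefficient of 0.7 are all assumptions made for the example. It fits a simple regression, computes the residual autocorrelations, and compares them with the rough 95% band of 2/sqrt(n).

```python
import random

def ols_simple(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def acf(series, max_lag):
    """Sample autocorrelations at lags 1..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [s - m for s in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(1, max_lag + 1)]

random.seed(0)
n = 300
x = [random.gauss(0, 1) for _ in range(n)]

# Hypothetical data: the regression errors follow an AR(1) process (phi = 0.7).
u = [0.0] * n
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + random.gauss(0, 1)
y = [2.0 + 1.5 * xi + ui for xi, ui in zip(x, u)]

a, b = ols_simple(x, y)
resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]

r = acf(resid, 5)          # residual autocorrelations at lags 1..5
bound = 2 / n ** 0.5       # rough 95% significance band
for k, rk in enumerate(r, start=1):
    flag = "significant" if abs(rk) > bound else ""
    print(f"lag {k}: {rk:+.3f} {flag}")
```

Residual autocorrelations that are well outside the band at the first few lags are the kind of "signature" that would suggest re-fitting the regression with AR and/or MA terms added.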

Most high-end forecasting software offers one or more options for combining the features of ARIMA and multiple regression models. In the Forecasting procedure in Statgraphics, you can do this by specifying "ARIMA" as the model type and then hitting the "Regression" button to add regressors. (Alas, you are limited to 5 additional regressors.) When you add a regressor to an ARIMA model in Statgraphics, it literally just adds the regressor to the right-hand-side of the ARIMA forecasting equation. To use a simple case, suppose you first fit an ARIMA(1,0,1) model with no regressors. Then the forecasting equation fitted by Statgraphics is:

$$\hat{Y}_t = \mu + \phi_1 Y_{t-1} - \theta_1 e_{t-1}$$

which can be rewritten as:

$$\hat{Y}_t - \phi_1 Y_{t-1} = \mu - \theta_1 e_{t-1}$$

(Note: this is a standard mathematical form which is often used for ARIMA models. All terms involving the dependent variable, i.e., all the AR terms and differences, are collected on the left-hand side of the equation, while all terms involving the errors, i.e., the MA terms, are collected on the right-hand side.) Now, if you add a regressor X to the forecasting model, the equation fitted by Statgraphics is:

$$\hat{Y}_t - \phi_1 Y_{t-1} = \mu - \theta_1 e_{t-1} + \beta (X_t - \phi_1 X_{t-1})$$

Thus, the AR part of the model (and also the differencing transformation, if any) is applied to the X variable in exactly the same way as it is applied to the Y variable, and only then is the transformed X multiplied by the regression coefficient. This effectively means that the ARIMA(1,0,1) model is fitted to the errors of the regression of Y on X (i.e., the series "Y minus βX").
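The equivalence claimed here can be checked numerically. The sketch below uses made-up parameter values and random data (all of them assumptions for illustration, not estimates from any real model). It computes one-step forecasts in two ways: directly from the equation with the regressor, and by applying the plain ARIMA(1,0,1) equation to Z = Y - βX and then adding βX back. The pre-sample error is taken to be zero in both cases.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
mu, phi, theta, beta = 0.5, 0.6, 0.3, 2.0

random.seed(1)
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]  # arbitrary data: the identity holds for any series

def forecasts_with_regressor(y, x):
    """Yhat_t = mu + phi*Y_{t-1} - theta*e_{t-1} + beta*(X_t - phi*X_{t-1})."""
    e_prev, out = 0.0, []
    for t in range(1, len(y)):
        yhat = (mu + phi * y[t - 1] - theta * e_prev
                + beta * (x[t] - phi * x[t - 1]))
        out.append(yhat)
        e_prev = y[t] - yhat
    return out

def forecasts_via_regression_errors(y, x):
    """Forecast Z = Y - beta*X with the plain ARIMA(1,0,1) equation,
    then add beta*X_t back to get the forecast of Y."""
    z = [yi - beta * xi for yi, xi in zip(y, x)]
    e_prev, out = 0.0, []
    for t in range(1, len(z)):
        zhat = mu + phi * z[t - 1] - theta * e_prev
        out.append(zhat + beta * x[t])
        e_prev = z[t] - zhat
    return out

direct = forecasts_with_regressor(y, x)
indirect = forecasts_via_regression_errors(y, x)
max_gap = max(abs(a - b) for a, b in zip(direct, indirect))
print(max_gap)  # differs from zero only by floating-point rounding
```

The two sets of forecasts agree because substituting Z = Y - βX into the ARIMA(1,0,1) equation and rearranging reproduces the equation with the regressor term by term.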

How can you tell if it might be helpful to add a regressor to an ARIMA model? One approach would be to save the RESIDUALS of the ARIMA model and then look at their cross-correlations with other potential explanatory variables. For example, recall that we previously fitted a regression model to seasonally adjusted auto sales, in which the LEADIND variable (index of eleven leading economic indicators) turned out to be slightly significant in addition to lags of the stationarized sales variable. Perhaps LEADIND would also be helpful as a regressor in the seasonal ARIMA model we subsequently fitted to auto sales.

To test this hypothesis, the RESIDUALS from the ARIMA(0,1,1)x(0,1,1) model fitted to AUTOSALE were saved, and their cross-correlations with DIFF(LOG(LEADIND)) were plotted in the Descriptive Methods procedure:

[Figure: cross-correlations of RESIDUALS with DIFF(LOG(LEADIND))]

(A couple of minor technical points to note here: we have logged and differenced LEADIND to stationarize it because the RESIDUALS of the ARIMA model are also logged and differenced, i.e., expressed in units of percentage change. Also, the Descriptive Methods procedure, like the Forecasting procedure, does not like variables which begin with too many missing values. Here the missing values at the beginning of the RESIDUALS variable were replaced by zeroes, typed in by hand, before running the Descriptive Methods procedure. Actually, the Forecasting procedure is supposed to automatically draw cross-correlation plots of the residuals versus other variables, but the graph which is labeled "Residual Cross-Correlation Plot" merely shows the cross-correlations of the input variable versus other variables.)
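For readers without Statgraphics, the cross-correlations themselves are simple to compute. The following is a minimal sketch on simulated data: the function name cross_correlation, the lag convention (a positive lag means the candidate regressor leads the residuals), and the simulated one-period delay are all assumptions of this example, and Statgraphics may normalize or sign its lags differently.

```python
import random

def cross_correlation(resid, x, lag):
    """Sample correlation between resid[t] and x[t - lag], for lag >= 0.
    A positive lag means x leads the residuals by `lag` periods."""
    n = len(resid)
    mr = sum(resid) / n
    mx = sum(x) / n
    num = sum((resid[t] - mr) * (x[t - lag] - mx) for t in range(lag, n))
    den = (sum((r - mr) ** 2 for r in resid)
           * sum((v - mx) ** 2 for v in x)) ** 0.5
    return num / den

random.seed(2)
n = 400
x = [random.gauss(0, 1) for _ in range(n)]

# Hypothetical residuals that respond to x with a one-period delay.
resid = [0.0] + [0.5 * x[t - 1] + random.gauss(0, 1) for t in range(1, n)]

ccf = {k: cross_correlation(resid, x, k) for k in range(0, 4)}
bound = 2 / n ** 0.5  # rough 95% significance band

# Only lags >= 1 can be used for one-step-ahead forecasting.
useful_lags = [k for k, r in ccf.items() if k >= 1 and abs(r) > bound]
for k, r in ccf.items():
    print(f"lag {k}: {r:+.3f}")
```

The filter on lags of 1 or more reflects the practical constraint in the text: a regressor is only usable for one-step-ahead forecasting at lags whose values are already known when the forecast is made.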

We see that the most significant cross-correlation is at lag 0, but unfortunately we cannot use that for forecasting one month ahead, because the lag-0 value of LEADIND is not yet available when the forecast is made. Instead, we must try to exploit the smaller cross-correlations at lags 1 and/or 2. As a quick test of whether lags of DIFF(LOG(LEADIND)) are likely to add anything to our ARIMA model, we can use the Multiple Regression procedure to regress RESIDUALS on lags of DIFF(LOG(LEADIND)). Here is the result of regressing RESIDUALS on LAG(DIFF(LOG(LEADIND)),1):

[Output: regression of RESIDUALS on LAG(DIFF(LOG(LEADIND)),1)]

The R-squared value of only 3.66% suggests that not much improvement is possible. (If two lags of DIFF(LOG(LEADIND)) are used, the R-squared only increases to 4.06%.) If we return to the ARIMA procedure and add LAG(DIFF(LOG(LEADIND)),1) as a regressor, we obtain the following model-fitting results:

[Output: ARIMA(0,1,1)x(0,1,1) model-fitting results with LGDFLGLEAD as a regressor]

(Minor technical point here: we stored the values of LAG(DIFF(LOG(LEADIND)),1) in a new column, filled in the two missing values at the beginning with zeroes, and assigned the new column the name LGDFLGLEAD.) We see that when a coefficient for the lag of DIFF(LOG(LEADIND)) is estimated simultaneously with the other parameters of the model, it is even less significant than it was in the regression model for RESIDUALS. The improvement in the root-mean-squared error is just too small to be noticeable.
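To see why an R-squared of 3.66% promises so little, note that regressing the residuals on a new variable shrinks their standard deviation by a factor of roughly sqrt(1 - R²). With R² = 0.0366 this factor is about 0.98, i.e., a reduction of roughly 2% in root-mean-squared error at best. The quick test itself can be sketched as follows. The data here are simulated stand-ins, not the AUTOSALE or LEADIND series, and the names r_squared and dlog_lead are assumptions of this example.

```python
import random

def r_squared(x, y):
    """R-squared of the simple regression of y on x (with a constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)

random.seed(3)
n = 300

# Simulated stand-in for DIFF(LOG(LEADIND)): small monthly percentage changes.
dlog_lead = [random.gauss(0, 0.01) for _ in range(n)]

# Simulated stand-in for the ARIMA residuals, weakly related to last month's change.
resid = [0.0] + [0.4 * dlog_lead[t - 1] + random.gauss(0, 0.02)
                 for t in range(1, n)]

# Pair each residual with the previous period's value of the regressor (lag 1).
r2 = r_squared(dlog_lead[:-1], resid[1:])

# Approximate proportional reduction in RMSE if the regressor were added.
rmse_reduction = 1 - (1 - r2) ** 0.5
print(f"R-squared: {r2:.4f}, approx. RMSE reduction: {rmse_reduction:.2%}")
```

This is only a screening device: as the text goes on to show, the coefficient can turn out even less significant once it is estimated jointly with the ARIMA parameters.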

The negative result we obtained here should not be taken to suggest that regressors will never be helpful in ARIMA models or other time series models. For example, variables which measure advertising or price levels or the occurrence of promotional events are often helpful in augmenting ARIMA models (and exponential smoothing models) for forecasting sales at the level of the firm or product. Remember that the variable being analyzed here--nationwide sales at automotive dealers--is a highly aggregated macroeconomic time series. We have learned by now that the impact on a macroeconomic variable of events which occurred in earlier periods (e.g., changes in various economic factors that make up the index of leading indicators) is often most clearly represented in the prior history of that variable itself. Hence, lagged values of other macroeconomic time series may have little to add to a forecasting model which has already fully exploited the history of the original time series. Leading economic indicators are often more useful when applied as they are intended--namely as indicators of turning points in business cycles that may have a bearing on the direction of longer-term trend projections.