Fitting time series regression models

Why do simple time series models sometimes outperform regression models fitted to nonstationary data?

Remember that if X and Y are nonstationary, we cannot necessarily assume that their statistical properties (such as their correlations with each other) are constant over time.

A simple time series model, by contrast, bases its forecasts on recent values of Y itself. In other words, recent values of Y might be good "proxies" not only for the effect of X but also for the effects of any omitted variables.

How to get the best of both worlds--regression and time series models:

1. Stationarize the variables (by differencing, logging, deflating, or whatever) before fitting a regression model.

Example: instead of regressing Y on X, regress DIFF(Y) on DIFF(X)

The regression equation is now: Ỹ(t) - Y(t-1) = a + b(X(t) - X(t-1))

...which is equivalent to: Ỹ(t) = Y(t-1) + a + bX(t) - bX(t-1)

Notice that this brings the prior values of both X and Y into the prediction.
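As a minimal sketch, the differenced regression can be fitted by ordinary least squares and then converted back to a level-form forecast exactly as in the equations above. The simulated series and coefficient values here are hypothetical, purely for illustration; only NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonstationary data: X is a random walk, Y tracks X plus noise.
x = np.cumsum(rng.normal(size=200))
y = 2.0 * x + rng.normal(size=200)

dx, dy = np.diff(x), np.diff(y)            # DIFF(X), DIFF(Y)

# Regress DIFF(Y) on DIFF(X):  DIFF(Y) = a + b*DIFF(X)
A = np.column_stack([np.ones_like(dx), dx])
(a, b), *_ = np.linalg.lstsq(A, dy, rcond=None)

# Equivalent level-form forecast:  Y-hat(t) = Y(t-1) + a + b*(X(t) - X(t-1))
y_hat = y[:-1] + a + b * dx                # b should land near the true slope of 2
```

Note that the fitted slope applies to the changes in X and Y, so the forecast automatically carries forward the most recent level of Y.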

2. Use lagged versions of the variables in the regression model.

Example: instead of regressing Y on X, regress Y on LAG(X,1) and LAG(Y,1)

The regression equation is now Ỹ(t) = a + bX(t-1) + cY(t-1)
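A sketch of the lagged-variable regression, again on hypothetical simulated data (the coefficients 0.4 and 0.5 are made up for the simulation, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)                   # hypothetical stationary regressor
y = np.empty(200)
y[0] = 0.0
for t in range(1, 200):                    # Y depends on lagged X and its own lag
    y[t] = 0.5 + 0.4 * x[t - 1] + 0.5 * y[t - 1] + rng.normal()

# Regress Y(t) on X(t-1) and Y(t-1):  Y-hat(t) = a + b*X(t-1) + c*Y(t-1)
A = np.column_stack([np.ones(199), x[:-1], y[:-1]])
(a, b, c), *_ = np.linalg.lstsq(A, y[1:], rcond=None)
y_hat = A @ np.array([a, b, c])
```

Including LAG(Y,1) as a regressor is what lets recent values of Y stand in for omitted variables, as discussed above.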

3. It often helps to do both--i.e., to stationarize the variables by differencing, then use lags of the stationarized variables.

Example: regress DIFF(Y) on LAG(DIFF(X),1) and LAG(DIFF(Y),1)

The regression equation is now: Ỹ(t) - Y(t-1) = a + b(X(t-1) - X(t-2)) + c(Y(t-1) - Y(t-2))

...which is equivalent to Ỹ(t) = Y(t-1) + a + b(X(t-1) - X(t-2)) + c(Y(t-1) - Y(t-2))
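The combined approach (difference first, then lag) can be sketched the same way. The data-generating coefficients here are hypothetical; the point is the alignment of the lagged, differenced columns and the level-form forecast:

```python
import numpy as np

rng = np.random.default_rng(2)
dx = rng.normal(size=300)                  # hypothetical stationarized X
dy = np.empty(300)
dy[0] = 0.0
for t in range(1, 300):                    # DIFF(Y) driven by lagged DIFF(X) and its own lag
    dy[t] = 0.3 + 0.5 * dx[t - 1] + 0.3 * dy[t - 1] + rng.normal()
y = np.cumsum(dy)                          # undifference to recover the level of Y

# Regress DIFF(Y)(t) on LAG(DIFF(X),1) and LAG(DIFF(Y),1)
A = np.column_stack([np.ones(299), dx[:-1], dy[:-1]])
coef, *_ = np.linalg.lstsq(A, dy[1:], rcond=None)

# Equivalent level form:  Y-hat(t) = Y(t-1) + a + b*DIFF(X)(t-1) + c*DIFF(Y)(t-1)
y_hat = y[:-1] + A @ coef
```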

How to decide which combinations of lags or differences to try?

1. Determine which transformations (if any) are needed to stationarize each variable by looking at time series plots and autocorrelation plots

2. Look at autocorrelations of the stationarized dependent variable (e.g., DIFF(Y)) to determine whether one or more of its lags is likely to be helpful

3. Look at cross-correlations between the stationarized dependent variable (the "first" series) and the stationarized independent variables (the "second" series). A significant cross-correlation at lag k suggests that the independent variable, lagged k periods, may be a useful regressor.
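The identification step above can be sketched with hand-rolled autocorrelation and cross-correlation functions (written out here rather than assuming any particular package; the simulated series is hypothetical). A common rule of thumb is that correlations outside roughly ±2/√n deserve attention:

```python
import numpy as np

def autocorr(z, lag):
    """Sample autocorrelation of z at the given lag (lag >= 1)."""
    z = z - z.mean()
    return float(np.dot(z[:-lag], z[lag:]) / np.dot(z, z))

def crosscorr(z1, z2, lag):
    """Correlation of z1(t) with z2(t - lag), lag >= 0."""
    if lag == 0:
        a, b = z1, z2
    else:
        a, b = z1[lag:], z2[:-lag]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(3)
dx = rng.normal(size=500)                  # hypothetical stationarized X
dy = np.empty(500)
dy[0] = 0.0
for t in range(1, 500):                    # DIFF(Y) responds to DIFF(X) one period later
    dy[t] = 0.5 * dy[t - 1] + 0.8 * dx[t - 1] + 0.3 * rng.normal()

threshold = 2.0 / np.sqrt(len(dy))         # rough significance band
r_auto1 = autocorr(dy, 1)                  # sizable: suggests LAG(DIFF(Y),1)
r_cross1 = crosscorr(dy, dx, 1)            # sizable: suggests LAG(DIFF(X),1)
r_cross0 = crosscorr(dy, dx, 0)            # near zero: contemporaneous DIFF(X) not needed
```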

Fitting a regression model to differenced and/or lagged data:

1. Regress the stationarized dependent variable on lags of itself and/or stationarized independent variables as suggested by autocorrelation and cross-correlation analysis

Example: DIFF(Y) shows a significant autocorrelation at lags 1 and 2 but not at higher lags, and a significant cross-correlation with DIFF(X) at lags 0 and 1. This suggests regressing DIFF(Y) on LAG(DIFF(Y),1), LAG(DIFF(Y),2), DIFF(X), and LAG(DIFF(X),1).
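Assembling the design matrix for a lag pattern like the one in the example mostly amounts to careful index alignment. This sketch uses hypothetical data (here DIFF(Y) actually depends only on contemporaneous DIFF(X), so only that coefficient is expected to be sizable); the column layout is the point:

```python
import numpy as np

rng = np.random.default_rng(4)
dx = rng.normal(size=300)                  # hypothetical stationarized X
dy = 0.4 * dx + rng.normal(size=300)       # hypothetical stationarized Y

p = 2                                      # longest lag in the candidate model
# Columns: intercept, LAG(DIFF(Y),1), LAG(DIFF(Y),2), DIFF(X), LAG(DIFF(X),1)
A = np.column_stack([
    np.ones(len(dy) - p),
    dy[p - 1:-1],                          # LAG(DIFF(Y),1)
    dy[:-2],                               # LAG(DIFF(Y),2)
    dx[p:],                                # DIFF(X), lag 0
    dx[p - 1:-1],                          # LAG(DIFF(X),1)
])
coef, *_ = np.linalg.lstsq(A, dy[p:], rcond=None)
```

Note that the first p observations are lost to lagging, which matters when comparing fits across models with different lag lengths.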

2. You can use a "stepwise" approach to model-fitting, but beware of over-fitting the data: be cautious in choosing how many lags and how many different independent variables to include at the beginning of the process.

3. It is especially important to VALIDATE your model with hold-out data when selecting models by automatic methods. Ideally you should withhold data during the model-selection process as well as during the final testing of the model.
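A minimal sketch of hold-out validation, again on hypothetical simulated data: fit on the earlier portion of the sample only, then compare error on the withheld portion. The split point and series are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
dx = rng.normal(size=300)                  # hypothetical stationarized regressor
dy = 0.5 * dx + rng.normal(size=300)       # hypothetical stationarized dependent variable

split = 240                                # withhold the last 60 observations
A = np.column_stack([np.ones_like(dx), dx])
coef, *_ = np.linalg.lstsq(A[:split], dy[:split], rcond=None)

fit_rmse = np.sqrt(np.mean((A[:split] @ coef - dy[:split]) ** 2))
val_rmse = np.sqrt(np.mean((A[split:] @ coef - dy[split:]) ** 2))
# A validation RMSE far above the fitting RMSE is a warning sign of over-fitting.
```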

4. The good news: The Forecasting procedure can be used to fit regression models to lagged and differenced data and to validate them.

(Setting AR=2 means that you want 2 autoregressive terms--i.e., the first two lags of the differenced dependent variable--included in the forecasting equation.)

5. The bad news: In principle you can also add regressors such as LAG(DIFF(X),1) to the ARIMA model. However, the regression option gives a "data error" message if there is more than 1 missing value at the beginning of a regressor, which will be the case if the total number of lags and differences is more than 1. (The regressor can be either lagged by one period or differenced, but not both.)
