Several persons have asked the question: how do you do head-to-head comparisons of models fitted in the regression procedure (where differencing, logging, and/or seasonal adjustment must be applied “manually” to the input variable) against models fitted

Notes concerning the final project

Seasonal differencing versus seasonal adjustment: A seasonal difference is not the same thing as seasonal adjustment: seasonal adjustment is supposed to remove all the seasonal pattern while leaving the trend and cyclical patterns unaffected. Seasonal differencing computes the change relative to the previous season's value, which normally removes most but not all of the seasonal pattern and simultaneously removes most of the trend.

Seasonal differencing often makes the series approximately--but not quite--stationary. An AR(1) term--i.e., one lag of the seasonally differenced series--then often helps to fully stationarize the series. (In extreme cases a nonseasonal difference may be needed in addition to a seasonal difference in order to stationarize the series, and this usually leads to a slight case of overdifferencing which is best corrected by MA or SMA terms.)

The SDIFF function can be used to take a seasonal difference--e.g., SDIFF(Y,12) is the seasonal difference of a series Y with seasonal period 12. You do not normally need to use this function in any of the time series procedures, where seasonal differencing is a modeling option, but you might conceivably use it on the dependent variable in a regression model.
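As a rough illustration of what a seasonal difference such as SDIFF(Y,12) computes, here is a minimal Python sketch. The function name sdiff and the toy quarterly series are assumptions made for this example; they are not part of the software described in these notes.

```python
def sdiff(y, period):
    """Seasonal difference y[t] - y[t - period].

    The first `period` entries have no value one season earlier,
    so they are returned as None (missing).
    """
    return [None] * period + [y[t] - y[t - period] for t in range(period, len(y))]


# Toy quarterly series (period 4): linear trend of 2 per quarter
# plus a fixed seasonal pattern that repeats every year.
seasonal = [10, -5, 0, -5]
y = [100 + 2 * t + seasonal[t % 4] for t in range(12)]

d = sdiff(y, 4)
# The seasonal pattern cancels and the trend collapses to a constant:
# every non-missing value is 8 (the trend of 2 per quarter over 4 quarters).
print(d)
```

In this artificial case the seasonal pattern is perfectly regular, so it is removed completely; with real data some of it normally remains.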

Seasonal adjustment, on the other hand, is done in the Seasonal Decomposition procedure, and after running this procedure you must save the seasonally adjusted values as a new column on the spreadsheet (the default name is SADJUSTED). You may also want to save the seasonal indices (the default name is INDICES). If you save the seasonal indices, you will initially get only 12 values--one for each month in the first year. If later you wish to use the seasonal indices to "reseasonalize" a forecast, you will need to extend them down to the bottom of the column using the RESHAPE function. To do this, highlight the INDICES column, select the "Generate Data" command, and type RESHAPE(INDICES,293) as the expression. This will repeat the pattern of seasonal indices over and over down to row 293 (or whatever row number is appropriate for your data).
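The effect of repeating the 12 saved indices down the column can be sketched in Python as follows. The helper name repeat_to_length and the index values are hypothetical, chosen only to show the cyclic repetition described above.

```python
def repeat_to_length(values, n):
    """Repeat `values` cyclically until the result has exactly n entries."""
    return [values[i % len(values)] for i in range(n)]


# Twelve made-up monthly indices (one per month of the first year).
indices = [0.90, 0.85, 1.00, 1.05, 1.10, 1.05, 1.00, 0.95, 1.00, 1.00, 1.05, 1.05]

# Extend the pattern to 293 rows, matching the example in the text.
extended = repeat_to_length(indices, 293)
print(len(extended), extended[12] == indices[0])
```

Row 13 of the extended column (position 12 when counting from zero) is again the first month's index, and so on to the end of the column.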

Strategy 1 for comparing models--stay in the Forecasting procedure: It will be easiest to make the various comparisons if you can do most of your work in the same procedure. For example, if you begin by fitting various time series models in the Forecasting procedure, you can follow the approach that was outlined in lecture 11: save the residuals of the best ARIMA model (using the "Save Results" button on the Analysis Window toolbar, 4th from the left), and then use the Descriptive Methods procedure to see whether there are any interesting cross-correlations between the residuals and other variables. You could also play around with multiple regression models to see if combinations of variables were helpful in explaining some of the residual variance. If you identify one or more variables which, when lagged by at least one period, appear to be significant predictors of the residuals, then you can return to the Forecasting procedure and try adding these variables as regressors (using the Regression option). In version 1.4 you can use up to 5 regressors, and in version 2.1 you can use only one. (Also, remember that if you lag and/or difference the regressors, it may be helpful to store them as new columns and then delete rows that contain missing values, etc.)

Strategy 2 for comparing models--stay (mostly) in the regression and/or GLM procedures: If you want to do most of your work in the regression procedure, you will need to decide straight off on a method for dealing with seasonality. The two most straightforward options are (a) seasonally adjust the dependent variable (perhaps after deflating), or (b) fit a regression model which corresponds to a seasonal ARIMA model that you have already identified as a reasonable model--in which the SDIFF function is presumably applied to the dependent variable--and then try to add other regressors. In either case, you will simplify matters if you deal with inflation in the same way in all your models. If you are going to deflate (i.e., divide by CPI83) then you should do it in all procedures as the very first step.

Using seasonal adjustment in the regression procedure: If you seasonally adjust the dependent variable in the regression procedure, it is somewhat messy (though not impossible) to undo the seasonal adjustment in order to compare the model's errors to those of other models fitted in the Forecasting procedure. For example, suppose that you seasonally adjust your deflated sales variable (RSALES/CPI83 or whatever), save the adjusted values as SADJUSTED, and then fit a regression model in which DIFF(SADJUSTED) is used as the dependent variable. (This was one of the strategies we tried with the auto sales data.) If you then save the residuals of this model, you can reseasonalize the residuals by multiplying them by the INDICES variable you created during the seasonal adjustment step (assuming that you have also repeated its values down to the end of the column). Separately, you can save the residuals of one or more models fitted in the Forecasting procedure (using RSALES/CPI83 or whatever as the dependent variable). (You should of course assign different names to the two residual series--e.g., REGRESIDS and ARIMRESIDS.) Then all the saved residuals should be in comparable units, and you could (for example) use the Compare/Two Samples/Two Sample Comparison procedure to compare them.
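A minimal Python sketch of the reseasonalizing step, followed by one simple way to summarize the two residual series, might look like this. It assumes the seasonal indices are expressed as ratios around 1 (if your indices are stored as percentages, divide them by 100 first), and all numbers and function names are made up for illustration. The root-mean-squared error used here is only one possible summary, not a substitute for the Two Sample Comparison procedure.

```python
import math


def reseasonalize(residuals, indices):
    """Multiply each residual by the seasonal index for its period (None = missing)."""
    return [None if r is None else r * s for r, s in zip(residuals, indices)]


def rmse(residuals):
    """Root-mean-squared error, skipping missing values."""
    vals = [r for r in residuals if r is not None]
    return math.sqrt(sum(v * v for v in vals) / len(vals))


# Hypothetical data: residuals from the regression on DIFF(SADJUSTED)
# and the seasonal indices already extended to the same length.
reg_resids_adjusted = [None, 2.0, -4.0, 2.0]
indices = [1.25, 0.75, 1.25, 0.75]

# Residuals from a model fitted directly to the unadjusted series.
arim_resids = [None, 1.0, -3.0, 2.0]

reg_resids = reseasonalize(reg_resids_adjusted, indices)
print(rmse(reg_resids), rmse(arim_resids))
```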

If you want to go all the way and undifference your regression forecasts as well as reseasonalize them, you must save the predicted values (the default name is PREDICTED), then add the predicted differences to the previous level of the original variable (which in this case would be LAG(SADJUSTED,1)), and finally multiply by the seasonal indices to reseasonalize them. Thus, for example, you could use the "Generate Data" command to create a new column of data generated by the expression:

INDICES*(PREDICTED+LAG(SADJUSTED,1))
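The same calculation can be sketched in Python to make the order of operations explicit: first undifference, then reseasonalize. The function names and the numbers are assumptions for this example, and the indices are again assumed to be ratios around 1.

```python
def lag(x, k):
    """Shift a series down by k rows; the first k entries become None (missing)."""
    return [None] * k + x[: len(x) - k]


def reseasonalized_forecast(indices, predicted_diff, sadjusted):
    """Compute INDICES * (PREDICTED + LAG(SADJUSTED, 1)) row by row."""
    previous_level = lag(sadjusted, 1)
    out = []
    for s, p, prev in zip(indices, predicted_diff, previous_level):
        if p is None or prev is None:
            out.append(None)  # no forecast where an input is missing
        else:
            out.append(s * (p + prev))
    return out


# Hypothetical values: seasonally adjusted levels, predicted first
# differences from the regression, and extended seasonal indices.
sadjusted = [100.0, 110.0, 120.0, 130.0]
predicted = [None, 10.0, 10.0, 10.0]
indices = [1.25, 0.75, 1.25, 0.75]

forecast = reseasonalized_forecast(indices, predicted, sadjusted)
print(forecast)
```

Each forecast is a one-step-ahead value in the original (seasonal) units, because it builds on the previous period's actual adjusted level.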

Using seasonal differencing in the regression procedure: Don't try blind combinations of seasonal and nonseasonal lags and differences of the dependent variable in the regression procedure. If you are going to use seasonal and nonseasonal differences, I suggest that you first use the Forecasting procedure to identify an appropriate ARIMA model that uses no MA or SMA terms. Often you can substitute one or two AR terms for an MA(1) term, or an SAR(1) term for an SMA(1) term, without too much loss of efficiency. For example, suppose that an ARIMA(1,0,0)x(0,1,1) model is a good ARIMA model for a particular series. Then probably an ARIMA(1,0,0)x(1,1,0) model will also fit reasonably well, and the latter model can be replicated exactly in the regression procedure by regressing SDIFF(RSALES/CPI83,12) on LAG(...,1) of itself and LAG(...,12) of itself. You can then try adding other regressors to this basic model.
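To make the regression setup concrete, here is a minimal Python sketch that regresses the seasonal difference of a series on a constant, its own lag 1, and its own lag 12, using ordinary least squares. Everything here is illustrative: the function names are hypothetical, the least-squares solver is a bare-bones one written for this example, and a statistics package will normally estimate an ARIMA model by a somewhat different method. Note also that a multiplicative seasonal ARIMA model implies an additional lag-13 term, which this sketch, following the text, leaves out.

```python
def sdiff(y, period):
    """Seasonal difference y[t] - y[t - period]; the first `period` entries are None."""
    return [None] * period + [y[t] - y[t - period] for t in range(period, len(y))]


def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan elimination)."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def fit_seasonal_ar(y, period=12):
    """Regress SDIFF(y, period) on a constant, its lag 1, and its lag `period`."""
    z = sdiff(y, period)
    start = 2 * period  # first row where both lagged values exist
    X = [[1.0, z[t - 1], z[t - period]] for t in range(start, len(z))]
    target = [z[t] for t in range(start, len(z))]
    # Returns [constant, lag-1 coefficient, lag-`period` coefficient].
    return ols(X, target)
```

Other regressors could be added as extra columns of X, provided they are lagged and aligned to the same rows.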

If you do seasonally difference the dependent variable in the regression or GLM procedures--e.g., if you use SDIFF(RSALES/CPI83,12) as the dependent variable--then the error statistics of the model should be directly comparable to those of any time series model fitted to RSALES/CPI83 in the Forecasting procedure, without any further hassle.

If you wish to un-seasonally-difference the forecasts for any reason, you must add the predicted seasonal difference to the seasonally lagged series--i.e., you would use

PREDICTED+LAG(RSALES/CPI83,12)
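In Python terms, undoing the seasonal difference amounts to the following sketch. The function name and the short series (with a seasonal period of 2 to keep the example small) are assumptions for illustration only.

```python
def undo_seasonal_difference(predicted_sdiff, y, period):
    """Compute PREDICTED + LAG(y, period): add each predicted seasonal
    difference to the actual value one season earlier."""
    out = []
    for t, p in enumerate(predicted_sdiff):
        if p is None or t < period:
            out.append(None)  # no value one season earlier, or no prediction
        else:
            out.append(p + y[t - period])
    return out


# Hypothetical deflated series and predicted seasonal differences.
y = [10.0, 20.0, 13.0, 24.0]
predicted = [None, None, 2.5, 3.5]

levels = undo_seasonal_difference(predicted, y, period=2)
print(levels)  # forecasts in the units of the original series
```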

Log rather than deflate? If you decide to log rather than deflate by the CPI, and you need to make comparisons between the regression and forecasting procedures, I suggest that you manually LOG the input variable in every procedure so that all the output is in logged units. In other words, use LOG(RSALES) as the input variable in all procedures, rather than trying to specify "natural log" as a modeling option in some procedures. In all of the examples above, RSALES/CPI83 would then be replaced by LOG(RSALES).

Errors to avoid:

Do not try to use the DIFF or SDIFF operations on the input variables for the Descriptive Methods or Forecasting procedures. If you want to use differences as part of your time series models, this should normally be done as modeling options within these procedures.
Do not run BACKWARDS stepwise regression with dozens of variables--if you're "just fishing," use the forward stepwise option.
If you use seasonal adjustment, do it after deflating (if any) but before any differencing is performed. In other words, do not try to seasonally adjust a variable which has already been differenced--it may contain negative numbers, and the (multiplicative) seasonal adjustment algorithm is only meaningful for positive data. Also, remember that if you have logged the data as the first step, you should use additive rather than multiplicative seasonal adjustment.
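The last point can be checked with a small Python sketch: dividing by a seasonal index in the original units is the same as subtracting the log of that index in logged units, which is why logged data call for additive adjustment. The function names and numbers are made up for this example, and the indices are assumed to be known ratios around 1 rather than estimated from the data.

```python
import math


def adjust_multiplicative(y, indices):
    """Multiplicative seasonal adjustment: divide each value by its seasonal index."""
    return [v / s for v, s in zip(y, indices)]


def adjust_additive(y, effects):
    """Additive seasonal adjustment: subtract each period's seasonal effect."""
    return [v - e for v, e in zip(y, effects)]


# Hypothetical positive series and its seasonal indices.
y = [125.0, 75.0, 150.0, 90.0]
indices = [1.25, 0.75, 1.25, 0.75]

adjusted = adjust_multiplicative(y, indices)

# Same adjustment carried out on the logged series.
log_y = [math.log(v) for v in y]
log_effects = [math.log(s) for s in indices]
log_adjusted = adjust_additive(log_y, log_effects)

# Exponentiating the additive result recovers the multiplicative one
# (up to floating-point rounding).
print(adjusted, [math.exp(v) for v in log_adjusted])
```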