Recap: Varieties of (not-so-) simple regression models

Not-so-simple regression models

Overview
A word on Statgraphics implementation
1. Simple linear regression model
2. Power (a.k.a. multiplicative or constant-elasticity) model
3. Absolute change (differenced) model
4. Percentage change (logged & differenced) model
5. First-order autoregressive ("AR") model
6. Differenced first-order autoregressive model
7. Logged and differenced first-order AR model
8. Linear trend model
9. Exponential trend model
Additional comments

Overview. A "simple" regression model is a regression model with a single independent variable. However, such models are not always so simple--depending on how you choose the independent variable and on how you choose the transformations, if any, that are to be applied to one or both variables, they can be quite flexible and complex. Some of the possibilities are listed below. All of these models can be fleshed out into multiple regression models by adding more independent variables of the same general type.

The details of the models given below will probably look intimidating (if not bewildering!) at first. The good news is that you do not need to commit them to memory, nor should you plan on applying them all blindly or mechanically to new data. Your analysis should begin with a graphical exploration of the data to determine the qualitative properties of your variables: are they stationary or nonstationary? Do they show linear or compound growth, and is inflation a significant factor? Do they have a constant variance or increasing variance (heteroscedasticity)? And so on. By doing this you should develop some intuition about the transformations, if any, that would be appropriate to apply before attempting to fit a linear model. (When fitting regression models to time series data, it is often helpful to first stationarize the data through a combination of differencing, logging, and/or deflating.) In the end, a simple regression model is merely a straight line fitted to a scatterplot of one variable versus another. Your main task is to choose and adjust the variables in such a way that the hypothesis of a straight-line relationship is intuitively reasonable and is eventually confirmed by the model-fitting results.
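As a rough illustration of the stationarizing transformations mentioned above, here is a minimal Python sketch (not Statgraphics syntax). The series nominal_sales and price_index are made-up numbers chosen only for illustration.

```python
from math import log

# Hypothetical data: a nominal dollar series and a matching price index.
nominal_sales = [100.0, 108.0, 118.0, 127.0, 139.0]
price_index = [1.00, 1.03, 1.06, 1.09, 1.13]

# Deflate: convert nominal values to constant-dollar ("real") values.
real = [s / p for s, p in zip(nominal_sales, price_index)]

# Log: turns compound growth into linear growth and tends to stabilize variance.
logged = [log(v) for v in real]

# Difference: period-to-period change, the analogue of DIFF(LOG(...)).
log_diff = [cur - prev for prev, cur in zip(logged, logged[1:])]

print([round(v, 4) for v in log_diff])
```

Note that differencing shortens the series by one observation, which is why lagged and differenced variables produce missing values at the start of the sample.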


A word on Statgraphics implementation. In the notes below, "SGWIN Regression" means either the Simple Regression procedure, the Multiple Regression procedure, or else the General Linear Models procedure in the Advanced Regression module. The Advanced Regression module is available in version 2.1 only, and its General Linear Models procedure combines features of regression and ANOVA models: the independent variables may be either "categorical" or "quantitative", and interactions among categorical variables can be estimated. The Advanced Regression module also includes (shudder!) a "Regression Model Selection" procedure which computes all possible regressions.

The Forecasting procedure can also be used to fit regression models: when the model type is set to Mean or ARIMA, you are given the option to add regressor variables. (Caution: the latter feature does not work well in version 1.4, particularly when you are trying to use lagged or differenced variables. Also, including regression variables usually means that you must set the "number of forecasts" to zero on the data input panel to avoid getting a "missing data" error--you cannot ask a regression model to produce forecasts for future periods unless values for the independent variables have been specified for those periods!)

In both the Forecasting procedure and the Advanced Regression module, you have the option to hold out data for validation, which is normally a good idea. (In the Advanced Regression procedure, data is held out by using the "Select" field to select only a subset of the data for estimation. For example, you could type INDEX<81 in the Select field to select only the first 80 values for estimation--assuming that INDEX is an existing time index variable.) Both of these procedures also offer autocorrelation plots of residuals as a standard option, which is an important diagnostic tool for time series models. (In Advanced Regression, an autocorrelation plot is a right-mouse-button option for the residual plot.) To produce an autocorrelation plot of residuals when using one of the standard regression procedures, you must save the residuals to the data spreadsheet and then run the Time Series/Descriptive_Methods procedure using "Residuals" as the input variable.

1. SIMPLE LINEAR REGRESSION MODEL

Assumption: "Y is a linear function of X"

Equation: Ŷ(t) = a + bX(t)

How to interpret the coefficients: a is the intercept and b is the slope on a plot of predicted Y versus X

How to fit in SGWIN Regression: regress Y on X.

How to fit in SGWIN Forecasting: use Y as the input variable and specify a Mean model with X as a regressor.

When to use: when X and Y are stationary time series or cross-sectional (non-time-series) variables, and a scatter plot of Y versus X suggests a significant linear relationship.

2. POWER (a.k.a. MULTIPLICATIVE or CONSTANT-ELASTICITY) MODEL

Assumption: "Y is proportional to some power of X", or "the elasticity of Y with respect to X is constant"

Equation: Ŷ(t) = c(X(t)^b)

...or... log(Ŷ(t)) = a + b log(X(t))

...where a = log(c).

How to interpret the coefficients: EXP(a) is the multiplicative constant (scaling factor) and b is the estimated elasticity of Y with respect to X, which is defined as (X/Y)(dY/dX)

How to fit in SGWIN Regression: regress LOG(Y) on LOG(X).

How to fit in SGWIN Forecasting: use Y as the input variable and specify a Mean model with a natural log transformation and LOG(X) as a regressor. The latter approach will produce tables and plots of the forecasts in unlogged terms. (Note that when you click the "Natural Log" button, this affects only the dependent variable--you have to log the regressor variable "manually" by explicitly using the LOG( ) function.)

When to use: multiplicative regression models are often used to measure elasticities of demand with respect to variables such as price or advertising expenditures. (Caution: if X and Y are nonstationary time series, you may want to deflate them first--if they are measured in nominal currency--or consider the differenced form of this model--see #4 below.)

3. ABSOLUTE CHANGE (DIFFERENCED) MODEL

Assumption: "The absolute change in Y is a linear function of the absolute change in X"

Equation: (Ŷ(t) - Y(t-1)) = a + b(X(t) - X(t-1))

How to interpret the coefficients: a is the predicted change in Y in the absence of a change in X, and b is the constant of proportionality between changes in X and changes in Y

How to fit in SGWIN Regression: regress DIFF(Y) on DIFF(X).

How to fit in SGWIN Forecasting: use Y as the input variable, specify an ARIMA model with 1 nonseasonal difference, constant=ON, and DIFF(X) as a regressor. The latter approach will produce reports and plots of the forecasts in undifferenced terms.

When to use: when X and Y are nonstationary time series that do not exhibit nonlinear trends and/or heteroscedasticity.

4. PERCENTAGE CHANGE (LOGGED & DIFFERENCED) MODEL

Assumption: "The percentage change in Y is a linear function of the percentage change in X"

Equation: (Ŷ(t) - Y(t-1))/Y(t-1) = a + b(X(t) - X(t-1))/X(t-1)

...or, approximately (for small changes), in terms of natural logs... log(Ŷ(t)) - log(Y(t-1)) = a + b(log(X(t)) - log(X(t-1)))

How to interpret the coefficients: a is the predicted percentage change in Y in the absence of a change in X, and b is the constant of proportionality between percentage changes in X and percentage changes in Y

How to fit in SGWIN Regression: regress DIFF(LOG(Y)) on DIFF(LOG(X)).

How to fit in SGWIN Forecasting: use Y as the input variable, specify an ARIMA model with a natural log transformation, 1 nonseasonal difference, constant=ON, and DIFF(LOG(X)) as a regressor. The latter approach will produce reports and plots of forecasts in unlogged, undifferenced terms.

When to use: when X and Y are nonstationary time series with nonlinear trends and/or heteroscedasticity--e.g., series with inflationary or compound growth such as stock prices. For example, if S is the price index for a particular stock and M is the price index for the entire market, then the "beta" of the stock is the slope coefficient in the regression of DIFF(LOG(S)) on DIFF(LOG(M)).

Note: When working with stock price data, it is often more illuminating to plot the predictions in transformed units (i.e., in terms of returns) rather than in original units (i.e., the level of price index), because this highlights your model's ability (if any!) to predict non-trivial changes in the expected returns from one period to the next. In this case, you may wish to use DIFF(LOG(S)) as the input variable rather than specifying an ARIMA model with a nonseasonal difference and a natural log transformation.

5. FIRST-ORDER AUTOREGRESSIVE ("AR") MODEL

Assumption: "The level of Y is a linear function of the previous level"

Equation: Ŷ(t) = a + bY(t-1)

How to interpret the coefficients: a/(1-b) is the mean of Y, and b is the "decay factor" which determines how fast Y tends to revert to its mean following a random shock. (If b is close to zero, the mean-reversion is rapid; if b is close to one, the mean-reversion is slow.) This can be seen by rewriting the model in terms of deviations from the mean: Ŷ(t) - m = b(Y(t-1) - m), where m is the mean of Y.

How to fit in SGWIN Regression: regress Y on LAG(Y,1).

How to fit in SGWIN Forecasting: use Y as the input variable and specify an ARIMA model with no differences, AR=1, and constant=ON. The estimated CONSTANT in the model output is "a" while the estimated MEAN is "m", where a/(1-b)=m as noted above. These two parameters are different (only) when an AR term is used in a model.

When to use: when Y is a stationary time series (e.g., a mean-reverting time series).

Note: if the estimated slope coefficient (b) is not significantly different from one, this is essentially a random walk model with growth.

6. DIFFERENCED FIRST-ORDER AUTOREGRESSIVE MODEL

Assumption: "The absolute change in Y is a linear function of the previous absolute change"

Equation: (Ŷ(t) - Y(t-1)) = a + b(Y(t-1) - Y(t-2))

How to interpret the coefficients: a/(1-b) is the average trend in Y (i.e., the mean of DIFF(Y)) and b is the decay factor which determines how fast the trend returns to its average value following a random shock.

How to fit in SGWIN Regression: regress DIFF(Y) on LAG(DIFF(Y),1).

How to fit in SGWIN Forecasting: use Y as the input variable and specify an ARIMA model with 1 nonseasonal difference, AR=1, and constant=ON. The latter approach will produce reports and graphs of forecasts in undifferenced terms.

When to use: when Y is a nonstationary time series whose first difference is positively autocorrelated or slightly negatively autocorrelated--i.e., a "trend-reverting" series--without pronounced heteroscedasticity.

Note: if the estimated slope coefficient (b) is not significantly different from zero, this is essentially a random walk model with growth. If the constant (a) is zero, this model is somewhat similar to a simple exponential smoothing model. If the differenced series shows strong negative autocorrelation at lag 1--say, -0.3 to -0.5--a simple exponential smoothing model may actually fit better. If the differenced series shows very strong negative autocorrelation at lag 1--say, below -0.5--then it may have been differenced unnecessarily.

7. LOGGED AND DIFFERENCED FIRST-ORDER AR MODEL

Assumption: "The percentage change in Y is a linear function of the previous percentage change"

Equation: (Ŷ(t) - Y(t-1))/Y(t-1) = a + b(Y(t-1) - Y(t-2))/Y(t-2)

...or, approximately (for small changes), in terms of natural logs... log(Ŷ(t)) - log(Y(t-1)) = a + b(log(Y(t-1)) - log(Y(t-2)))

How to interpret the coefficients: a/(1-b) is the average percentage increase in Y, and b is the decay factor which determines how fast the percentage increase returns to its average value following a random shock

How to fit in SGWIN Regression: regress DIFF(LOG(Y)) on LAG(DIFF(LOG(Y)),1).

How to fit in SGWIN Forecasting: use Y as the input variable and specify an ARIMA model with a natural log transformation, 1 nonseasonal difference, AR=1, and constant=ON. The latter approach will produce reports and graphs of forecasts in undifferenced, unlogged terms.

When to use: when Y is a nonstationary time series whose first difference is positively autocorrelated or slightly negatively autocorrelated, with pronounced heteroscedasticity--i.e., a series which is "trend-reverting" in terms of percentage growth.

8. LINEAR TREND MODEL

Assumption: "Y is a linear function of time"

Equation: Ŷ(t) = a + bt

How to interpret the coefficients: a is the estimated value of the series at time 0 (if this means anything!) and b is the trend--i.e., the average increase from one period to the next.

How to fit in SGWIN Regression: regress Y on INDEX ...assuming you have created a time index variable called INDEX

How to fit in SGWIN Forecasting: specify a Linear Trend model.

When to use: when you are really desperate for a linear trend projection and have little data to work with. (A random walk model with growth, a linear exponential smoothing model, or an ARIMA model with a nonseasonal difference and constant=ON is usually superior for extrapolating trends.)

9. EXPONENTIAL TREND MODEL

Assumption: "Y is an exponential function of time"

Equation: Ŷ(t) = exp(a + bt)

...or... log(Ŷ(t)) = a + bt

How to interpret the coefficients: EXP(a) is the estimated value of the series at time 0 (if this means anything) and b is the average percentage increase from one period to the next.

How to fit in SGWIN Regression: regress LOG(Y) on INDEX

How to fit in SGWIN Forecasting: specify an Exponential Trend model or equivalently a Linear Trend model with a natural log transformation.

When to use: when you are really desperate for an exponential trend projection and have little data to work with. (A random walk model with growth, a linear exponential smoothing model, or an ARIMA model with a nonseasonal difference and constant=ON with a natural log transformation applied to Y is usually superior for extrapolating exponential trends.)

Additional comments:

(i) These are not "naive" (one-parameter, constant-only) models such as the random walk. Each of these models has two parameters to be estimated: a constant and a coefficient of something else.

(ii) Note that models 5-9 do not use an "X" variable (i.e., another distinct time series). Hence they can be bootstrapped into the future to produce forecasts arbitrarily far ahead when they are implemented in the Forecasting procedure in SGWIN. (The confidence intervals will typically get wider as the forecasting horizon is lengthened.) Models 5-7 are actually special cases of ARIMA models.

(iii) Trend line models such as models 8 and 9 (which use the time index as a variable) should hardly ever be used for purposes of forecasting--a random walk, exponential smoothing, or autoregressive model is nearly always better.

(iv) If X and Y are both non-stationary time series, then a simple (untransformed) regression of Y on X will probably fit poorly, despite yielding an impressive R-squared. A random walk model or a differenced model (e.g., DIFF(Y) on DIFF(X)) or a multiple regression model with lagged variables will almost certainly be better. Don't be discouraged by a low R-squared value for such models: this merely reflects the fact that most of the original variance is removed by differencing. Focus on the standard error of the estimate (root-mean-square error) instead.
