We will be interested in forecasting Rt as a function of lagged information Zt-1. It is logical to start with a linear regression model. Later we discuss the generalization of this linear model using nonparametric density estimation techniques.
The linear regression model with a single explanatory variable is:
Rt = d0(Z0) + d1(Z1,t-1) + residualt [1]
where d0, d1 are regression coefficients.
This is often presented as
Rt = d0 + d1(Z1,t-1) + residualt [2]
The d0 is interpreted as the intercept and the d1 as the slope coefficient. Equations [1] and [2] are identical. Remember we have a single explanatory variable. It turns out that, in the standard implementation of regression, Z contains two variables: Z1 might be an interest rate level and Z0 is a constant vector of ones. In a spreadsheet, one can think of the first column as the returns, say from January 1970 through December 1994, the second column as a "1" in every row, and the third column as the interest rate from December 1969 through November 1994 (it is lagged one month). Notice I have no time subscript on Z0 because it is just a column of ones.
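To make the layout concrete, here is a minimal Python sketch of the same setup. All of the numbers below are made up for illustration; the sample size, coefficients, and simulated series are assumptions, not the data described above.

```python
# Sketch of the spreadsheet layout described above, with made-up numbers.
# Column 1: the returns R; column 2: a constant of ones (Z0);
# column 3: the interest rate lagged one period (Z1 at t-1).
import numpy as np

rng = np.random.default_rng(0)
T = 300                                            # e.g. 300 monthly observations (illustrative)
rate = 0.05 + 0.01 * rng.standard_normal(T + 1)    # interest rate, one extra early observation
lagged_rate = rate[:-1]                            # the lagged series lines up with the returns
R = 0.01 + 0.5 * lagged_rate + 0.02 * rng.standard_normal(T)  # made-up returns

Z = np.column_stack([np.ones(T), lagged_rate])     # [Z0, Z1 at t-1]
d, *_ = np.linalg.lstsq(Z, R, rcond=None)          # least squares: d = INV(Z'Z) Z'R
print("intercept d0:", d[0], "slope d1:", d[1])
```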
Suppose we ran the following regression:
Rt = d0(Z0) + residualt [3]
This is a regression on the column of ones. What is d0 in this case? It is just the average return. It is also an equally weighted average return. According to regression theory, the coefficient is
d0 = INV(Z'Z)Z'R [4]
where Z is just a column of ones. This can be broken down into two parts.
INV(Z'Z) = INV(#obs) = 1/#obs
Z'R = SUM(returns)
Hence, it is obvious that d0 is the average return, i.e. the sum of the returns divided by the number of observations!
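A tiny numerical check of this point, using a handful of made-up monthly returns:

```python
import numpy as np

R = np.array([0.02, -0.01, 0.03, 0.00, 0.01])  # a few made-up monthly returns
Z = np.ones((len(R), 1))                       # Z0: a column of ones

d0 = np.linalg.inv(Z.T @ Z) @ Z.T @ R          # INV(Z'Z) Z'R, as in [4]
print(d0[0], R.mean())                         # both are the average return
```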
Why are we focusing on this trivial regression? Well, the traditional style of asset management uses average returns (as well as variances and covariances) as inputs to mean-variance optimization. Sometimes, moving-window averages (MA) are used, say over the last five years. In this case, Z0 would have zeros in the initial rows and "1"s in the last 60 rows (assuming monthly data is used). Sometimes, exponentially weighted moving averages (EWMA) are used. Again, we can set Z0 to handle this.
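Here is a short sketch of the moving-window idea, again with made-up returns. The exponentially weighted average at the end is computed directly as a weighted average for comparison; the decay factor of 0.97 is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(1)
R = 0.01 + 0.04 * rng.standard_normal(300)     # made-up monthly returns

# Moving-window average: Z0 has zeros in the early rows and ones in the
# last 60 rows, so the coefficient is the average of the last 60 returns.
Z0 = np.zeros(len(R))
Z0[-60:] = 1.0
d0 = (Z0 @ R) / (Z0 @ Z0)                      # INV(Z0'Z0) Z0'R
print(d0, R[-60:].mean())                      # identical

# A direct exponentially weighted moving average for comparison
# (the decay factor 0.97 is an arbitrary illustrative choice).
lam = 0.97
w = lam ** np.arange(len(R))[::-1]             # heavier weight on recent months
print((w @ R) / w.sum())
```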
What is the R-square of the regression in [3]? Remember, the definition of R-square is the variance of the regression fitted values divided by the variance of the dependent variable. An R-square of 1.0 or 100% implies that the fitted values exactly coincide with the realized returns.
R-square = Var(fitted)/Var(R) = Var(d0)/Var(R) = 0
The R-square is zero. Why? The variance of a constant, d0, is exactly zero. Remember the definition of variance: it is the average squared deviation of the variable from its mean. Since the fitted value is always equal to the same constant, there is no variance.
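The same conclusion can be verified numerically; the returns below are made up:

```python
import numpy as np

R = np.array([0.02, -0.01, 0.03, 0.00, 0.01])  # made-up returns
fitted = np.full_like(R, R.mean())             # fitted values from [3]: every one equals d0

print(fitted.var() / R.var())                  # Var(fitted)/Var(R) = 0.0
```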
Another way of looking at this exercise is to note that those using this style of model are assuming that no other Z variable influences future returns. In fact, in running this special regression (and, indeed, you do not need to run a regression, you simply need to push the average button), they are assuming the d1 and other coefficients are exactly equal to zero.
Using the average as a forecast forces the asset manager to implement a strategy with a zero R-square. This is not necessarily a desirable strategy. Indeed, it implies that no other information affects expected returns. It implies that expected returns are constant (at least over the 60-month window of the MA).
Using a more general regression model, we can incorporate predictability. We can execute statistical tests to ensure that the predictability is genuine rather than an artifact of data snooping. The research protocol details procedures that guard against the potential misspecifications discussed below.
Heteroskedasticity occurs when the variance of the error term changes through time or across a cross-section of data. As a result, the least squares estimator will be unbiased but inefficient, i.e. you get the right point estimates for the parameters on average, but the variances of the estimated parameters are not the minimum possible variances.
The correction for heteroskedasticity is straightforward and involves weighted least squares. There are a number of approaches to this correction. Basically, each variable is transformed (often by dividing by a measure of the error's standard deviation), and the regression is reestimated.
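A minimal sketch of this kind of transformation, assuming the error standard deviation for each observation is known up to proportionality (here, purely for illustration, it is taken to be proportional to the explanatory variable; the data and the function name weighted_least_squares are hypothetical):

```python
import numpy as np

def weighted_least_squares(y, X, sigma):
    """Divide every variable by the error standard deviation for that
    observation, then re-run ordinary least squares on the transformed data."""
    b, *_ = np.linalg.lstsq(X / sigma[:, None], y / sigma, rcond=None)
    return b

# Illustrative use with made-up data whose error standard deviation grows with x.
rng = np.random.default_rng(2)
n = 200
x = rng.uniform(1.0, 5.0, n)
y = 1.0 + 2.0 * x + x * rng.standard_normal(n)   # error std proportional to x
X = np.column_stack([np.ones(n), x])
print(weighted_least_squares(y, X, sigma=x))     # weights treated as known here
```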
Autocorrelation or serial correlation is commonplace in time series regressions. Autocorrelation implies that the errors in previous periods carry over to the present period. Like heteroskedasticity, an autocorrelated regression will have unbiased but inefficient estimators. In fact, the variance of the regression coefficients will be underestimated, leading one to falsely believe some parameters are statistically significant. Furthermore, if the model is used to forecast, the predictions will be inefficient (i.e. unnecessarily large sampling variances because we are not using important information -- in the previous error terms).
The solution strategy is to transform the regression variables. Suppose we have the following bivariate regression:
Yt = d0 + d1 Xt + resid1t
and suppose the error follows a first-order autoregressive process:
resid1t = r0 + r1(resid1t-1) + resid2t
where r1 is less than one in absolute value and resid2 is normally distributed with constant variance. Transform the regression by subtracting r1 times its own lagged value (a quasi-difference rather than a simple first difference):
Yt - r1Yt-1 = d0(1 - r1) + d1(Xt - r1Xt-1) + (resid1t - r1resid1t-1)
or, since resid1t - r1resid1t-1 = r0 + resid2t,
Y*t = d0* + d1X*t + resid2t
where Y*t = Yt - r1Yt-1, X*t = Xt - r1Xt-1, and d0* = d0(1 - r1) + r0.
Now the model is properly specified and the estimation can proceed as usual.
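A minimal sketch of this transformation, with simulated data and with r1 treated as known (in practice r1 would be estimated from the OLS residuals, Cochrane-Orcutt style; the function name quasi_difference and all numbers are illustrative):

```python
import numpy as np

def quasi_difference(y, x, r1):
    """Y*_t = Y_t - r1*Y_{t-1} and X*_t = X_t - r1*X_{t-1};
    the first observation is lost in the transformation."""
    return y[1:] - r1 * y[:-1], x[1:] - r1 * x[:-1]

# Made-up data with an AR(1) error (r0 = 0, r1 = 0.7).
rng = np.random.default_rng(3)
T = 300
x = rng.standard_normal(T)
u = np.zeros(T)
for t in range(1, T):
    u[t] = 0.7 * u[t - 1] + rng.standard_normal()
y = 1.0 + 2.0 * x + u

# In practice r1 is estimated from the OLS residuals; here it is taken as given.
y_star, x_star = quasi_difference(y, x, r1=0.7)
Z = np.column_stack([np.ones(len(x_star)), x_star])
d, *_ = np.linalg.lstsq(Z, y_star, rcond=None)
print(d)                                         # roughly [d0*(1 - r1), d1] = [0.3, 2.0]
```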
One of the basic assumptions was that no exact linear dependence exists among the independent variables. The reason this is important is that we need to invert the matrix X'X to get the least squares estimator of the coefficients. The problem is not the case where two independent variables are exactly the same -- in that case X'X cannot be inverted and the computer simply cannot estimate the coefficients, so the problem announces itself. Multicollinearity arises when some of the independent variables are close to being the same.
The main consequence of multicollinearity is that the precision of the estimates deteriorates. It becomes very difficult to determine the relative influences of the independent variables. Investigators may be falsely led to drop variables whose coefficients appear insignificantly different from zero. Furthermore, the coefficient estimates could be sensitive to the block of data used, i.e. the first subperiod could deliver parameter estimates that are different from those of the second subperiod.
The usual solution strategy is to calculate the correlation matrix of all the independent variables. If two variables have a high degree of correlation, judgement should be used to determine which one to drop from the regression.
Another possibility is orthogonalization. Suppose Z1 and Z2 are correlated but you do not want to drop one of the variables. One can regress Z2 on Z1 and save the residuals. A regression could then be run on Z1 and these residuals. The interpretation of the residuals is that they are the part of Z2 that is uncorrelated with Z1.
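A short sketch of both solution strategies -- inspecting the correlation matrix and orthogonalizing -- using made-up, highly correlated variables (the function name orthogonalize is hypothetical):

```python
import numpy as np

def orthogonalize(z2, z1):
    """Regress Z2 on Z1 (with a constant) and return the residuals:
    the part of Z2 that is uncorrelated with Z1."""
    A = np.column_stack([np.ones(len(z1)), z1])
    b, *_ = np.linalg.lstsq(A, z2, rcond=None)
    return z2 - A @ b

# Made-up, highly correlated explanatory variables.
rng = np.random.default_rng(4)
z1 = rng.standard_normal(200)
z2 = 0.9 * z1 + 0.1 * rng.standard_normal(200)
print(np.corrcoef(z1, z2))                       # step 1: inspect the correlation matrix

z2_resid = orthogonalize(z2, z1)
print(np.corrcoef(z1, z2_resid)[0, 1])           # essentially zero by construction
# The returns could now be regressed on a constant, z1, and z2_resid.
```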
Omitting a relevant variable is a common specification error. In general, the parameter estimates will be biased as a result of omitting important variables. The only case where the bias disappears is if the omitted variable is uncorrelated with the included variables -- this case, however, is unlikely. If the omitted variable has a positive covariance with variable Xi (and enters the true model with a positive coefficient), then the parameter estimate di will be biased upward. The omitted variable problem also affects efficiency. Inference about the coefficients will be wrong because the residual variance is biased upward.
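A small simulation makes the direction of the bias concrete. All numbers are made up; the true coefficient on x1 is 1.0.

```python
import numpy as np

# The true model has two regressors; x2 is omitted from the estimation.
# Because x2 is positively correlated with x1 and enters with a positive
# coefficient, the estimate on x1 is biased upward.
rng = np.random.default_rng(5)
n = 5000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + rng.standard_normal(n)           # positively correlated with x1
y = 1.0 + 1.0 * x1 + 1.0 * x2 + rng.standard_normal(n)

X_short = np.column_stack([np.ones(n), x1])      # x2 omitted
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(b_short[1])                                # about 1.8 rather than the true 1.0
```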
The errors in the variables problem arises when one or more of the independent variables are measured with error. In this case, the parameter estimates will be biased and the degree of bias depends on the variance of the measurement error.
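A small simulation of this effect, with made-up data and illustrative noise levels; the slope estimate is pulled toward zero as the measurement-error variance rises:

```python
import numpy as np

# The regressor is observed with noise; the slope estimate shrinks toward
# zero, and the shrinkage grows with the measurement-error variance.
rng = np.random.default_rng(6)
n = 5000
x_true = rng.standard_normal(n)
y = 1.0 + 2.0 * x_true + rng.standard_normal(n)

for noise_sd in (0.0, 0.5, 1.0):                 # illustrative error magnitudes
    x_obs = x_true + noise_sd * rng.standard_normal(n)
    X = np.column_stack([np.ones(n), x_obs])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(noise_sd, b[1])                        # slope falls from about 2.0 toward 1.0
```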
The usual solution strategy is to opt for an instrumental variables estimator rather than ordinary least squares. The properties of this estimator are beyond the scope of this note.
Conditional heteroskedasticity occurs when the variance of the error term changes through time. Many financial time-series exhibit heteroskedasticity; examples include interest rates and stock returns, whose volatilities change through time. Using ordinary least squares will deliver the correct estimates for the coefficients, but the standard errors and t-statistics will be incorrect. If you are drawing inferences about the coefficients, then you must have the correct standard errors.
It is important to check for heteroskedasticity and to correct for it when it exists. Unfortunately, most software packages (like Statgraphics) do not deliver corrected standard errors. However, the Fuqua version of Statgraphics has been modified to allow us to obtain heteroskedasticity-consistent standard errors.
The best method of detection involves saving the residuals from a regression and plotting the residuals against time. If there is an obvious pattern, then it is likely that there is a conditional heteroskedasticity problem. Here are two more sophisticated tests.
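A minimal sketch of this detection step, using simulated data whose error variance grows over time (all numbers are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up regression whose error variance grows over time.
rng = np.random.default_rng(7)
T = 300
x = rng.standard_normal(T)
y = 1.0 + 2.0 * x + np.linspace(0.5, 2.0, T) * rng.standard_normal(T)

X = np.column_stack([np.ones(T), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b                                # save the residuals ...

plt.plot(resid)                                  # ... and plot them against time;
plt.xlabel("time")                               # a widening band suggests heteroskedasticity
plt.ylabel("residual")
plt.show()
```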
This is a test for Autoregressive Conditional Heteroskedasticity or ARCH. There is no statistics package available at this time that corrects for this type of heteroskedasticity. But you should be aware of this form -- since many financial time-series exhibit ARCH disturbances.
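As an illustration only, here is a sketch of one common LM-style version of an ARCH test: regress the squared residuals on a few of their own lags and compare T times the R-square to a chi-square critical value. The function name arch_lm_test and the lag length q = 4 are assumptions for the sketch, not steps taken from this note.

```python
import numpy as np

def arch_lm_test(resid, q=4):
    """Regress squared residuals on q of their own lags; under the null of
    no ARCH, T * R-square is approximately chi-square with q degrees of freedom."""
    e2 = resid ** 2
    T = len(e2) - q
    y = e2[q:]
    lags = [e2[q - j - 1: len(e2) - j - 1] for j in range(q)]
    X = np.column_stack([np.ones(T)] + lags)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1.0 - ((y - X @ b) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return T * r2                                # compare with a chi-square(q) critical value

# With residuals that have no ARCH, the statistic should be small:
rng = np.random.default_rng(8)
print(arch_lm_test(rng.standard_normal(500), q=4))
```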
This is a popular test for Conditional Heteroskedasticity. The steps are as follows:
There are several important references for corrections. White (1980, Econometrica) provides the most widely used correction (it is implemented in Statgraphics, SAS, and RATS). Hansen (1982, Econometrica) provides a more general correction of which White is a special case. Newey and West (1987, Econometrica) provide an alternative to Hansen (in some situations Hansen's correction will not work; however, you will know that it has not worked because the estimation routine will fail). Finally, Andrews (1991, Econometrica) provides the state-of-the-art correction. My recent research employs the Andrews correction.
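For concreteness, here is a minimal sketch of the basic White (1980) heteroskedasticity-consistent covariance estimator. It includes no small-sample adjustment and no autocorrelation correction of the Hansen, Newey-West, or Andrews type; the function name is hypothetical.

```python
import numpy as np

def white_standard_errors(y, X):
    """OLS coefficients with White (1980) heteroskedasticity-consistent
    standard errors: Var(b) = INV(X'X) X'diag(e^2)X INV(X'X)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = X.T @ (X * (e ** 2)[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return b, np.sqrt(np.diag(cov))

# Usage: b, se = white_standard_errors(R, Z) for whatever returns R and
# regressor matrix Z (including the column of ones) are being studied.
```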