
The
question is often asked: "what's a good value for R-squared?"
Sometimes the claim is even made: "a model is not useful unless its
R-squared is at least x", where x may be some fraction greater than 50%.
By this standard, the model we fitted to the differenced, deflated, and
seasonally adjusted auto sales series is disappointing: its R-squared is less
than 25%. So what IS a good value for R-squared? The correct answer to this
question is polite laughter followed by: "That depends!"

The term
R-squared refers to the *fraction of variance explained* by a model,
but--what is the relevant variance that demands explanation? We have seen by
now that there are many *transformations* that may be applied to a
variable before it is used as a dependent variable in a regression model:
deflation, logging, seasonal adjustment, differencing. All of these transformations
will change the variance and may also change the *units*
in which variance is measured. Deflation and logging may dramatically change
the units of measurement, while seasonal adjustment and differencing generally
reduce the variance significantly when properly applied. Therefore, if the
dependent variable in the regression model has already been transformed in some
way, it is possible that much of the variance has already been
"explained" merely by the choice of an appropriate transformation.
Seasonal adjustment obviously tries to explain the seasonal component of the
original variance, while differencing tries to explain changes in the local
mean of the series over time. With respect to which variance should R-squared
be measured--that of the original series, the deflated series, the seasonally
adjusted series, and/or the differenced series? This question does not always
have a clear-cut answer, and as we will see below, there are usually several
reference points that may be of interest in any particular case.

In
analyzing the auto sales series, we have used the period 1970-1993 as the
estimation period. The variance of the original series over this period is 101.1,
measured in units of nominal bazillion dollars squared. (Remember that variance
is measured in *squared* units, because it is the average squared
deviation from the mean. A bazillion is a large number like a billion squared.)
Now, in general, nominal dollars is not a good unit in which to measure
variance because the units become inflated over time. Thus, for example, the
variance of the second half of the series is much greater than the variance of
the first half, if only because of the effect of inflation. When we are
building forecasting models, we are usually looking for statistical properties
of the data that are relatively *constant*, which gives us a basis for
extrapolating them into the future. If statistics such as the variance are
changing systematically over time, we usually look for transformations (such as
deflating or logging) which will make them more constant.

The
initial transformation that seems most appropriate for the auto sales series is a **deflation**
transformation--i.e., dividing by the CPI. The variance of the deflated series
is 16.4, measured in units of 1983 bazillion dollars squared. We might be
tempted to say we have already explained a great deal of the variance--and in a
sense we have, because much of the variance in the original series is due to
inflationary growth--but we can't really directly compare 101.1 against 16.4,
because the latter is in arbitrary units which depend on the choice of a base
year. However, if we now fit models to the deflated data, it will be meaningful
to speak of the fraction of 16.4 that we have managed to explain.
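The effect of inflation on variance, and the way deflation removes it, can be sketched with made-up numbers. The series below is hypothetical and is not the auto sales data: it assumes a "real" quantity with a fixed seasonal swing and a price index that grows 0.5% per month, and it uses the population variance (dividing by n), which is also an assumption.

```python
import math
from statistics import pvariance

# Hypothetical monthly data, NOT the auto sales series: 24 years of a
# "real" quantity with a fixed seasonal swing, inflated by a made-up
# price index that grows 0.5% per month.
n_months = 24 * 12
real = [10 + 2 * math.sin(2 * math.pi * t / 12) for t in range(n_months)]
cpi = [1.005 ** t for t in range(n_months)]
nominal = [r * p for r, p in zip(real, cpi)]


def half_variances(series):
    """Return (variance of first half, variance of second half)."""
    mid = len(series) // 2
    return pvariance(series[:mid]), pvariance(series[mid:])


# Deflation: divide the nominal series by the price index.
deflated = [x / p for x, p in zip(nominal, cpi)]

nominal_first, nominal_second = half_variances(nominal)
deflated_first, deflated_second = half_variances(deflated)

print(f"nominal:  {nominal_first:.2f} vs {nominal_second:.2f}")
print(f"deflated: {deflated_first:.2f} vs {deflated_second:.2f}")
```

In this toy setup the nominal variance of the second half is several times that of the first half, while the two halves of the deflated series have essentially the same variance. That constancy is the property we look for before extrapolating.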

First,
let's consider fitting some naive random-walk and random-trend models. A random
walk or random trend model is just a "constant" model fitted to some
*difference* of the data--in particular, the random walk model (with growth)
is just the constant model fitted to the *first* difference of the data,
the seasonal random walk model is the constant model fitted to the *seasonal*
difference of the data, and the seasonal random trend model is the constant
model fitted to the first difference of the seasonal difference. (The constant
is usually assumed to be zero in the latter model, but I'll ignore that minor
complication here.) In all of these models, the mean squared error is
essentially just the variance of the differenced series. Now, in principle we
could use the Multiple Regression procedure to fit these models, by using the
appropriately differenced series as the dependent variable and estimating a
constant only (i.e., no other independent variables). Technically, the
R-squared for these models would be *zero*, since the mean squared error
would be precisely equal to the variance of the dependent variable. But it is
obviously more meaningful to calculate an "effective" R-squared
relative to the variance of the undifferenced (but
deflated) series. The following table shows the results of such calculations:

| MODEL | MSE | R-squared |
|---|---|---|
| Constant model for AUTOSALE: MSE=VAR(AUTOSALE) | 101.106 | |
| Constant model for AUTOSALE/CPI: MSE=VAR(AUTOSALE/CPI) | 16.4151 | 0.00% |
| SRW model for AUTOSALE/CPI: MSE=VAR(SDIFF(AUTOSALE/CPI,12)) | 5.29805 | 67.72% |
| RW model for AUTOSALE/CPI: MSE=VAR(DIFF(AUTOSALE/CPI)) | 3.44537 | 79.01% |
| SRT model for AUTOSALE/CPI: MSE=VAR(DIFF(SDIFF(AUTOSALE/CPI,12))) | 2.8189 | 82.83% |

The formula for the
effective R-squared calculations for these models is as follows:

**R-squared = 1 -
MSE/VAR(AUTOSALE/CPI)**

where MSE is the mean
squared error of the model. Notice that the R-squared values for these naive
models are quite respectable, ranging from 68% to almost 83%. The best of these
models is the seasonal random trend model, which tracks the seasonality as well
as the cyclical variations in the series.
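As a rough check, the effective R-squared values can be recomputed from the MSE figures reported in the table above. The function names `diff` and `effective_r_squared` are illustrative, not part of any statistical package, and the toy series is made up purely to show what the differencing operations do. The population variance (dividing by n) is assumed here; the original calculations may have used a different divisor.

```python
from statistics import pvariance


def diff(series, lag=1):
    """Difference at the given lag: DIFF(x) is lag 1, SDIFF(x, 12) is lag 12."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def effective_r_squared(mse, reference_variance):
    """Fraction of the reference variance explained: 1 - MSE / VAR."""
    return 1.0 - mse / reference_variance


# Toy illustration of differencing (made-up numbers).
toy = [1.0, 3.0, 6.0, 10.0, 15.0]
toy_diff = diff(toy)
# The MSE of a constant-only fit to the differences is their variance.
toy_mse = pvariance(toy_diff)

# MSE values as reported in the table above.
var_deflated = 16.4151  # VAR(AUTOSALE/CPI)
reported_mse = {"SRW": 5.29805, "RW": 3.44537, "SRT": 2.8189}

r2 = {
    name: effective_r_squared(mse, var_deflated)
    for name, mse in reported_mse.items()
}
for name, value in r2.items():
    print(f"{name}: {value:.2%}")
```

A seasonal random trend series would be obtained as `diff(diff(x, 12))`, i.e., the first difference of the seasonal difference.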

Our next step in
fitting a fancier model is to use **multiplicative seasonal adjustment** on the
deflated series. The variance of the seasonally adjusted, deflated series turns
out to be 13.49, which is the MSE that would be obtained by applying the
constant model to the seasonally adjusted, deflated series. Thus, by adding
seasonal adjustment to the constant model for the deflated series, we thereby
"explain" 17.81% of its variance. But if we fit a *random walk*
model (with growth) to the seasonally adjusted, deflated series, the MSE drops
to 1.90--i.e., we have explained 88.42% of the variance in the undifferenced series. (Note that this is slightly higher
than the fraction of variance explained by the seasonal random trend model: the
combination of seasonal adjustment and a nonseasonal
random walk model appears to do a better job of
tracking the average seasonal pattern.) Also, recall that our first regression
model fitted to the adjusted, deflated data was a simple regression of
AUTOADJ/CPI on INCOME/CPI. This model yielded an MSE of 4.1016 and an R-squared
of around 75%, which was inferior to the random walk model.

We now proceed to
fit several models in which DIFF(AUTOADJ/CPI) is
regressed on lags of itself and other variables. A forward stepwise regression
adds the 1st lag of the dependent variable, then the 2nd lag, then 1 lag of DIFF(LEADIND), and finally 1 lag of DIFF(MORTGAGE). The mean
squared error of these models is shown along with those of the other models in
the table below. Values of R-squared for the models have been calculated versus
a number of different reference points. In all cases, the formula for the
R-squared calculation is as follows:

**R-squared = 1 -
MSE/VAR(Y)**

where MSE is the mean
squared error of the model and Y is the reference variable. (Remember that
VAR(Y) is the MSE that would be obtained by applying the constant model to Y.
In order for this calculation to be meaningful, it is of course necessary for
the errors of the model and the reference series Y to be measured in the *same
units*.) R-squared column (1) shows the fraction of the variance of
AUTOSALE/CPI that is explained--i.e., the relative improvement over the
constant model *without* seasonal adjustment. (The final model explains
over 91% of that variance.) R-squared column (2) shows the fraction of the
variance of AUTOADJ/CPI that is explained--i.e., the relative improvement over
the constant model *with* seasonal adjustment. (The final model explains
over 89% of that variance.) R-squared column (3) shows the fraction of the
variance of DIFF(AUTOADJ/CPI) that is explained--i.e.,
the relative improvement over the *random walk* model fitted to the
adjusted, deflated data. The final model explains "only" 23.49% of
this variance, as we pointed out at the very beginning. Thus, although the
final model is quite good at explaining the "original" variance
(i.e., the variance of deflated auto sales), it doesn't improve dramatically
upon the performance of a random walk model fitted to the seasonally adjusted
and deflated sales. It is perhaps fair to say that the
final model adds "fine tuning" to the random walk model in order to
eliminate some of its problems with residual autocorrelation and to bring in
the effects of some exogenous variables, namely LEADIND and MORTGAGE.

| MODEL | MSE | R-sq (1) | R-sq (2) | R-sq (3) | R-sq (4) |
|---|---|---|---|---|---|
| Constant model for AUTOSALE/CPI: MSE=VAR(AUTOSALE/CPI) | 16.4151 | 0.00% | | | |
| Constant model for AUTOADJ/CPI: MSE=VAR(AUTOADJ/CPI) | 13.4914 | 17.81% | 0.00% | | |
| AUTOADJ/CPI regressed on INCOME/CPI | 4.1016 | 75.01% | | | |
| DIFF(AUTOADJ/CPI) regr'd on 0 vars = RW model for AUTOADJ/CPI | 1.90022 | 88.42% | 85.92% | 0.00% | |
| DIFF(AUTOADJ/CPI) regr'd on 1 var (1 lag dependent variable) | 1.67005 | 89.83% | 87.62% | 12.11% | |
| DIFF(AUTOADJ/CPI) regr'd on 2 vars (2 lags dependent variable) | 1.51057 | 90.80% | 88.80% | 20.51% | 0.00% |
| DIFF(AUTOADJ/CPI) regr'd on 3 vars (add lag of DIFF(LEADIND)) | 1.46692 | 91.06% | 89.13% | 22.80% | 2.89% |
| DIFF(AUTOADJ/CPI) regr'd on 4 vars (add lag of DIFF(MORTGAGE)) | 1.45383 | 91.14% | 89.22% | 23.49% | 3.76% |
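Each R-squared column in the table above uses the MSE of a simpler model as its reference variance. The sketch below recomputes the four values for the final model from the reported MSE figures; the dictionary keys are shorthand labels chosen for this illustration.

```python
# MSE values as reported in the table above.
mse = {
    "const AUTOSALE/CPI": 16.4151,
    "const AUTOADJ/CPI": 13.4914,
    "RW (0 vars)": 1.90022,
    "1 lag": 1.67005,
    "2 lags": 1.51057,
    "+ LEADIND": 1.46692,
    "+ MORTGAGE": 1.45383,
}

# Each column's reference variance is the MSE of a simpler benchmark model.
references = {
    "(1)": mse["const AUTOSALE/CPI"],  # constant model, no seasonal adjustment
    "(2)": mse["const AUTOADJ/CPI"],   # constant model, seasonally adjusted
    "(3)": mse["RW (0 vars)"],         # random walk on adjusted, deflated data
    "(4)": mse["2 lags"],              # pure time series model (two lags)
}


def r_squared(model_mse, reference_variance):
    """R-squared = 1 - MSE / VAR(Y) for a chosen reference series Y."""
    return 1.0 - model_mse / reference_variance


final = mse["+ MORTGAGE"]
final_r2 = {col: r_squared(final, ref) for col, ref in references.items()}
for col, value in final_r2.items():
    print(f"R-sq {col}: {value:.2%}")
```

The same model MSE of 1.45383 yields anything from about 91% down to under 4%, depending solely on which benchmark supplies the denominator.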

But wait--if we
look closer, we notice that most of the work in the final regression model is
being done by the two lags of the dependent variable. In other words, nearly
all of the "explanation" of the variance is derived from the history
of auto sales itself! To make this point, we compute a final R-squared value:
column (4) shows the fraction of the variance in the errors of the time series
model (the model that uses only the history of deflated auto sales) that is
explained. As we see, the two exogenous variables explain *less than 4%* of
this variance.

**So, what is a good
value for R-squared?** It
depends on how you measure it! If you measure it as a percentage of the
variance of the "original" (e.g., deflated but otherwise
untransformed) series, then a simple time series model may achieve an R-squared
above 90%. On the other hand, if you measure R-squared as a percentage of a
properly *stationarized* series, then an R-squared of 25% may be quite
respectable. (In fact, an R-squared of 10% or even as little as 5% may be
statistically significant in some applications, such as predicting stock
returns.) If you calculate R-squared as a percentage of the variance in the
errors of the *best time series model* that can be explained by adding
exogenous regressors, you may be disillusioned at how
small this percentage is! Here it was less than 4%, although this was
technically a "statistically significant" reduction, since the
coefficients of the additional regressors were
significantly different from zero.

**What value of
R-squared should you report to your boss or client?** If you used
regression analysis, then to be perfectly candid you should of course include
the R-squared for the regression model that was actually fitted--i.e., the
fraction of the variance of the dependent variable that was explained--along
with the other details of your regression analysis, somewhere in your report.
However, if the original series is nonstationary, and
if the main goal is to predict the *level* (rather than the change or the
percent change) of the series, then it is perfectly appropriate to also report
an "effective" R-squared calculated relative to the variance of the
original series (deflated if appropriate), and this number may be the more
important number for purposes of characterizing the predictive power of your
model. In such cases, most of the predictive
power often comes from the history of the dependent variable (through lags,
differences, and/or seasonal adjustment) rather than from exogenous variables.
This is the reason why we spent some time studying the properties of time
series models before tackling regression models.

**What should never
happen to you:** Don't
ever let yourself fall into the trap of fitting a regression model that has a
respectable-looking R-squared but is actually very much inferior to a simple
time series model. If the dependent variable in your model is a nonstationary time series, be sure that you do a comparison
of error measures against an appropriate time series model.
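The auto sales example illustrates this trap. A minimal sketch, using only the MSE figures reported in the tables above, compares the simple regression on INCOME/CPI against the random walk benchmark. The function name `relative_r_squared` is an illustrative choice.

```python
def relative_r_squared(model_mse, benchmark_mse):
    """1 - MSE / benchmark MSE; a negative value means the model is
    worse than the benchmark."""
    return 1.0 - model_mse / benchmark_mse


# MSE values reported above, all in the same (deflated) units.
var_deflated = 16.4151     # constant model for AUTOSALE/CPI
regression_mse = 4.1016    # AUTOADJ/CPI regressed on INCOME/CPI
random_walk_mse = 1.90022  # random walk model for AUTOADJ/CPI

looks_respectable = relative_r_squared(regression_mse, var_deflated)
versus_random_walk = relative_r_squared(regression_mse, random_walk_mse)

print(f"vs constant model: {looks_respectable:.2%}")
print(f"vs random walk:    {versus_random_walk:.2%}")
if versus_random_walk < 0:
    print("The regression is worse than the naive random walk benchmark.")
```

Measured against the constant model, the regression explains about 75% of the variance, yet its MSE is more than twice that of the random walk model, so its R-squared relative to that benchmark is negative.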