The question is often asked: "what's a good value for R-squared?" Sometimes the claim is even made: "a model is not useful unless its R-squared is at least x", where x may be some fraction greater than 50%. By this standard, the model we fitted to the differenced, deflated, and seasonally adjusted auto sales series is disappointing: its R-squared is less than 25%. So what IS a good value for R-squared? The correct answer to this question is polite laughter followed by: "That depends!"

The term R-squared refers to
the *fraction of variance explained*
by a model, but--what is the relevant variance that demands
explanation?
We have seen by now that there are many *transformations*
that may be applied to a variable before it is used as a dependent
variable in a regression model: deflation, logging, seasonal
adjustment,
differencing. All of these transformations will change the variance
and may also change the *units* in which variance is measured.
Deflation and logging may dramatically change the units of measurement,
while seasonal adjustment and differencing generally reduce the
variance significantly when properly applied. Therefore, if the
dependent variable in the regression model has already been transformed
in some way, it is possible that much of the variance has already
been "explained" merely by the choice of an appropriate
transformation. Seasonal adjustment obviously tries to explain
the seasonal component of the original variance, while differencing
tries to explain changes in the local mean of the series over
time. With respect to which variance should R-squared be measured--that
of the original series, the deflated series, the seasonally adjusted
series, and/or the differenced series? This question does not
always have a clear-cut answer, and as we will see below, there
are usually several reference points that may be of interest in
any particular case.
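To make this concrete, here is a small Python sketch (using synthetic data, not the actual auto sales series) showing how first differencing and seasonal differencing each shrink the variance that an R-squared would be measured against:

```python
import numpy as np

# Synthetic monthly series with a trend and a seasonal cycle (purely
# illustrative; this is not the auto sales data discussed in the text)
rng = np.random.default_rng(0)
t = np.arange(288)                                  # 24 years, monthly
y = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, t.size)

var_raw   = y.var()                    # variance of the original series
var_diff  = np.diff(y).var()           # after first differencing
var_sdiff = (y[12:] - y[:-12]).var()   # after seasonal differencing

# Each transformation "explains" part of the original variance before
# any regression model is ever fitted
print(var_raw, var_diff, var_sdiff)
```

Both differenced variances come out far smaller than the raw variance, which is exactly why the choice of reference variance matters when quoting an R-squared.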

In analyzing the auto sales
series, we have used the period 1970-1993
as the estimation period. The variance of the original series
over this period is 101.1, measured in units of nominal bazillion
dollars squared. (Remember that variance is measured in *squared*
units, because it is the average squared deviation from the mean.
A bazillion is a large number like a billion squared.) Now, in
general, nominal dollars is not a good unit in which to measure
variance because the units become inflated over time. Thus, for
example, the variance of the second half of the series is much
greater than the variance of the first half, if only because of
the effect of inflation. When we are building forecasting models,
we are usually looking for statistical properties of the data
that are relatively *constant*, which gives us a basis for
extrapolating them into the future. If statistics such as the
variance are changing systematically over time, we usually look
for transformations (such as deflating or logging) which will
make them more constant.
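The effect of inflation on the variance can be sketched with made-up numbers (a stable "real" series riding on an inflation index, not the actual AUTOSALE/CPI data):

```python
import numpy as np

# Illustrative data: roughly constant real activity multiplied by a
# ~3%-per-year inflation index (not the actual auto sales series)
rng = np.random.default_rng(1)
n = 240
cpi = 1.03 ** (np.arange(n) / 12)       # inflation index, monthly
real = 10 + rng.normal(0, 1, n)         # stable real activity
nominal = real * cpi

# The variance of the nominal series inflates over time...
var_first_half  = nominal[:n // 2].var()
var_second_half = nominal[n // 2:].var()

# ...but deflating (dividing by the index) makes it roughly constant
deflated = nominal / cpi
print(var_first_half, var_second_half, deflated.var())
```

The second half of the nominal series has the larger variance merely because of the inflationary drift, while the deflated series does not share that problem.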

The initial transformation that
seems most appropriate for the auto
sales series is a **deflation** transformation--i.e., dividing
by the CPI. The variance of the deflated series is 16.4, measured
in units of 1983 bazillion dollars squared. We might be tempted
to say we have already explained a great deal of the variance--and
in a sense we have, because much of the variance in the original
series is due to inflationary growth--but we can't really directly
compare 101.1 against 16.4, because the latter is in arbitrary
units which depend on the choice of a base year. However, if we
now fit models to the deflated data, it will be meaningful to
speak of the fraction of 16.4 that we have managed to explain.

First, let's consider fitting
some naive random-walk and random-trend
models. A random walk or random trend model is just a "constant"
model fitted to some *difference* of the data--in particular,
the random walk model (with growth) is just the constant model
fitted to the *first* difference of the data, the seasonal
random walk model is the constant model fitted to the *seasonal*
difference of the data, and the seasonal random trend model is
the constant model fitted to the first difference of the seasonal
difference. (The constant is usually assumed to be zero in the
latter model, but I'll ignore that minor complication here.) In
all of these models, the mean squared error is essentially just
the variance of the differenced series. Now, in principle we could
use the Multiple Regression procedure to fit these models, by
using the appropriately differenced series as the dependent variable
and estimating a constant only (i.e., no other independent variables).
Technically, the R-squared for these models would be *zero*,
since the mean squared error would be precisely equal to the variance
of the dependent variable. But it is obviously more meaningful
to calculate an "effective" R-squared relative to the
variance of the undifferenced (but deflated) series. The following
table shows the results of such calculations:

| MODEL | MSE | R-squared |
| --- | --- | --- |
| Constant model for AUTOSALE: MSE = VAR(AUTOSALE) | 101.106 | |
| Constant model for AUTOSALE/CPI: MSE = VAR(AUTOSALE/CPI) | 16.4151 | 0.00% |
| SRW model for AUTOSALE/CPI: MSE = VAR(SDIFF(AUTOSALE/CPI,12)) | 5.29805 | 67.72% |
| RW model for AUTOSALE/CPI: MSE = VAR(DIFF(AUTOSALE/CPI)) | 3.44537 | 79.01% |
| SRT model for AUTOSALE/CPI: MSE = VAR(DIFF(SDIFF(AUTOSALE/CPI,12))) | 2.8189 | 82.83% |

The formula for the effective R-squared calculations for these models is as follows:

**R-squared = 1 -
MSE/VAR(AUTOSALE/CPI)**

where MSE is the mean squared error of the model. Notice that the R-squared values for these naive models are quite respectable, ranging from 68% to almost 83%. The best of these models is the seasonal random trend model, which tracks the seasonality as well as the cyclical variations in the series.
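The arithmetic behind the table above can be reproduced directly from the reported MSE values (a sketch using the numbers quoted in the text):

```python
# Effective R-squared of the naive models, measured against the variance
# of the deflated (but undifferenced) series, using the MSEs quoted above
var_deflated = 16.4151                        # VAR(AUTOSALE/CPI)
mse = {"SRW": 5.29805, "RW": 3.44537, "SRT": 2.8189}

r2 = {name: 1 - m / var_deflated for name, m in mse.items()}
for name, value in r2.items():
    print(f"{name}: {100 * value:.2f}%")      # 67.72%, 79.01%, 82.83%
```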

Our next step in fitting a
fancier model is to use **multiplicative
seasonal adjustment** on the deflated series. The variance of
the seasonally adjusted, deflated series turns out to be 13.49,
which is the MSE that would be obtained by applying the constant
model to the seasonally adjusted, deflated series. Thus, by adding
seasonal adjustment to the constant model for the deflated series,
we thereby "explain" 17.81% of its variance. But if
we fit a *random walk* model (with growth) to the seasonally
adjusted, deflated series, the MSE drops to 1.90--i.e., we have
explained 88.42% of the variance in the undifferenced series.
(Note that this is slightly higher than the fraction of variance
explained by the seasonal random trend model: the combination
of seasonal adjustment and a nonseasonal random walk model appears
to do a better job of tracking the average seasonal pattern.)
Also, recall that our first regression model fitted to the adjusted,
deflated data was a simple regression of AUTOADJ/CPI on INCOME/CPI.
This model yielded an MSE of 4.1016 and an R-squared of around
75%, which was inferior to the random walk model.
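A minimal sketch of multiplicative seasonal adjustment is shown below, again with synthetic data rather than the actual deflated auto sales; each observation is divided by an estimated index for its calendar month. (Production seasonal-adjustment procedures use more elaborate ratio-to-moving-average methods, but the variance-reducing effect is the same in spirit.)

```python
import numpy as np

# Illustrative seasonal series: a stable level times a seasonal factor
# times multiplicative noise (not the actual deflated auto sales data)
rng = np.random.default_rng(2)
n = 120
month = np.arange(n) % 12
seasonal = 1 + 0.3 * np.sin(2 * np.pi * month / 12)
y = 10 * seasonal * rng.lognormal(0, 0.05, n)

# Crude multiplicative adjustment: estimate a seasonal index for each
# calendar month as its mean relative to the overall mean, then divide
index = np.array([y[month == m].mean() for m in range(12)]) / y.mean()
adjusted = y / index[month]

# The adjustment itself "explains" the seasonal share of the variance
print(y.var(), adjusted.var())
```

The adjusted series has a much smaller variance than the raw series, which is the sense in which the seasonal adjustment has already done part of the explaining.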

We now proceed to fit several models in which DIFF(AUTOADJ/CPI) is regressed on lags of itself and other variables. A forward stepwise regression adds the 1st lag of the dependent variable, then the 2nd lag, then 1 lag of DIFF(LEADIND), and finally 1 lag of DIFF(MORTGAGE). The mean squared error of these models is shown along with those of the other models in the table below. Values of R-squared for the models have been calculated versus a number of different reference points. In all cases, the formula for the R-squared calculation is as follows:

**R-squared = 1 - MSE/VAR(Y)**

where MSE is the mean squared
error of the model and Y is the
reference variable. (Remember that VAR(Y) is the MSE that would
be obtained by applying the constant model to Y. In order for
this calculation to be meaningful, it is of course necessary for
the errors of the model and the reference series Y to be measured
in the *same units*.) R-squared column (1) shows the fraction
of the variance of AUTOSALE/CPI that is explained--i.e., the relative
improvement over the constant model *without* seasonal
adjustment.
(The final model explains over 91% of that variance.) R-squared
column (2) shows the fraction of the variance of AUTOADJ/CPI that
is explained--i.e., the relative improvement over the constant
model *with* seasonal adjustment. (The final model explains
over 89% of that variance.) R-squared column (3) shows the fraction
of the variance of DIFF(AUTOADJ/CPI) that is explained--i.e.,
the relative improvement over the *random walk* model fitted
to the adjusted, deflated data. The final model explains "only"
23.49% of this variance, as we pointed out at the very beginning.
Thus, although the final model is quite good at explaining the
"original" variance (i.e., the variance of deflated
auto sales), it doesn't improve dramatically upon the performance
of a random walk model fitted to the seasonally adjusted and deflated
sales. It is perhaps fair to say the the final model adds "fine
tuning" to the random walk model in order to eliminate some
of its problems with residual autocorrelation and to bring in
the effects of some exogenous variables, namely LEADIND and MORTGAGE.
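The first three R-squared columns for the final model follow mechanically from the same formula, each with a different reference variance (a sketch using the numbers quoted in the text):

```python
# Effective R-squared of the final 4-variable model against three
# different reference variances, using the quoted MSE/variance values
mse_final = 1.45383
references = {
    "(1) VAR(AUTOSALE/CPI)":      16.4151,
    "(2) VAR(AUTOADJ/CPI)":       13.4914,
    "(3) VAR(DIFF(AUTOADJ/CPI))":  1.90022,
}
r2 = {k: 1 - mse_final / v for k, v in references.items()}
for k, v in r2.items():
    print(f"{k}: {100 * v:.2f}%")   # 91.14%, 89.22%, 23.49%
```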

| MODEL | MSE | R-sq (1) | R-sq (2) | R-sq (3) | R-sq (4) |
| --- | --- | --- | --- | --- | --- |
| Constant model for AUTOSALE/CPI: MSE = VAR(AUTOSALE/CPI) | 16.4151 | 0.00% | | | |
| Constant model for AUTOADJ/CPI: MSE = VAR(AUTOADJ/CPI) | 13.4914 | 17.81% | 0.00% | | |
| AUTOADJ/CPI regressed on INCOME/CPI | 4.1016 | 75.01% | | | |
| DIFF(AUTOADJ/CPI) regr'd on 0 vars = RW model for AUTOADJ/CPI | 1.90022 | 88.42% | 85.92% | 0.00% | |
| DIFF(AUTOADJ/CPI) regr'd on 1 var (1 lag of dependent variable) | 1.67005 | 89.83% | 87.62% | 12.11% | |
| DIFF(AUTOADJ/CPI) regr'd on 2 vars (2 lags of dependent variable) | 1.51057 | 90.80% | 88.80% | 20.51% | 0.00% |
| DIFF(AUTOADJ/CPI) regr'd on 3 vars (add lag of DIFF(LEADIND)) | 1.46692 | 91.06% | 89.13% | 22.80% | 2.89% |
| DIFF(AUTOADJ/CPI) regr'd on 4 vars (add lag of DIFF(MORTGAGE)) | 1.45383 | 91.14% | 89.22% | 23.49% | 3.76% |

But wait--if we look closer, we notice that most of the work in
the final regression model is being done by the two lags of the
dependent variable. In other words, nearly all of the "explanation"
of the variance is derived from the history of auto sales itself!
To make this point, we compute a final R-squared value: column
(4) shows the fraction of the variance in the errors of the time
series model (the model that uses only the history of deflated
auto sales) that is explained. As we see, the two exogenous variables
explain *less than 4%* of this variance.
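Column (4) follows the same pattern, with the pure time-series model's MSE as the reference variance (numbers as quoted in the text):

```python
# Incremental R-squared of the exogenous regressors, measured against
# the error variance of the 2-lag time series model (MSEs quoted above)
mse_timeseries = 1.51057   # 2 lags of the dependent variable only
mse_leadind    = 1.46692   # + 1 lag of DIFF(LEADIND)
mse_final      = 1.45383   # + 1 lag of DIFF(MORTGAGE)

r2_leadind = 1 - mse_leadind / mse_timeseries
r2_final   = 1 - mse_final / mse_timeseries
print(f"{100 * r2_leadind:.2f}%, {100 * r2_final:.2f}%")   # 2.89%, 3.76%
```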

**So, what is a good value for
R-squared?** It depends on how
you measure it! If you measure it as a percentage of the variance
of the "original" (e.g., deflated but otherwise untransformed)
series, then a simple time series model may achieve an R-squared
above 90%. On the other hand, if you measure R-squared as a percentage
of a properly *stationarized* series, then an R-squared of
25% may be quite respectable. (In fact, an R-squared of 10% or
even as little as 5% may be statistically significant in some
applications, such as predicting stock returns.) If you calculate
R-squared as a percentage of the variance in the errors of the
*best time series model* that can be explained by adding
exogenous regressors, you may be disillusioned at how small this
percentage is! Here it was less than 4%, although this was technically
a "statistically significant" reduction, since the coefficients
of the additional regressors were significantly different from
zero.

**What value of R-squared
should you report to your boss or client?**
If you used regression analysis, then to be perfectly candid
you should of course include the R-squared for the regression
model that was actually fitted--i.e., the fraction of the variance
of the dependent variable that was explained--along with the other
details of your regression analysis, somewhere in your report.
However, if the original series is nonstationary, and if the main
goal is to predict the *level* (rather than the change or
the percent change) of the series, then it is perfectly appropriate
to also report an "effective" R-squared calculated relative
to the variance of the original series (deflated if appropriate),
and this number may be the more important number for purposes
of characterizing the predictive power of your model. In such
cases, it will often be the case that most of the predictive power
is derived from the history of the dependent variable (through
lags, differences, and/or seasonal adjustment) rather than from
exogenous variables. This is the reason why we spent some time
studying the properties of time series models before tackling
regression models.

**What should never happen to
you:** Don't ever let yourself
fall into the trap of fitting a regression model that has a
respectable-looking
R-squared but is actually very much inferior to a simple time
series model. If the dependent variable in your model is a
nonstationary
time series, be sure that you do a comparison of error measures
against an appropriate time series model.