What's a good value for R-squared?




The question is often asked: "what's a good value for R-squared?" Sometimes the claim is even made: "a model is not useful unless its R-squared is at least x", where x may be some fraction greater than 50%. By this standard, the model we fitted to the differenced, deflated, and seasonally adjusted auto sales series is disappointing: its R-squared is less than 25%. So what IS a good value for R-squared? The correct answer to this question is polite laughter followed by: "That depends!"

The term R-squared refers to the fraction of variance explained by a model, but--what is the relevant variance that demands explanation? We have seen by now that there are many transformations that may be applied to a variable before it is used as a dependent variable in a regression model: deflation, logging, seasonal adjustment, differencing. All of these transformations will change the variance and may also change the units in which variance is measured. Deflation and logging may dramatically change the units of measurement, while seasonal adjustment and differencing generally reduce the variance significantly when properly applied. Therefore, if the dependent variable in the regression model has already been transformed in some way, it is possible that much of the variance has already been "explained" merely by the choice of an appropriate transformation. Seasonal adjustment obviously tries to explain the seasonal component of the original variance, while differencing tries to explain changes in the local mean of the series over time. With respect to which variance should R-squared be measured--that of the original series, the deflated series, the seasonally adjusted series, and/or the differenced series? This question does not always have a clear-cut answer, and as we will see below, there are usually several reference points that may be of interest in any particular case.


In analyzing the auto sales series, we have used the period 1970-1993 as the estimation period. The variance of the original series over this period is 101.1, measured in units of nominal bazillion dollars squared. (Remember that variance is measured in squared units, because it is the average squared deviation from the mean. A bazillion is a large number like a billion squared.) Now, in general, nominal dollars is not a good unit in which to measure variance because the units become inflated over time. Thus, for example, the variance of the second half of the series is much greater than the variance of the first half, if only because of the effect of inflation. When we are building forecasting models, we are usually looking for statistical properties of the data that are relatively constant, which gives us a basis for extrapolating them into the future. If statistics such as the variance are changing systematically over time, we usually look for transformations (such as deflating or logging) which will make them more constant.

The initial transformation that seems most appropriate for the auto sales series is a deflation transformation--i.e., dividing by the CPI. The variance of the deflated series is 16.4, measured in units of 1983 bazillion dollars squared. We might be tempted to say we have already explained a great deal of the variance--and in a sense we have, because much of the variance in the original series is due to inflationary growth--but we can't really directly compare 101.1 against 16.4, because the latter is in arbitrary units which depend on the choice of a base year. However, if we now fit models to the deflated data, it will be meaningful to speak of the fraction of 16.4 that we have managed to explain.

First, let's consider fitting some naive random-walk and random-trend models. A random walk or random trend model is just a "constant" model fitted to some difference of the data--in particular, the random walk model (with growth) is just the constant model fitted to the first difference of the data, the seasonal random walk model is the constant model fitted to the seasonal difference of the data, and the seasonal random trend model is the constant model fitted to the first difference of the seasonal difference. (The constant is usually assumed to be zero in the latter model, but I'll ignore that minor complication here.) In all of these models, the mean squared error is essentially just the variance of the differenced series. Now, in principle we could use the Multiple Regression procedure to fit these models, by using the appropriately differenced series as the dependent variable and estimating a constant only (i.e., no other independent variables). Technically, the R-squared for these models would be zero, since the mean squared error would be precisely equal to the variance of the dependent variable. But it is obviously more meaningful to calculate an "effective" R-squared relative to the variance of the undifferenced (but deflated) series. The following table shows the results of such calculations:

MODEL                                                                     MSE       R-squared
Constant model for AUTOSALE:     MSE = VAR(AUTOSALE)                      101.106
Constant model for AUTOSALE/CPI: MSE = VAR(AUTOSALE/CPI)                  16.4151    0.00%
SRW model for AUTOSALE/CPI:      MSE = VAR(SDIFF(AUTOSALE/CPI,12))        5.29805   67.72%
RW model for AUTOSALE/CPI:       MSE = VAR(DIFF(AUTOSALE/CPI))            3.44537   79.01%
SRT model for AUTOSALE/CPI:      MSE = VAR(DIFF(SDIFF(AUTOSALE/CPI,12)))  2.8189    82.83%

The formula for the effective R-squared calculations for these models is as follows:

R-squared = 1 - MSE/VAR(AUTOSALE/CPI)

where MSE is the mean squared error of the model. Notice that the R-squared values for these naive models are quite respectable, ranging from 68% to almost 83%. The best of these models is the seasonal random trend model, which tracks the seasonality as well as the cyclical variations in the series.
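As a quick check, the effective R-squared values in the table can be reproduced from the MSE figures alone. The following is a minimal Python sketch; the function and variable names are illustrative and not part of the original analysis, and the MSE values are simply copied from the table.

```python
# MSE values copied from the table above (estimation period 1970-1993).
VAR_DEFLATED = 16.4151  # VAR(AUTOSALE/CPI), the reference variance

naive_mse = {
    "SRW": 5.29805,  # VAR(SDIFF(AUTOSALE/CPI,12))
    "RW": 3.44537,   # VAR(DIFF(AUTOSALE/CPI))
    "SRT": 2.8189,   # VAR(DIFF(SDIFF(AUTOSALE/CPI,12)))
}


def effective_r_squared(mse, reference_variance):
    """Fraction of the reference variance explained: 1 - MSE/VAR."""
    return 1.0 - mse / reference_variance


# Effective R-squared of each naive model relative to the deflated series.
effective = {
    name: effective_r_squared(mse, VAR_DEFLATED)
    for name, mse in naive_mse.items()
}

for name, r2 in effective.items():
    print(f"{name}: {100 * r2:.2f}%")
```

The printed percentages should agree with the R-squared column of the table to two decimal places.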


Our next step in fitting a fancier model is to use multiplicative seasonal adjustment on the deflated series. The variance of the seasonally adjusted, deflated series turns out to be 13.49, which is the MSE that would be obtained by applying the constant model to the seasonally adjusted, deflated series. Thus, by adding seasonal adjustment to the constant model for the deflated series, we thereby "explain" 17.81% of its variance. But if we fit a random walk model (with growth) to the seasonally adjusted, deflated series, the MSE drops to 1.90--i.e., we have explained 88.42% of the variance in the undifferenced series. (Note that this is slightly higher than the fraction of variance explained by the seasonal random trend model: the combination of seasonal adjustment and a nonseasonal random walk model appears to do a better job of tracking the average seasonal pattern.) Also, recall that our first regression model fitted to the adjusted, deflated data was a simple regression of AUTOADJ/CPI on INCOME/CPI. This model yielded an MSE of 4.1016 and an R-squared of around 75%, which was inferior to the random walk model.

We now proceed to fit several models in which DIFF(AUTOADJ/CPI) is regressed on lags of itself and other variables. A forward stepwise regression adds the 1st lag of the dependent variable, then the 2nd lag, then 1 lag of DIFF(LEADIND), and finally 1 lag of DIFF(MORTGAGE). The mean squared errors of these models are shown along with those of the other models in the table below. Values of R-squared for the models have been calculated versus a number of different reference points. In all cases, the formula for the R-squared calculation is as follows:

R-squared = 1 - MSE/VAR(Y)

where MSE is the mean squared error of the model and Y is the reference variable. (Remember that VAR(Y) is the MSE that would be obtained by applying the constant model to Y. In order for this calculation to be meaningful, it is of course necessary for the errors of the model and the reference series Y to be measured in the same units.) R-squared column (1) shows the fraction of the variance of AUTOSALE/CPI that is explained--i.e., the relative improvement over the constant model without seasonal adjustment. (The final model explains over 91% of that variance.) R-squared column (2) shows the fraction of the variance of AUTOADJ/CPI that is explained--i.e., the relative improvement over the constant model with seasonal adjustment. (The final model explains over 89% of that variance.) R-squared column (3) shows the fraction of the variance of DIFF(AUTOADJ/CPI) that is explained--i.e., the relative improvement over the random walk model fitted to the adjusted, deflated data. The final model explains "only" 23.49% of this variance, as we pointed out at the very beginning. Thus, although the final model is quite good at explaining the "original" variance (i.e., the variance of deflated auto sales), it doesn't improve dramatically upon the performance of a random walk model fitted to the seasonally adjusted and deflated sales. It is perhaps fair to say that the final model adds "fine tuning" to the random walk model in order to eliminate some of its problems with residual autocorrelation and to bring in the effects of some exogenous variables, namely LEADIND and MORTGAGE.

MODEL                                                             MSE       R-sq (1)  R-sq (2)  R-sq (3)  R-sq (4)
Constant model for AUTOSALE/CPI: MSE=VAR(AUTOSALE/CPI)            16.4151    0.00%
Constant model for AUTOADJ/CPI:  MSE=VAR(AUTOADJ/CPI)             13.4914   17.81%     0.00%
AUTOADJ/CPI regressed on INCOME/CPI                               4.1016    75.01%
DIFF(AUTOADJ/CPI) regr'd on 0 vars = RW model for AUTOADJ/CPI     1.90022   88.42%    85.92%     0.00%
DIFF(AUTOADJ/CPI) regr'd on 1 var  (1 lag dependent variable)     1.67005   89.83%    87.62%    12.11%
DIFF(AUTOADJ/CPI) regr'd on 2 vars (2 lags dependent variable)    1.51057   90.80%    88.80%    20.51%     0.00%
DIFF(AUTOADJ/CPI) regr'd on 3 vars (add lag of DIFF(LEADIND))     1.46692   91.06%    89.13%    22.80%     2.89%
DIFF(AUTOADJ/CPI) regr'd on 4 vars (add lag of DIFF(MORTGAGE))    1.45383   91.14%    89.22%    23.49%     3.76%

But wait--if we look closer, we notice that most of the work in the final regression model is being done by the two lags of the dependent variable. In other words, nearly all of the "explanation" of the variance is derived from the history of auto sales itself! To make this point, we compute a final R-squared value: column (4) shows the fraction of the variance in the errors of the time series model (the model that uses only the history of deflated auto sales) that is explained. As we see, the two exogenous variables explain less than 4% of this variance.
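To make the role of the reference point concrete, here is a minimal Python sketch that evaluates the same final-model MSE against each of the four reference MSEs used in the table. The names are illustrative only, and the numbers are copied from the table rather than recomputed from the data.

```python
# Reference MSEs copied from the table: each one is the MSE of a simpler
# model that the final regression model is being compared against.
reference_mse = {
    "(1) constant model, AUTOSALE/CPI": 16.4151,
    "(2) constant model, AUTOADJ/CPI": 13.4914,
    "(3) random walk model, AUTOADJ/CPI": 1.90022,
    "(4) time series model with 2 lags": 1.51057,
}

FINAL_MSE = 1.45383  # 4-variable regression on DIFF(AUTOADJ/CPI)


def r_squared_vs(mse, reference):
    """Relative reduction in MSE versus a reference model: 1 - MSE/reference."""
    return 1.0 - mse / reference


# Same model, same errors, four different "R-squared" values.
final_r2 = {
    label: r_squared_vs(FINAL_MSE, ref)
    for label, ref in reference_mse.items()
}

for label, r2 in final_r2.items():
    print(f"R-squared vs {label}: {100 * r2:.2f}%")
```

The model and its errors are identical in all four calculations; only the denominator changes, which is why the same model can be described as explaining over 91% of the variance or less than 4% of it.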


So, what is a good value for R-squared? It depends on how you measure it! If you measure it as a percentage of the variance of the "original" (e.g., deflated but otherwise untransformed) series, then a simple time series model may achieve an R-squared above 90%. On the other hand, if you measure R-squared as a percentage of the variance of a properly stationarized series, then an R-squared of 25% may be quite respectable. (In fact, an R-squared of 10% or even as little as 5% may be statistically significant in some applications, such as predicting stock returns.) If you calculate R-squared as the percentage of the variance in the errors of the best time series model that can be explained by adding exogenous regressors, you may be disillusioned at how small this percentage is! Here it was less than 4%, although this was technically a "statistically significant" reduction, since the coefficients of the additional regressors were significantly different from zero.

What value of R-squared should you report to your boss or client? If you used regression analysis, then to be perfectly candid you should of course include the R-squared for the regression model that was actually fitted--i.e., the fraction of the variance of the dependent variable that was explained--along with the other details of your regression analysis, somewhere in your report. However, if the original series is nonstationary, and if the main goal is to predict the level (rather than the change or the percent change) of the series, then it is perfectly appropriate to also report an "effective" R-squared calculated relative to the variance of the original series (deflated if appropriate), and this number may be the more important number for purposes of characterizing the predictive power of your model. In such cases, most of the predictive power is often derived from the history of the dependent variable (through lags, differences, and/or seasonal adjustment) rather than from exogenous variables. This is the reason why we spent some time studying the properties of time series models before tackling regression models.

What should never happen to you: Don't ever let yourself fall into the trap of fitting a regression model that has a respectable-looking R-squared but is actually very much inferior to a simple time series model. If the dependent variable in your model is a nonstationary time series, be sure that you do a comparison of error measures against an appropriate time series model.
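The comparison recommended above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the variance uses a divisor of n (the "average squared deviation" described earlier; a given software package may use n-1 instead), and the random walk model with growth stands in for "an appropriate time series model". The example figures are those of the regression of AUTOADJ/CPI on INCOME/CPI from the tables above.

```python
def variance(values):
    """Average squared deviation from the mean (divisor n, an assumption)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def random_walk_mse(series):
    """MSE of a random walk model with growth: variance of the first difference."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    return variance(diffs)


def compare_with_benchmark(model_mse, level_variance, benchmark_mse):
    """Report R-squared versus the level and versus a time series benchmark."""
    return {
        "r2_vs_level": 1.0 - model_mse / level_variance,
        "r2_vs_benchmark": 1.0 - model_mse / benchmark_mse,
        "beats_benchmark": model_mse < benchmark_mse,
    }


# Figures from the tables above: the INCOME/CPI regression (MSE 4.1016),
# VAR(AUTOSALE/CPI) = 16.4151, and the random walk model (MSE 1.90022).
report = compare_with_benchmark(4.1016, 16.4151, 1.90022)
for key, value in report.items():
    print(key, value)
```

With these figures the regression looks respectable against the level of the series (an R-squared of about 75%), yet its R-squared relative to the random walk benchmark is negative, because its MSE is more than twice as large. When the raw series is available, random_walk_mse(series) supplies the benchmark MSE directly.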