What’s a good value for R-squared?
The question is often asked: "what's a good value for R-squared?" or “how big does R-squared need to be for the regression model to be valid?” Sometimes the claim is even made: "a model is not useful unless its R-squared is at least x", where x may be some fraction greater than 50%. The correct response to this question is polite laughter followed by: "That depends!" A former student of mine landed a job at a top consulting firm by being the only candidate who gave that answer during his interview.
R-squared is the “percent of variance explained” by the model. That is, R-squared is the fraction by which the variance of the errors is less than the variance of the dependent variable. (The latter number would be the error variance for an intercept-only model.) It is called R-squared because in a simple regression model it is just the square of the correlation between the dependent and independent variables, which is commonly denoted by “r”. In a multiple regression model R-squared is determined by the pairwise correlations among all the variables, including the correlations of the independent variables with each other as well as with the dependent variable.
Now, what is the relevant variance that requires explanation, and how much or how little explanation is necessary or useful? We have seen by now that there are many transformations that may be applied to a variable before it is used as a dependent variable in a regression model: deflation, logging, seasonal adjustment, differencing. All of these transformations will change the variance and may also change the units in which variance is measured. Deflation and logging may dramatically change the units of measurement, while seasonal adjustment and differencing generally reduce the variance significantly when properly applied. Therefore, if the dependent variable in the regression model has already been transformed in some way, it is possible that much of the variance has already been "explained" merely by the choice of an appropriate transformation. Seasonal adjustment obviously tries to explain the seasonal component of the original variance, while differencing tries to explain changes in the local mean of the series over time. With respect to which variance should R-squared be measured--that of the original series, the deflated series, the seasonally adjusted series, and/or the differenced series? This question does not always have a clear-cut answer, and as we will see below, there are usually several reference points that may be of interest in any particular case.
Variance is a hard quantity to think about because it is measured in squared units (dollars squared, widgets squared….). It is easier to think in terms of standard deviations, because they are measured in the same units as the variables and they directly determine the widths of confidence intervals. So, it is more useful to measure “percent of standard deviation explained,” i.e., the percent by which the standard deviation of the errors is less than the standard deviation of the dependent variable. This is equal to one minus the square root of 1-minus-R-squared. Here is a table that shows the conversion:
For example, if the model’s R-squared is 90%, the variance of its errors is 90% less than the variance of the dependent variable and the standard deviation of its errors is 68% less than the standard deviation of the dependent variable. That is, the standard deviation of the regression model’s errors is about 1/3 the size of the standard deviation of the errors that you would get with an intercept-only model. That’s very good, but it doesn’t sound quite as impressive as “NINETY PERCENT EXPLAINED!”. If the model’s R-squared is 75%, the standard deviation of the errors is exactly one-half of the standard deviation of the dependent variable. Notice that for small values (R-squared less than 25%), the percent of standard deviation explained is roughly one-half of the percent of variance explained.
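The conversion between “percent of variance explained” and “percent of standard deviation explained” can be tabulated with a short sketch. It uses only the formula stated above (one minus the square root of one-minus-R-squared); the function name and the particular R-squared values are my own choices for illustration.

```python
import math

def pct_sd_explained(r_squared):
    """Fraction by which the error standard deviation is below that of the dependent variable."""
    return 1.0 - math.sqrt(1.0 - r_squared)

# Print a small conversion table: R-squared vs. percent of standard deviation explained.
for r2 in (0.10, 0.25, 0.50, 0.75, 0.90, 0.99):
    print(f"R-squared {r2:5.0%}  ->  std. dev. explained {pct_sd_explained(r2):6.1%}")
```

For small inputs the second column comes out at roughly half of the first, and at 75% it is exactly one-half.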
Generally it is better to look at adjusted R-squared rather than R-squared and to look at the standard error of the regression rather than the standard deviation of the errors. However, the difference between them is usually very small unless you are trying to estimate too many coefficients from too small a sample. Adjusted R-squared bears the same relation to the standard error of the regression that R-squared bears to the standard deviation of the errors, so the same table applies to the former pair as well as the latter pair.
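To make the parallel between the two pairs of statistics concrete, here is a minimal sketch using the standard textbook definitions of adjusted R-squared and the standard error of the regression. These formulas are not spelled out in the text above, and the sample size, number of predictors, and sums of squares are hypothetical numbers chosen only for illustration.

```python
import math

def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for n observations and k coefficients besides the intercept."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)

def regression_standard_error(sse, n, k):
    """Standard error of the regression: sqrt of SSE over the residual degrees of freedom."""
    return math.sqrt(sse / (n - k - 1))

# Hypothetical numbers: 30 observations, 2 predictors, total SS = 200, error SS = 50.
n, k, sst, sse = 30, 2, 200.0, 50.0
r2 = 1.0 - sse / sst
adj_r2 = adjusted_r_squared(r2, n, k)
ser = regression_standard_error(sse, n, k)
sd_y = math.sqrt(sst / (n - 1))   # sample standard deviation of the dependent variable

# Adjusted R-squared is the fraction by which ser**2 is below the sample variance of y,
# just as R-squared is the fraction by which the error variance is below the variance of y.
print(r2, adj_r2, 1.0 - ser**2 / sd_y**2)
```

With only 2 predictors and 30 observations the two versions of R-squared differ by less than two percentage points, which is the "usually very small" difference mentioned above.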
How big an R-squared is “big enough”? That depends on the decision-making situation, and it depends on your objectives or needs, and it depends on how the dependent variable is defined. In some situations it might be reasonable to hope and expect to explain 99% of the variance, or equivalently 90% of the standard deviation of the dependent variable. In other cases, you might consider yourself to be doing very well if you explained 10% of the variance, or equivalently 5% of the standard deviation, or perhaps even less. The following section gives an example that highlights these issues.
An example in which R-squared is a poor guide to analysis: Consider the U.S. monthly auto sales series that was used for illustration in the first chapter of these notes, whose graph is reproduced here:
The units are $billions and the date range shown here is from January 1970 to February 1996. Suppose that the objective of the analysis is to predict monthly auto sales from monthly total personal income. I am using these variables (and this antiquated date range) for two reasons: (i) this very example was used to illustrate the benefits of regression analysis in a textbook that I was using in that era, and (ii) I have seen many students undertake self-designed forecasting projects in which they have blindly fitted regression models using macroeconomic indicators such as personal income, gross domestic product, unemployment, and stock prices as predictors of nearly everything, the logic being that they reflect the general state of the economy and therefore have implications for every kind of business activity. Perhaps so, but the question is whether they do it in a linear, additive fashion that stands out against the background noise in the variable that is to be predicted and which yields useful predictions and inferences in the process, in comparison to other ways in which you might choose to spend your time.
The corresponding graph of personal income (also in $billions) looks like this:
There is no seasonality in the income data. In fact, there is almost no pattern in it at all except for a trend that increased slightly in the earlier years. (This is not a good sign if we hope to get forecasts that have any specificity.) By comparison, the seasonal pattern is the most striking feature in the auto sales, so the first thing that needs to be done is to seasonally adjust the latter. Seasonally adjusted auto sales (independently obtained from the same government source) and personal income line up like this when plotted on the same graph:
The strong and generally similar-looking trends suggest that we will get a very high value of R-squared if we regress sales on income, and indeed we do. Here is the summary table for that regression:
Adjusted R-squared is almost 97%! However, a result like this is to be expected when regressing a strongly trended series on any other strongly trended series, regardless of whether they are logically related. Here are the line fit plot and residuals-vs-time plot for the model:
The residual-vs-time plot indicates that the model has a couple of terrible problems. First, the variance of the errors increases steadily over time, which means that confidence intervals for forecasts in the near future will be way too narrow (being based on average error sizes over the whole history of the series). The reason for the increase in variance is that random variations in auto sales (like most other measures of macroeconomic activity) tend to be consistent over time in percentage terms rather than absolute terms, and the absolute level of the series has risen dramatically due to a combination of inflationary growth and real growth. As the level has grown, the variance of the random fluctuations has grown with it. Second, there is very strong positive autocorrelation in the errors, i.e., a tendency to make the same error many times in a row. In fact, the lag-1 autocorrelation is 0.77 for this model. It is clear why this happens: the two curves do not have exactly the same shape. The trend in the auto sales series tends to vary over time while the trend in income is much more consistent. To make matters even worse, the model’s largest errors have occurred in the last few periods (at the “business end” of the data, as I like to say), which means we can expect the forecast errors to be huge in the immediate future too. So, despite the high value of R-squared, this is a very bad model.
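The combination of a high R-squared and badly autocorrelated errors is easy to reproduce with made-up data. The sketch below regresses one synthetic trended series on another whose trend has a different shape; the two series are my own invention and are not the auto sales or income data, and the helper names are hypothetical.

```python
def simple_regression(x, y):
    """Ordinary least squares of y on x; return (intercept, slope, R-squared, residuals)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    sst = sum((b - my) ** 2 for b in y)
    sse = sum(e ** 2 for e in residuals)
    return intercept, slope, 1.0 - sse / sst, residuals

def lag1_autocorrelation(e):
    """Sample lag-1 autocorrelation of a series."""
    n = len(e)
    m = sum(e) / n
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in e)
    return num / den

# Two synthetic trended series with different shapes (assumed data, for illustration only).
t = list(range(100))
x = [float(v) for v in t]          # steady linear trend
y = [(v / 10.0) ** 2 for v in t]   # trend that steepens over time

_, _, r2, resid = simple_regression(x, y)
print(round(r2, 3), round(lag1_autocorrelation(resid), 3))
```

R-squared comes out above 90% even though nothing but the shared upward drift connects the two series, and the residuals are strongly positively autocorrelated because the two curves do not have the same shape.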
One way to try to improve the model would be to deflate both series first. This would at least eliminate the inflationary component of growth, which hopefully will make the variance of the errors more consistent over time. Here is a time series plot showing auto sales and personal income after they have been deflated by dividing them by the U.S. all-product consumer price index (CPI) at each point in time, with the CPI normalized to a value of 1.0 in February 1996 (the last row of the data). This does indeed flatten out the trend somewhat, and it also brings out some fine detail in the month-to-month variations that was not so apparent on the original plot. In particular, we begin to see some small bumps and wiggles in the income data that roughly line up with larger bumps and wiggles in the auto sales data.
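The deflation step described above amounts to one division per period. Here is a minimal sketch, assuming the nominal series and the price index are stored as equal-length lists; the numbers are hypothetical and are not the actual sales or CPI values.

```python
def deflate(nominal, cpi):
    """Divide each value by the CPI rescaled so that the last period equals 1.0."""
    base = cpi[-1]
    return [value / (index / base) for value, index in zip(nominal, cpi)]

# Hypothetical values for illustration only (not the actual auto sales or CPI data).
sales_nominal = [10.0, 12.0, 15.0, 20.0]
cpi = [40.0, 80.0, 120.0, 160.0]

sales_real = deflate(sales_nominal, cpi)
print(sales_real)   # expressed in dollars of the final period
```

Because the index is normalized to 1.0 in the last row, the final value is unchanged and the earlier values are scaled up into final-period dollars.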
If we fit a simple regression model to these two variables, the following results are obtained:
Adjusted R-squared is only 0.788 for this model, which is worse, right? Well, no. We “explained” some of the variance in the original data by deflating it prior to fitting this model. Because the dependent variables are not the same, it is not appropriate to do a head-to-head comparison of R-squared. Arguably this is a better model, because it separates out the real growth in sales from the inflationary growth, and also because the errors have a more consistent variance over time. (The latter issue is not the bottom line, but it is a step in the direction of fixing the model assumptions.) Most interestingly, the deflated income data shows some fine detail that matches up with similar patterns in the sales data. However, the error variance is still a long way from being constant over the full two-and-a-half decades, and the problems of badly autocorrelated errors and a particularly bad fit to the most recent data have not been solved.
Another statistic that we might be tempted to compare between these two models is the standard error of the regression, which normally is the best bottom-line statistic to focus on. The second model’s standard error is much larger: 3.253 vs. 2.218 for the first model. But wait… these two numbers cannot be directly compared, either, because they are not measured in the same units. The standard error of the first model is measured in units of current dollars, while the standard error of the second model is measured in units of 1996 dollars. Those were decades of high inflation, and 1996 dollars were not worth nearly as much as dollars were worth in the earlier years. (In fact, a 1996 dollar was only worth about one-quarter of a 1970 dollar.)
Let’s now try something totally different: fitting a simple time series model to the deflated data. In particular, let’s fit a random-walk-with-drift model, which is logically equivalent to fitting an intercept-only model to the first difference (period to period change) in the original series. Let the differenced series be called AUTOSALES_SADJ_1996_DOLLARS_DIFF1 (which is the name that would be automatically assigned in RegressIt). Notice that we are now 3 levels deep in data transformations: seasonal adjustment, deflation, and differencing! This sort of situation is very common in time series analysis. Here are the results of fitting this model, in which AUTOSALES_SADJ_1996_DOLLARS_DIFF1 is the dependent variable and there are no independent variables, just the intercept. This model merely predicts that each monthly difference will be the same, i.e., it predicts constant growth relative to the previous month’s value.
Adjusted R-squared has dropped to zero! This is not a problem: an intercept-only regression always has an R-squared of zero, but that doesn’t necessarily imply that it is not a good model for the particular dependent variable that has been used. We should look instead at the standard error of the regression. The units and sample of the dependent variable are the same for this model as for the previous one, so their regression standard errors can be legitimately compared. The regression standard error of this model is only 2.111, compared to 3.253 for the previous one, a reduction of roughly one-third, which is a very significant improvement. (The residual-vs-time plots for this model and the previous one have the same vertical scaling: look at them both and compare the size of the errors, particularly those that have occurred recently.) The reason why this model’s forecasts are so much more accurate is that it looks at last month’s actual sales values, whereas the previous model only looked at personal income data. It is often the case that the best information about where a time series is going to go next is where it has been lately.
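The equivalence between the random-walk-with-drift model and an intercept-only regression on the differenced series can be sketched in a few lines. The series below is hypothetical (it is not the auto sales data), and the function name is my own.

```python
import math

def intercept_only_fit(y):
    """Fit the mean model; return (intercept, R-squared, regression standard error)."""
    n = len(y)
    mean = sum(y) / n
    sse = sum((v - mean) ** 2 for v in y)
    sst = sse                             # for the mean model the residual and total sums of squares coincide
    r_squared = 1.0 - sse / sst           # hence always zero
    std_error = math.sqrt(sse / (n - 1))  # one degree of freedom is used by the intercept
    return mean, r_squared, std_error

# Hypothetical series for illustration only.
series = [100.0, 103.0, 101.0, 106.0, 108.0, 107.0, 112.0]
diffs = [b - a for a, b in zip(series, series[1:])]

drift, r2, se = intercept_only_fit(diffs)
forecast_next = series[-1] + drift   # random-walk-with-drift forecast for the next period
print(drift, r2, se, forecast_next)
```

The estimated intercept is the average period-to-period change, R-squared is zero by construction, and the standard error is the number to compare against other models fitted to a dependent variable in the same units.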
There is no line fit plot for this model, because there is no independent variable, but here is the residual-versus-time plot:
These residuals look quite random to the naked eye, but they actually exhibit negative autocorrelation, i.e., a tendency to alternate between overprediction and underprediction from one month to the next. (The lag-1 autocorrelation here is -0.356.) This often happens when differenced data is used, but overall the errors of this model are much closer to being independently and identically distributed than those of the previous two, so we can have a good deal more confidence in any confidence intervals for forecasts that may be computed from it. The model does not shed light on the relationship between personal income and auto sales, but neither did the previous two models, to be perfectly honest.
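One well-known mechanism behind negative autocorrelation in differenced data is that differencing a series whose period-to-period noise is independent produces changes that tend to reverse themselves. The sketch below shows the extreme case of differencing pure noise, where the theoretical lag-1 autocorrelation of the differences is -0.5. This is offered as an illustration of the general tendency, not as a diagnosis of the auto sales model; the simulated data, seed, and helper name are my own assumptions.

```python
import random

def lag1_autocorrelation(e):
    """Sample lag-1 autocorrelation of a series."""
    n = len(e)
    m = sum(e) / n
    num = sum((e[t] - m) * (e[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in e)
    return num / den

rng = random.Random(0)                                # fixed seed so the run is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # simulated stationary series (assumed data)
diffs = [b - a for a, b in zip(noise, noise[1:])]     # first difference of the series

print(round(lag1_autocorrelation(noise), 2))   # near 0 in theory
print(round(lag1_autocorrelation(diffs), 2))   # near -0.5 in theory
```

A real series that is part random walk and part noise would be expected to land somewhere between 0 and -0.5 after differencing.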
So, what is the relationship between auto sales and personal income? That is a complex question and it will not be further pursued here except to note that there are some other simple things we could do besides fitting a regression model. For example, we could compute the percentage of income spent on automobiles over time, i.e., just divide the auto sales series by the personal income series and see what the pattern looks like. Here is the resulting picture:
This chart nicely illustrates cyclical variations in the fraction of income spent on autos, which would be interesting to try to match up with other explanatory variables.
The bottom line here is that R-squared was not of any use in guiding us through this particular analysis toward better and better models. At various stages of the analysis, data transformations were suggested: seasonal adjustment, deflating, differencing. (Logging was not tried here, but would have been an alternative to deflation.) And every time the dependent variable is transformed, it becomes impossible to make meaningful before-and-after comparisons of R-squared. Furthermore, regression was probably not even the best tool to use here in order to study the relation between the two variables. It is not a “universal wrench” that should be used on every problem.
So, what IS a good value for R-squared? It depends on the variable with respect to which you measure it, and it depends on the decision-making context. If the dependent variable is a nonstationary (e.g., trending or random-walking) time series, an R-squared value very close to 1 (such as the 97% figure obtained in the first model above) may not be very impressive. On the other hand, if the dependent variable is a properly stationarized series (e.g., differences or percentage differences rather than levels), then an R-squared of 25% may be quite good. In fact, an R-squared of 10% or even less could have some information value when you are looking for a weak signal in the presence of a lot of noise and the situation is such that even a weak signal would be informative or valuable.
However, it is very important to do honest out-of-sample testing of models with very low values of R-squared, i.e., see how they perform when applied to a substantial sample of data that was not used either in the descriptive phase or the estimation phase of the analysis, to be sure the data was not merely over-fitted. It is easy to find “spurious” (i.e., accidental) correlations if you go on an extended fishing expedition in a large pool of variables. I have had students attempt to predict stock returns using regression models--which I do not recommend--and it is not uncommon for them to find models that yield R-squared values in the range of 5% to 10%, but they virtually never survive out-of-sample testing. (Buy the market index instead!)
Some software has built-in features for out-of-sample testing. If yours doesn’t, then you will need to use its forecasting option to generate forecasts for values of the dependent variable that have been excluded from the sample that has been fitted. Then do your own calculations of the errors and their root-mean-squared value, and compare that root-mean-squared value against the standard error of the regression. Ideally it should be approximately the same, or at least not too much larger in percentage terms. These calculations are not hard to do on a spreadsheet.
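The do-it-yourself out-of-sample check described above reduces to a few lines of arithmetic. In this sketch the holdout values, the forecasts, and the in-sample standard error are all hypothetical numbers; in practice they would come from your own model output.

```python
import math

def rmse(actual, forecast):
    """Root-mean-squared forecast error."""
    errors = [a - f for a, f in zip(actual, forecast)]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Hypothetical holdout values and forecasts, plus a hypothetical in-sample standard error.
holdout_actual = [52.0, 55.0, 51.0, 58.0]
holdout_forecast = [50.0, 54.0, 53.0, 55.0]
standard_error_of_regression = 2.0

out_of_sample_rmse = rmse(holdout_actual, holdout_forecast)
ratio = out_of_sample_rmse / standard_error_of_regression
print(round(out_of_sample_rmse, 3), round(ratio, 3))   # a ratio well above 1 hints at overfitting
```

A holdout sample this small is only for showing the arithmetic; a meaningful test needs a substantial number of excluded observations.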
When working with time series data, if you compare the standard deviation of the errors of a regression model which uses exogenous predictors against that of a simple time series model (say, an autoregressive or exponential smoothing or random walk model), you may be disappointed by what you find. If the variable to be predicted is a time series, it will often be the case that most of the predictive power is derived from its own history via lags, differences, and/or seasonal adjustment. This is the reason why we spent some time studying the properties of time series models before tackling regression models.
A rule of thumb for small values of R-squared: If R-squared is small (say 25% or less), then the fraction by which the standard deviation of the errors is less than the standard deviation of the dependent variable is approximately one-half of R-squared, as shown in the table above. So, for example, if your model has an R-squared of 10%, then its errors are only about 5% smaller on average than those of an intercept-only model, which merely predicts that everything will equal the mean. Another handy reference point: if the model has an R-squared of 75%, its errors are 50% smaller on average than those of an intercept-only model. (This is not an approximation: it follows directly from the fact that reducing the error standard deviation to ½ of its former value is equivalent to reducing its variance to ¼ of its former value.)
What value of R-squared should you report to your boss or client? If you used regression analysis, then to be perfectly candid you should of course include the R-squared for the regression model that was actually fitted, along with other details of the output, somewhere in your report. You should more strongly emphasize the standard error of the regression, though, because that measures the predictive accuracy of the model in real terms, and it scales the width of all confidence intervals calculated from the model. You may also want to report other practical measures of error size such as the mean absolute error or mean absolute percentage error.
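The two additional error measures mentioned above are simple to compute from actual and predicted values. This is a minimal sketch with hypothetical numbers; it assumes none of the actual values is zero, since the percentage error is undefined in that case.

```python
def mean_absolute_error(actual, forecast):
    """Average size of the errors, in the units of the dependent variable."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mean_absolute_percentage_error(actual, forecast):
    """Average size of the errors as a percentage of the actual values (no actual may be zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Hypothetical values for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

mae = mean_absolute_error(actual, forecast)
mape = mean_absolute_percentage_error(actual, forecast)
print(mae, round(mape, 2))
```

The mean absolute error is in the same units as the data, while the percentage version is unit-free, which can make it easier to explain to a boss or client.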
What should never happen to you: Don't ever let yourself fall into the trap of fitting (and then promoting!) a regression model that has a respectable-looking R-squared but is actually very much inferior to a simple time series model. If the dependent variable in your model is a nonstationary time series, be sure that you do a comparison of error measures against an appropriate time series model. Remember that what R-squared measures is the proportional reduction in error variance that the regression model achieves in comparison to an intercept-only model (i.e., mean model) fitted to the same dependent variable, but the intercept-only model may not be the most appropriate reference point, and the dependent variable you end up using may not be the one you started with if data transformations turn out to be important.
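Here is a rough sketch of the comparison recommended above: a regression with a respectable-looking R-squared set against a random-walk-with-drift model on the same series. The data are synthetic (a linear trend plus a slow cycle, regressed on a pure trend), the helper names are hypothetical, and degrees-of-freedom corrections are ignored for simplicity.

```python
import math

def rms(values):
    """Root-mean-square of a list of errors (no degrees-of-freedom correction)."""
    return math.sqrt(sum(v ** 2 for v in values) / len(values))

def regression_residuals(x, y):
    """Residuals from an ordinary least squares fit of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]

# Synthetic nonstationary series: trend plus a slow cycle (assumed data, for illustration only).
t = range(120)
x = [float(v) for v in t]
y = [0.5 * v + 5.0 * math.sin(v / 8.0) for v in t]

reg_resid = regression_residuals(x, y)
mean_y = sum(y) / len(y)
r2 = 1.0 - sum(e ** 2 for e in reg_resid) / sum((b - mean_y) ** 2 for b in y)

# Random walk with drift: predict each change to equal the average change.
diffs = [b - a for a, b in zip(y, y[1:])]
drift = sum(diffs) / len(diffs)
rw_resid = [d - drift for d in diffs]

print(round(r2, 3), round(rms(reg_resid), 3), round(rms(rw_resid), 3))
```

In this made-up case the regression's R-squared exceeds 90%, yet its typical error is several times larger than that of the random walk model, which uses nothing but the series' own recent history.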
And finally: R-squared is not the bottom line. You don’t get paid in proportion to R-squared. The real bottom line in your analysis is measured by consequences of decisions that you and others will make on the basis of it. In general, the important criteria for a good regression model are (a) to make the smallest possible errors, in real terms, when predicting the future, and (b) to derive useful inferences from the structure of the model and the estimated values of its parameters.
Updated on September 17, 2014