What’s a good value for R-squared?
If you use Excel
in your work or in your teaching to any extent, you should check out the latest
release of RegressIt, a free Excel add-in for linear and logistic regression.
See it at regressit.com. The linear regression version runs on both PCs and Macs and
has a richer and easier-to-use interface and much better designed output than
other add-ins for statistical analysis. It may make a good complement if not a
substitute for whatever regression software you are currently using,
Excel-based or otherwise. RegressIt is an excellent tool for
interactive presentations, online teaching of regression, and development of
videos of examples of regression modeling. It includes extensive built-in
documentation and pop-up teaching notes as well as some novel features to
support systematic grading and auditing of student work on a large scale. There
is a separate logistic
regression version with
highly interactive tables and charts that runs on PCs. RegressIt also now
includes a two-way
interface with R that allows
you to run linear and logistic regression models in R without writing any code
whatsoever.
If you have
been using Excel's own Data Analysis add-in for regression (the Analysis ToolPak),
this is the time to stop. It has not
changed since it was first introduced in 1993, and it was a poor design even
then. It's a toy (a clumsy one at that), not a tool for serious work. Visit
this page for a discussion: What's wrong with Excel's Analysis ToolPak for regression
· Percent of variance explained vs. percent of standard deviation explained
· An example in which R-squared is a poor guide to analysis
· Guidelines for interpreting R-squared
The
question is often asked: "what's a good value for R-squared?" or
“how big does R-squared need to be for the regression model to be
valid?” Sometimes the claim
is even made: "a model is not useful unless its R-squared is at least
x", where x may be some fraction greater than 50%. The correct response to this question is
polite laughter followed by: "That depends!" A former student of mine landed a job at
a top consulting firm by being the only candidate who gave that answer during
his interview.
R-squared is
the “percent of variance explained” by the model. That is, R-squared is the fraction by which the variance of the errors is less
than the variance of the dependent variable. (The latter number would be the
error variance for a constant-only model, which merely predicts that every
observation will equal the sample mean.)
It is called R-squared because in a simple regression model it is just the square of the correlation between
the dependent and independent variables, which is commonly denoted by
“r”. In a multiple regression model R-squared is
determined by pairwise correlations among all
the variables, including correlations of the independent variables with each
other as well as with the dependent variable. In the latter setting, the square root of
R-squared is known as “multiple R”, and it is equal to the
correlation between the dependent variable and the regression model’s
predictions for it. (Note: if the model does not include a constant, i.e., it is a so-called “regression through the origin”, then R-squared has a different definition. See this page for more
details. You cannot compare R-squared
between a model that includes a constant and one that does not.)
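To make these definitions concrete, here is a minimal Python sketch using numpy and entirely made-up numbers (the variable names are for illustration only and are not part of any software discussed on this page):

```python
import numpy as np

# Made-up data: x is a single predictor, y is the dependent variable.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.0, 5.2, 6.1, 6.8, 8.2])

# Least-squares fit of a simple regression model (with a constant).
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x
errors = y - y_hat

# R-squared: the fraction by which the variance of the errors is less than
# the variance of the dependent variable (the latter being the error
# variance of a constant-only model that predicts the sample mean).
r_squared = 1 - np.var(errors) / np.var(y)

# In simple regression, R-squared is the square of the correlation r
# between x and y; its square root ("multiple R") is also the correlation
# between y and the model's predictions.
r = np.corrcoef(x, y)[0, 1]
multiple_r = np.corrcoef(y, y_hat)[0, 1]

print(r_squared, r ** 2, multiple_r ** 2)  # all three coincide
```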
Generally it is better to look
at adjusted
R-squared rather than R-squared and to look at the standard error of the regression
rather than the standard deviation of the errors. These are unbiased estimators that correct for the sample size and number of
coefficients estimated. Adjusted R-squared is always smaller than R-squared,
but the difference is usually very small unless you are trying to estimate too
many coefficients from too small a sample in the presence of too much noise.
Specifically, adjusted R-squared is
equal to 1 minus (n − 1)/(n − k − 1) times
1-minus-R-squared, where n is the sample size and k is the number of
independent variables. (It is
possible that adjusted R-squared is negative if the model is too complex
for the sample size and/or the independent variables have too little predictive
value, and some software just reports that adjusted R-squared is zero in that
case.) Adjusted R-squared bears the same relation to the standard error of the
regression that R-squared bears to the standard deviation of the errors: one
necessarily goes up when the other goes down for models fitted to the same
sample of the same dependent variable.
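In code, the formula above is a one-liner. A sketch with hypothetical numbers:

```python
def adjusted_r_squared(r_squared: float, n: int, k: int) -> float:
    """Adjusted R-squared: 1 - (n - 1)/(n - k - 1) * (1 - R-squared),
    where n is the sample size and k the number of independent variables."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r_squared)

print(adjusted_r_squared(0.75, n=30, k=3))   # about 0.721
print(adjusted_r_squared(0.75, n=300, k=3))  # about 0.747: the penalty fades as n grows
print(adjusted_r_squared(0.10, n=12, k=5))   # negative: too many coefficients, too little data
```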
Now, what is the relevant variance that requires
explanation, and how much or how little explanation is necessary or useful? There is a huge range of
applications for linear regression analysis in science, medicine, engineering,
economics, finance, marketing, manufacturing, sports, etc. In some situations
the variables under consideration have very strong and intuitively obvious
relationships, while in other situations you may be looking for very weak
signals in very noisy data. The
decisions that depend on the analysis could have either narrow or wide margins
for prediction error, and the stakes could be small or large. For example, in medical research,
a new drug treatment might have highly variable effects on individual patients,
in comparison to alternative treatments, and yet have statistically significant
benefits in an experimental study of thousands of subjects. That is to say, the amount of variance
explained when predicting individual outcomes could be small, and yet the
estimates of the coefficients that measure the drug’s effects could be
significantly different from zero (as measured by low P-values) in a large
sample. A result like this could
save many lives over the long run and be worth millions of dollars in profits
if it results in the drug’s approval for widespread use.
Even in the
context of a single statistical decision problem, there may be many ways to
frame the analysis, resulting in different standards and expectations for the
amount of variance to be explained in the linear regression stage. We have seen by now that there are many transformations
that may be applied to a variable before it is used as a dependent variable in
a regression model: deflation, logging, seasonal adjustment, differencing. All
of these transformations will change the variance and may also change the units
in which variance is measured. Logging completely changes the units of measurement:
roughly speaking, the error measures become percentages rather than absolute
amounts, as explained here.
Deflation and seasonal adjustment also change the units of measurement, and
differencing usually reduces the variance dramatically when applied to
nonstationary time series data. Therefore, if the dependent variable in the
regression model has already been transformed in some way, it is possible that
much of the variance has already been "explained" merely by that
process. With respect to which
variance should improvement be measured in such cases: that of the original
series, the deflated series, the seasonally adjusted series, the differenced
series, or the logged series? You cannot meaningfully compare R-squared
between models that have used different transformations of the dependent
variable, as the example below will illustrate.
Moreover,
variance is a hard quantity to think about because it is measured in squared units (dollars squared, beer
cans squared…). It is easier
to think in terms of standard deviations,
because they are measured in the same units as the variables and they directly
determine the widths of confidence intervals. So, it is instructive to also consider
the “percent of standard deviation
explained,” i.e., the percent by which the standard deviation of the
errors is less than the standard deviation of the dependent variable. This is equal to one minus the square root of 1-minus-R-squared. Here is a table that shows the conversion:

   R-squared (% of variance explained)     % of standard deviation explained
   10%                                      5.1%
   25%                                     13.4%
   50%                                     29.3%
   75%                                     50.0%
   90%                                     68.4%
   95%                                     77.6%
   99%                                     90.0%
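The conversion is easy to reproduce. A minimal Python sketch, using only the standard library:

```python
import math

def pct_of_sd_explained(r_squared: float) -> float:
    """Percent of standard deviation explained: one minus the square root
    of 1-minus-R-squared."""
    return 1 - math.sqrt(1 - r_squared)

for r2 in (0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99):
    print(f"R-squared {r2:4.0%}  ->  {pct_of_sd_explained(r2):5.1%} of SD explained")
```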
For example,
if the model’s R-squared is 90%, the variance of its errors is 90% less
than the variance of the dependent variable and the standard deviation of its
errors is 68% less than the standard deviation of the dependent variable. That is, the standard deviation of the
regression model’s errors is about 1/3 the size of the standard deviation
of the errors that you would get with a constant-only model. That’s very good, but it
doesn’t sound quite as impressive as “NINETY PERCENT
EXPLAINED!”.
If the
model’s R-squared is 75%, the standard deviation of the errors is exactly
one-half of the standard deviation of the dependent variable. Now, suppose that the addition of
another variable or two to this model increases R-squared to 76%. That’s better, right? Well, by the
formula above, this increases the percent of standard deviation explained from
50% to 51%, which means the standard deviation of the errors is reduced from
50% of that of the constant-only model to 49%, a shrinkage of 2% in relative
terms. Confidence intervals for
forecasts produced by the second model would therefore be about 2% narrower
than those of the first model, on average, not enough to notice on a
graph. You should ask
yourself: is that worth the
increase in model complexity?
An increase
in R-squared from 75% to 80% would reduce the error standard deviation by about
10% in relative terms. That begins
to rise to the level of a perceptible reduction in the widths of confidence
intervals. But don’t forget, confidence intervals are realistic guides to
the accuracy of predictions only if the
model’s assumptions are correct.
When adding more variables
to a model, you need to think about the cause-and-effect assumptions that
implicitly go with them, and you should also look at how their addition changes
the estimated coefficients of other variables. Do they become easier to explain, or
harder? And do the residual stats
and plots indicate that the model’s assumptions are OK? If they aren’t, then you
shouldn’t be obsessing over small improvements in R-squared anyway. Your problems lie elsewhere.
Another handy rule of thumb: for small values (R-squared less than 25%), the percent of standard deviation explained is roughly one-half of the percent of variance explained. (This follows from the first-order approximation 1 − √(1−x) ≈ x/2 for small x.) So, for example, a model with an R-squared of 10% yields errors that are roughly 5% smaller than those of a constant-only model, on average.
How big an R-squared is “big
enough”, or cause for celebration or despair? That depends on the decision-making
situation, and it depends on your objectives or needs, and it depends on how
the dependent variable is defined.
In some situations it might be reasonable to hope and expect to explain
99% of the variance, or equivalently 90% of the standard deviation of the
dependent variable. In other cases,
you might consider yourself to be doing very well if you explained 10% of the
variance, or equivalently 5% of the standard deviation, or perhaps even
less. The following section gives
an example that highlights these issues.
If you want to skip the example and go straight to the concluding
comments, click here.
An example in which
R-squared is a poor guide to analysis:
Consider the U.S. monthly auto sales series that was used for
illustration in the first chapter of these notes, whose graph is reproduced
here:
The units
are $billions and the date range shown here is from January 1970 to February
1996. Suppose that the objective of
the analysis is to predict monthly auto sales from monthly total personal
income. I am using these variables
(and this antiquated date range) for two reasons: (i) this very (silly) example was used
to illustrate the benefits of regression analysis in a textbook that I was
using in that era, and (ii) I have seen many students undertake self-designed
forecasting projects in which they have blindly fitted regression models using
macroeconomic indicators such as personal income, gross domestic product,
unemployment, and stock prices as predictors of nearly everything, the logic being that they reflect the
general state of the economy and therefore have implications for every kind of
business activity. Perhaps so, but
the question is whether they do it in a linear,
additive fashion that stands out against the background noise in the
variable that is to be predicted, and whether they adequately explain time patterns in the data, and whether they
yield useful predictions and
inferences in comparison to other ways in which you might choose to spend your
time.
The
corresponding graph of personal income (also in $billions) looks like this:
There is no
seasonality in the income data. In
fact, there is almost no pattern in it at all except for a trend that increased
slightly in the earlier years. (This
is not a good sign if we hope to get forecasts that have any specificity.) By comparison, the seasonal
pattern is the most striking feature in the auto sales, so the first thing that
needs to be done is to seasonally adjust
the latter. Seasonally adjusted
auto sales (independently obtained from the same government source) and
personal income line up like this when plotted on the same graph:
The strong and
generally similar-looking trends suggest that we will get a very high value of
R-squared if we regress sales on income, and indeed we do. Here is the summary table for that
regression:
Adjusted R-squared is
almost 97%! However, a result like
this is to be expected when regressing a strongly trended series on any other strongly trended series, regardless
of whether they are logically related.
Here are the line fit plot and residuals-vs-time plot for the model:
The residual-vs-time
plot indicates that the model has some terrible problems. First, there is very
strong positive autocorrelation in the
errors, i.e., a tendency to make the same error many times in a row. In fact, the lag-1 autocorrelation is
0.77 for this model. It is clear
why this happens: the two curves do
not have exactly the same shape.
The trend in the auto sales series tends to vary over time while the
trend in income is much more consistent, so the two variables get out of sync
with each other. This is typical of
nonstationary time series data. Second, the
model’s largest errors have occurred in the more recent years and
especially in the last few months (at the “business end” of the
data, as I like to say), which means that we should expect the next few errors
to be huge too, given the strong positive correlation between consecutive
errors. And finally, the local variance of the errors increases
steadily over time. The reason for this is that random variations in auto
sales (like most other measures of macroeconomic activity) tend to be
consistent over time in percentage
terms rather than absolute terms, and the absolute level of the series has
risen dramatically due to a combination of inflationary growth and real
growth. As the level has grown, the
variance of the random fluctuations has grown with it. Confidence intervals for forecasts in
the near future will therefore be way too narrow, being based on average error
sizes over the whole history of the series. So, despite the high value of
R-squared, this is a very bad
model.
One way to try to
improve the model would be to deflate both
series first. This would at least
eliminate the inflationary component of growth, which hopefully will make the
variance of the errors more consistent over time. Here is a time series plot showing auto
sales and personal income after they have been deflated by dividing them by the
U.S. all-product consumer price index (CPI) at each point in time, with the CPI
normalized to a value of 1.0 in February 1996 (the last row of the data). This does indeed flatten out the trend
somewhat, and it also brings out some fine detail in the month-to-month
variations that was not so apparent on the original plot. In particular, we begin to see some
small bumps and wiggles in the income data that roughly line up with larger
bumps and wiggles in the auto sales data.
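The deflation step itself is easy to script. Here is a hedged pandas sketch, assuming two hypothetical monthly series aligned on the same index (the names are placeholders, not the actual data files used on this page):

```python
import pandas as pd

def deflate(nominal: pd.Series, cpi: pd.Series) -> pd.Series:
    """Convert a nominal series to constant dollars of the final period:
    normalize the CPI to 1.0 in its last row, then divide."""
    return nominal / (cpi / cpi.iloc[-1])

# Usage (hypothetical): auto_sales and cpi share a monthly DatetimeIndex.
# auto_sales_1996_dollars = deflate(auto_sales, cpi)
```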
If we fit a simple
regression model to these two variables, the following results are obtained:
Adjusted R-squared is
only 0.788 for this model, which is worse, right? Well, no. We “explained” some of the variance
in the original data by deflating it prior to fitting this model. Because the dependent variables are not
the same, it is not appropriate to do a head-to-head comparison of R-squared. Arguably this is a better model, because
it separates out the real growth in sales from the inflationary growth, and
also because the errors have a more consistent variance over time. (The latter issue is not the bottom
line, but it is a step in the direction of fixing the model assumptions.) Most interestingly, the deflated income
data shows some fine detail that matches up with similar patterns in the sales
data. However, the error variance
is still a long way from being constant over the full two-and-a-half decades, and
the problems of badly autocorrelated errors and a particularly bad fit to the
most recent data have not been solved.
Another
statistic that we might be tempted to compare between these two models is the
standard error of the regression, which normally is the best bottom-line
statistic to focus on. The second
model’s standard error is much larger: 3.253 vs. 2.218 for the first
model. But wait… these two numbers cannot be directly
compared, either, because they are not measured in the same units. The standard error of the first model is
measured in units of current dollars,
while the standard error of the second model is measured in units of 1996 dollars. Those were decades of high inflation,
and 1996 dollars were not worth nearly as much as dollars were worth in the
earlier years. (In fact, a 1996
dollar was only worth about one-quarter of a 1970 dollar.)
The slope
coefficients in the two models are also of interest. Because the units of the dependent
and independent variables are the same in each model (current dollars in the
first model, 1996 dollars in the second model), the slope coefficient can be interpreted as the predicted increase in
dollars spent on autos per dollar of increase in income. The slope coefficients in the two models
are nearly identical: 0.086 and
0.087, implying that on the margin, 8.6% to 8.7% of additional income is spent
on autos.
Let’s now try
something totally different: fitting a simple time series model to the deflated
data. In particular, let’s
fit a random-walk-with-drift model,
which is logically equivalent to fitting a constant-only model to the first difference (period to period
change) in the original series. Let
the differenced series be called AUTOSALES_SADJ_1996_DOLLARS_DIFF1 (which is
the name that would be automatically assigned in RegressIt). Notice that we are now 3 levels deep in
data transformations: seasonal
adjustment, deflation, and differencing!
This sort of situation is very common in time series analysis. Here are the results of fitting this
model, in which AUTOSALES_SADJ_1996_DOLLARS_DIFF1 is the dependent variable
and there are no independent variables, just the constant. This model merely predicts that each
monthly difference will be the same, i.e., it predicts constant growth relative
to the previous month’s value.
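If you want to replicate this fit outside of RegressIt, a random-walk-with-drift model reduces to estimating the mean of the first difference. A hedged pandas sketch (the series name is hypothetical):

```python
import pandas as pd

def random_walk_with_drift(series: pd.Series):
    """Fit a random-walk-with-drift model: a constant-only regression on
    the first difference of the series."""
    diff = series.diff().dropna()  # period-to-period change
    drift = diff.mean()            # the fitted constant (mean monthly change)
    # Standard error of a constant-only regression: the sample standard
    # deviation of the differences (one coefficient estimated, so ddof=1).
    std_err = diff.std(ddof=1)
    # One-step-ahead forecast: last observed value plus the drift.
    forecast = series.iloc[-1] + drift
    return drift, std_err, forecast
```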
Adjusted R-squared
has dropped to zero! This is not a
problem: a constant-only regression always has an R-squared of zero, but that
doesn’t necessarily imply that it is not a good model for the particular
dependent variable that has been used. We should look instead at the
standard error of the regression.
The units and sample of the dependent variable are the same for this
model as for the previous one, so their regression standard errors can be
legitimately compared. (The
sample size for the second model is actually 1 less than that of the first
model due to the lack of a period-zero value for computing a period-1 difference,
but this is insignificant in such a large data set.) The regression standard error of this
model is only 2.111, compared to 3.253 for the previous one, a reduction of
roughly one-third, which is a very significant improvement. (The residual-vs-time plots for this model and the previous one have the same vertical scaling: look at them both and compare the size
of the errors, particularly those that have occurred recently.) The reason why this model’s
forecasts are so much more accurate is that it looks at last month’s actual sales values, whereas the previous model
only looked at personal income data.
It is often the case that the best
information about where a time series is going to go next is where it has been
lately.
There is no line fit
plot for this model, because there is no independent variable, but here is the
residual-versus-time plot:
These residuals look
quite random to the naked eye, but they actually exhibit negative autocorrelation, i.e., a tendency to alternate between
overprediction and underprediction from one month to the next. (The lag-1 autocorrelation here is
-0.356.) This often happens when
differenced data is used, but overall the errors of this model are much closer
to being independently and identically distributed than those of the previous
two, so we can have a good deal more confidence in any confidence intervals for
forecasts that may be computed from it.
Of course, this model does not shed light on the relationship between
personal income and auto sales.
So, what is the
relationship between auto sales and personal income? That is a complex question and it will
not be further pursued here except to note that there are some other simple things
we could do besides fitting a regression model. For example, we could compute the percentage of income spent on automobiles
over time, i.e., just divide the auto sales series by the personal income
series and see what the pattern looks like. Here is the resulting picture:
This chart nicely
illustrates cyclical variations in the fraction of income spent on autos, which
would be interesting to try to match up with other explanatory variables. The range is from about 7% to about 10%,
which is generally consistent with the slope coefficients that were obtained in
the two regression models (8.6% and 8.7%).
However, this chart re-emphasizes what was seen in the residual-vs-time
charts for the simple regression models:
the fraction of income spent on autos is not consistent over time. In particular, notice that the fraction
was increasing toward the end of the sample, exceeding 10% in the last month.
The bottom line here
is that R-squared was not of any use in
guiding us through this particular analysis toward better and better models. In fact, among the models considered
above, the worst one had an R-squared of 97% and the best one had an R-squared
of zero. At various stages of the
analysis, data transformations were suggested: seasonal adjustment, deflating,
differencing. (Logging was not
tried here, but would have been an alternative to deflation.) And every time the dependent variable is
transformed, it becomes impossible to make meaningful before-and-after
comparisons of R-squared.
Furthermore, regression was probably not even the best tool to use here
in order to study the relation between the two variables. It is not a “universal
wrench” that should be used on every problem.
So,
what IS a good value for R-squared?
It
depends on the variable with respect to which you measure it, it depends on the
units in which that variable is measured and whether any data transformations
have been applied, and it depends on the decision-making context. If the dependent variable is a
nonstationary (e.g., trending or random-walking) time series, an R-squared
value very close to 1 (such as the 97% figure obtained in the first model
above) may not be very impressive.
In fact, if R-squared is very close to 1, and the data consists of time
series, this is usually a bad sign rather than a good one: there will often be significant time
patterns in the errors, as in the example above. On the other hand, if the dependent
variable is a properly stationarized series (e.g., differences or percentage
differences rather than levels), then an R-squared of 25% may be quite good. In
fact, an R-squared of 10% or even less could have some information value when
you are looking for a weak signal in the presence of a lot of noise in a
setting where even a very weak one
would be of general interest. Sometimes there is a lot of value in explaining
only a very small fraction of the variance, and sometimes there isn't. Data
transformations such as logging or deflating also change the interpretation and
standards for R-squared, inasmuch as they change the variance you start out
with.
However, be very careful when evaluating a model
with a low value of R-squared.
In such a situation: (i) it
is better if the set of variables in the model is determined a priori (as in
the case of a designed experiment or a test of a well-posed hypothesis) rather
than by searching among a lineup of randomly selected suspects; (ii) the data should
be clean (not contaminated by outliers, inconsistent measurements, or
ambiguities in what is being measured, as in the case of poorly worded surveys
given to unmotivated subjects); (iii) the coefficient estimates should be
individually or at least jointly significantly different from zero (as measured
by their P-values and/or the P-value of the F statistic), which may require a
large sample to achieve in the presence of low correlations; and (iv) it is a
good idea to do cross-validation
(out-of-sample testing) to see if the model performs about equally well on data
that was not used to identify or estimate it, particularly when the structure
of the model was not known a priori.
It is easy to find spurious (accidental) correlations if you go on a
fishing expedition in a large pool of candidate independent variables while
using low standards for acceptance.
I have often had students use this approach to try to predict stock
returns using regression models (which I do not recommend), and it is not
uncommon for them to find models that yield R-squared values in the range of 5%
to 10%, but they virtually never survive out-of-sample testing. (You should buy index funds
instead.)
There are a variety
of ways in which to cross-validate a model. A discussion of some of them can be
found here. If your software doesn’t offer
such options, there are simple tests you can conduct on your own. One is to split the data set in half and
fit the model separately to both halves to see if you get similar results in
terms of coefficient estimates and adjusted R-squared.
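That split-half check is easy to script on your own. A rough numpy sketch (an illustration, not a substitute for proper cross-validation tools):

```python
import numpy as np

def split_half_fits(X: np.ndarray, y: np.ndarray):
    """Fit OLS (with a constant) separately to each half of the sample and
    return the two coefficient vectors, as a crude stability check."""
    mid = len(y) // 2
    fits = []
    for rows in (slice(None, mid), slice(mid, None)):
        design = np.column_stack([np.ones(len(y[rows])), X[rows]])
        beta, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        fits.append(beta)
    return fits  # broadly similar coefficients across halves are a good sign
```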
When working with time series data, if you compare the
standard deviation of the errors of a regression model which uses exogenous
predictors against that of a simple time series model (say, an autoregressive
or exponential smoothing or random walk model), you may be disappointed by what
you find. If the variable to be
predicted is a time series, it will often be the case that most of the
predictive power is derived from its own history via lags, differences, and/or
seasonal adjustment. This is the reason why we spent some time studying the
properties of time series models before tackling regression models.
A rule of thumb for small values of R-squared: If R-squared is small (say 25% or less),
then the fraction by which the standard deviation of the errors is less than
the standard deviation of the dependent variable is approximately one-half of R-squared, as shown in the table above. So,
for example, if your model has an R-squared of 10%, then its errors are only
about 5% smaller on average than those of a constant-only model, which merely
predicts that everything will equal the mean. Is that enough to be useful, or
not? Another handy reference point: if the model has an R-squared of 75%,
its errors are 50% smaller on average than those of a constant-only model.
(This is not an approximation: it
follows directly from the fact that reducing the error standard deviation to
½ of its former value is equivalent to reducing its variance to ¼
of its former value.)
In general you should
look at adjusted R-squared rather than
R-squared. Adjusted R-squared
is an unbiased estimate of the
fraction of variance explained, taking into account the sample size and number
of variables. Usually adjusted
R-squared is only slightly smaller than R-squared, but it is possible for
adjusted R-squared to be zero or negative if a model with insufficiently
informative variables is fitted to too small a sample of data.
What measure of your
model's explanatory power should you report to your boss or client or
instructor? If
you used regression analysis, then to be perfectly candid you should of course
include the adjusted R-squared for the regression model that was actually
fitted (whether to the original data or some transformation thereof), along
with other details of the output, somewhere in your report. You should more strongly emphasize the standard error of the regression,
though, because that measures the predictive accuracy of the model in real
terms, and it scales the width of all confidence intervals calculated from the
model. You may also want to report
other practical measures of error size such as the mean absolute error or mean
absolute percentage error and/or mean absolute scaled error.
What should never
happen to you: Don't
ever let yourself fall into the trap of fitting (and then promoting!) a
regression model that has a respectable-looking R-squared but is actually very
much inferior to a simple time series model. If the dependent variable in your
model is a nonstationary time series, be sure that you do a comparison of error
measures against an appropriate time series model. Remember that what R-squared measures is
the proportional reduction in error variance that the regression model achieves in comparison to a constant-only model
(i.e., mean model) fitted to the same dependent variable, but the
constant-only model may not be the most appropriate reference point, and the
dependent variable you end up using may not be the one you started with if data
transformations turn out to be important.
And finally: R-squared is not the bottom line. You
don’t get paid in proportion to R-squared. The real bottom line in your analysis is
measured by consequences of decisions that you and others will make on the
basis of it. In general, the
important criteria for a good regression model are (a) to make the smallest
possible errors, in practical terms,
when predicting what will happen in the
future, and (b) to derive useful inferences from the structure of the model
and the estimated values of its parameters.
Go on to next topic: How to compare models