If you use Excel in your work or in your teaching to any extent, you should check out the latest release of RegressIt, a free Excel add-in for linear and logistic regression. See it at regressit.com. The linear regression version runs on both PCs and Macs and has a richer and easier-to-use interface and much better-designed output than other add-ins for statistical analysis. It may make a good complement if not a substitute for whatever regression software you are currently using, Excel-based or otherwise. RegressIt is an excellent tool for interactive presentations, online teaching of regression, and development of videos of regression modeling examples. It includes extensive built-in documentation and pop-up teaching notes as well as some novel features to support systematic grading and auditing of student work on a large scale. There is a separate logistic regression version with highly interactive tables and charts that runs on PCs. RegressIt also now includes a two-way interface with R that allows you to run linear and logistic regression models in R without writing any code whatsoever.
If you have been using Excel's own Data Analysis add-in for regression (the Analysis ToolPak), this is the time to stop. It has not changed since it was first introduced in 1993, and it was a poor design even then. It's a toy (a clumsy one at that), not a tool for serious work. Visit this page for a discussion: What's wrong with Excel's Analysis ToolPak for regression
Four assumptions of regression
Testing for linearity and additivity of predictive relationships
Testing for independence (lack of correlation) of errors
Testing for homoscedasticity (constant variance) of errors
Testing for normality of the error distribution
There
are four principal assumptions which
justify the use of linear regression models for purposes of inference or
prediction:
(i) linearity and additivity of the relationship between the dependent and independent variables:
    (a) The expected value of the dependent variable is a straight-line function of each independent variable, holding the others fixed.
    (b) The slope of that line does not depend on the values of the other variables.
    (c) The effects of different independent variables on the expected value of the dependent variable are additive.
(ii) statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)
(iii) homoscedasticity (constant variance) of the errors
    (a) versus time (in the case of time series data)
    (b) versus the predictions
    (c) versus any independent variable
(iv) normality of the error distribution.
If any of
these assumptions is violated (i.e., if there are nonlinear relationships
between dependent and independent variables or the errors exhibit correlation,
heteroscedasticity, or non-normality), then the forecasts, confidence
intervals, and scientific insights yielded by a regression model may be (at
best) inefficient or (at worst) seriously biased or misleading. More details of these assumptions, and
the justification for them (or not) in particular cases, is given on the introduction to regression
page.
Ideally your
statistical software will automatically provide charts and statistics that test
whether these assumptions are satisfied for any given model. Unfortunately, many software packages do
not provide such output by default (additional menu commands must be executed
or code must be written) and some (such as Excel’s built-in regression
add-in) offer only limited options.
RegressIt does provide such output and in graphic detail. See this page
for an example of output from a model that violates all of the assumptions
above, yet is likely to be accepted by a naïve user on the basis of a
large value of R-squared, and see this page for an
example of a model that satisfies the assumptions reasonably well, which is
obtained from the first one by a nonlinear transformation of variables. The normal quantile plots from those
models are also shown at the bottom of this page.
You will
sometimes see additional (or different) assumptions listed, such as “the
variables are measured accurately” or “the sample is representative
of the population”, etc.
These are important considerations in any form of statistical modeling,
and they should be given due attention, although they do not refer to
properties of the linear regression equation per se. (Return to top of
page.)
Violations
of linearity or additivity are extremely serious: if you fit a linear model to data
which are nonlinearly or nonadditively related, your predictions are likely to
be seriously in error, especially when you extrapolate beyond the range of the
sample data.
How to
diagnose:
nonlinearity is usually most evident in a plot of observed versus predicted values
or a plot of residuals versus predicted values, which are a part of
standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot or around a horizontal line in the latter plot, with a roughly constant variance. (The residual-versus-predicted plot is better than the observed-versus-predicted plot for this purpose, because it eliminates the visual distraction of a sloping pattern.) Look carefully for evidence of a
"bowed" pattern, indicating that the model makes systematic errors
whenever it is making unusually large or small predictions. In multiple
regression models, nonlinearity or nonadditivity may also be revealed by
systematic patterns in plots of the residuals
versus individual independent variables.
How to
fix:
consider applying a nonlinear transformation to the dependent and/or
independent variables if you can
think of a transformation that seems appropriate. (Don’t just make
something up!) For example, if the data are strictly positive, the log transformation
is an option. (The logarithm base does not matter--all log functions are the same up to linear scaling--although the natural log is usually preferred because small changes in the natural log are equivalent to percentage changes.
See these notes
for more details.) If a log
transformation is applied to the dependent variable only, this is equivalent to
assuming that it grows (or decays) exponentially as a function of the independent
variables. If a log transformation
is applied to both the dependent
variable and the independent variables, this is equivalent to assuming that the
effects of the independent variables are multiplicative
rather than additive in their original units. This means that, on the margin, a
small percentage change in one of the
independent variables induces a proportional percentage change in the expected value of the dependent variable,
other things being equal. Models of
this kind are commonly used in modeling price-demand relationships, as
illustrated on the beer sales
example on this web site.
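To make this concrete, here is a minimal sketch in Python with statsmodels of fitting a log-log model; the data frame and the column names (price, sales) are invented for illustration and are not taken from the beer sales example itself.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data frame with strictly positive variables.
df = pd.DataFrame({
    "price": [3.5, 3.8, 4.0, 4.2, 4.5, 4.8, 5.0, 5.2],
    "sales": [820, 760, 700, 650, 590, 530, 500, 470],
})

# Log-log specification: effects are multiplicative in original units.
X = sm.add_constant(np.log(df["price"]))
y = np.log(df["sales"])
model = sm.OLS(y, X).fit()
print(model.summary())
```

Because both variables are logged, the fitted slope is (approximately) an elasticity: a 1% change in price is associated with a slope-percent change in expected sales, other things being equal.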
Another
possibility to consider is adding another
regressor that is a nonlinear function of one of the other variables. For
example, if you have regressed Y on X, and the graph of residuals versus
predicted values suggests a parabolic curve, then it may make sense to regress
Y on both X and X^2 (i.e., X-squared). The latter transformation is possible even
when X and/or Y have negative values, whereas logging is not. Higher-order terms of this kind (cubic,
etc.) might also be considered in some cases. But don’t get carried away! This sort of "polynomial curve
fitting" can be a nice way to draw a smooth curve through a wavy pattern
of points (in fact, it is a trend-line option for scatterplots in Excel), but it
is usually a terrible way to extrapolate outside the range of the sample
data.
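As a sketch of that idea in Python with statsmodels (X and Y here are simulated stand-ins for your own variables), the squared term is just another column in the regressor matrix:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
y = 1.0 + 2.0 * x - 0.5 * x**2 + rng.normal(scale=0.5, size=x.size)  # curved relationship

# Regress Y on both X and X^2; negative values of X are not a problem here,
# whereas a log transformation would require strictly positive data.
X = sm.add_constant(np.column_stack([x, x**2]))
fit = sm.OLS(y, X).fit()
print(fit.params)  # intercept, linear, and quadratic coefficients
```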
Finally, it
may be that you have overlooked some entirely
different independent variable that explains or corrects for the nonlinear
pattern or interactions among variables that you are seeing in your residual
plots. In that case the shape of the pattern, together with economic or
physical reasoning, may suggest some likely suspects. For example, if the strength of the
linear relationship between Y and X1 depends on the level of some
other variable X2, this could perhaps be addressed by creating a new
independent variable that is the product of X1 and X2. In the case of time series data, if the
trend in Y is believed to have changed at a particular point in time, then the
addition of a piecewise linear trend
variable (one whose string of values looks like 0, 0, …, 0, 1, 2, 3, … ) could be used to fit the kink in the data. Such a variable can be considered as the
product of a trend variable and a dummy variable. Again, though, you need to beware of
overfitting the sample data by throwing in artificially constructed variables
that are poorly motivated. At the
end of the day you need to be able to interpret the model and explain (or sell)
it to others. (Return
to top of page.)
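Both constructions mentioned above -- an interaction term and a piecewise linear trend -- are easy to build by hand. Here is a hypothetical sketch in Python with pandas and numpy; the variable names and the break point are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 12
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})

# Interaction term: allows the effect of x1 to depend on the level of x2.
df["x1_x2"] = df["x1"] * df["x2"]

# Piecewise linear trend: 0 up to a hypothetical break at period 8,
# then 1, 2, 3, ... afterward -- the product of a trend and a dummy variable.
t = np.arange(1, n + 1)
break_point = 8                      # invented break point for illustration
df["trend_after_break"] = np.maximum(t - break_point, 0)

print(df)
```

The new columns would then simply be included as additional regressors in the model.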
Violations
of independence
are potentially very serious in time series regression models: serial
correlation in the errors (i.e., correlation between consecutive errors or errors
separated by some other number of periods) means that there is room for
improvement in the model, and extreme serial correlation is often a symptom of
a badly mis-specified model. Serial correlation (also known as "autocorrelation") is sometimes a byproduct of a violation of the
linearity assumption, as in the case of a simple (i.e., straight) trend line
fitted to data which are growing exponentially over time.
Independence
can also be violated in non-time-series models if errors tend to always have
the same sign under particular conditions, i.e., if the model systematically
underpredicts or overpredicts what will happen when the independent variables
have a particular configuration.
How to
diagnose:
The best test for serial correlation is to look at a residual time series plot (residuals vs. row number) and a table or plot of residual autocorrelations.
(If your software does not provide these by default for time series data, you
should figure out where in the menu or code to find them.) Ideally, most of the
residual autocorrelations should fall within the 95% confidence bands around
zero, which are located at roughly ±2/√n, where n is the sample size. Thus, if the sample size is 50, the
autocorrelations should be between +/- 0.3. If the sample size is 100, they
should be between +/- 0.2. Pay especially close attention to significant
correlations at the first couple of lags and in the vicinity of the seasonal
period, because these are probably not due to mere chance and are also fixable.
The Durbin-Watson statistic provides a test for significant residual
autocorrelation at lag 1: the DW stat is approximately equal to 2(1-a) where a
is the lag-1 residual autocorrelation, so ideally it should be close to
2.0--say, between 1.4 and 2.6 for a sample size of 50.
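Here is a minimal sketch of these two diagnostics in Python with statsmodels; the data are simulated (a straight line fitted to an exponentially growing series) so that the serial correlation is easy to see:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.tsa.stattools import acf

# Hypothetical time-ordered data: a straight line fitted to an exponential trend,
# which produces strong positive serial correlation in the residuals.
t = np.arange(1, 101)
rng = np.random.default_rng(0)
y = np.exp(0.03 * t) * (1 + rng.normal(scale=0.02, size=t.size))
fit = sm.OLS(y, sm.add_constant(t)).fit()

resid = fit.resid
n = len(resid)

# Residual autocorrelations at the first 12 lags, with the rough 95% band +/- 2/sqrt(n).
r = acf(resid, nlags=12, fft=False)
band = 2.0 / np.sqrt(n)
for lag, value in enumerate(r[1:], start=1):
    flag = " <-- outside the +/- 2/sqrt(n) band" if abs(value) > band else ""
    print(f"lag {lag:2d}: {value:+.3f}{flag}")

# Durbin-Watson statistic: approximately 2*(1 - a), where a is the lag-1 autocorrelation.
print("Durbin-Watson:", round(durbin_watson(resid), 2))
```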
How to
fix:
Minor cases of positive serial correlation (say, lag-1 residual
autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between
1.2 and 1.6) indicate that there is some room for fine-tuning in the model.
Consider adding lags of the dependent variable and/or lags of some of the
independent variables. Or, if you have an ARIMA+regressor procedure available
in your statistical software, try adding an AR(1) or
MA(1) term to the regression model.
An AR(1) term adds a lag of the dependent
variable to the forecasting equation, whereas an MA(1) term adds a lag of the
forecast error. If there is significant correlation at lag 2, then a 2nd-order
lag may be appropriate.
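One common way to implement the ARIMA+regressor idea is regression with ARMA errors via statsmodels' SARIMAX, sketched below with simulated data (the series and the regressor are made up):

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical series with one exogenous regressor and serially correlated noise.
rng = np.random.default_rng(0)
n = 120
x = rng.normal(size=n).cumsum()
y = 10 + 0.8 * x + np.convolve(rng.normal(size=n), [1, 0.6], mode="same")

# Regression with an AR(1) error term; use order=(0, 0, 1) instead for an MA(1) term,
# or order=(2, 0, 0) if the lag-2 autocorrelation is also significant.
model = SARIMAX(y, exog=x, order=(1, 0, 0), trend="c")
result = model.fit(disp=False)
print(result.summary())
```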
If there is
significant negative correlation in the residuals (lag-1 autocorrelation
more negative than -0.3 or DW stat greater than 2.6), watch out for the
possibility that you may have overdifferenced some of your variables.
Differencing tends to drive autocorrelations in the negative direction, and too
much differencing may lead to artificial patterns of negative correlation that
lagged variables cannot correct for.
If there is
significant correlation at the seasonal period (e.g. at lag 4 for
quarterly data or lag 12 for monthly data), this indicates that seasonality has
not been properly accounted for in the model. Seasonality can be handled in a
regression model in one of the following ways: (i) seasonally adjust the
variables (if they are not already seasonally adjusted), or (ii) use seasonal
lags and/or seasonally differenced variables (caution: be careful not to
overdifference!), or (iii) add seasonal dummy variables to the model
(i.e., indicator variables for different seasons of the year, such as MONTH=1
or QUARTER=2, etc.) The dummy-variable approach enables additive seasonal
adjustment to be performed as part of the regression model: a different
additive constant can be estimated for each season of the year. If the
dependent variable has been logged, the seasonal adjustment is multiplicative.
(Something else to watch out for: it is possible that although your dependent
variable is already seasonally adjusted, some of your independent variables may
not be, causing their seasonal patterns to leak into the forecasts.)
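A sketch of the dummy-variable approach (option iii) in Python with pandas and statsmodels, using a made-up monthly series; one month is dropped to avoid collinearity with the intercept:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical monthly data indexed by date, with a sinusoidal seasonal pattern.
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"y": 100 + 10 * np.sin(2 * np.pi * idx.month / 12) + rng.normal(scale=2, size=48)},
    index=idx,
)

# Month dummies (drop_first avoids the dummy-variable trap when an intercept is included).
dummies = pd.get_dummies(df.index.month, prefix="month", drop_first=True).astype(float)
dummies.index = df.index

X = sm.add_constant(dummies)
fit = sm.OLS(df["y"], X).fit()
print(fit.params)  # each month coefficient is an additive seasonal offset relative to January
```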
Major
cases
of serial correlation (a Durbin-Watson statistic well below 1.0,
autocorrelations well above 0.5) usually indicate a fundamental structural
problem in the model. You may wish to reconsider the transformations (if any)
that have been applied to the dependent and independent variables. It may help
to stationarize all variables through appropriate combinations of differencing,
logging, and/or deflating.
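As a rough illustration, here is a sketch in Python with pandas of the three operations just mentioned, applied to an invented quarterly series and price index:

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly series that grows exponentially, plus a made-up price index.
idx = pd.date_range("2000-01-01", periods=100, freq="QS")
rng = np.random.default_rng(0)
y = pd.Series(100 * np.exp(0.02 * np.arange(100)) * (1 + rng.normal(scale=0.01, size=100)), index=idx)
cpi = pd.Series(100 * np.exp(0.01 * np.arange(100)), index=idx)

deflated = y / (cpi / 100)     # deflating: convert to constant-dollar terms
logged = np.log(y)             # logging: turns exponential growth into a linear trend
log_diff = np.log(y).diff()    # logging + differencing: approximate period-to-period growth rate
print(log_diff.head())
```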
To test for non-time-series violations of independence,
you can look at plots of the residuals versus independent variables or plots of
residuals versus row number in situations where the rows have been sorted or
grouped in some way that depends (only) on the values of the independent
variables. The residuals should be
randomly and symmetrically distributed around zero under all conditions, and in
particular there should be no
correlation between consecutive errors no matter how the rows are sorted,
as long as it is on some criterion that does not involve the dependent
variable. If this is not true, it
could be due to a violation of the linearity assumption or due to bias that is
explainable by omitted variables (say, interaction terms or dummies for
identifiable conditions).
Violations of homoscedasticity (which are called
"heteroscedasticity") make it difficult to gauge the true standard
deviation of the forecast errors, usually resulting in confidence intervals
that are too wide or too narrow. In particular, if the variance of the errors
is increasing over time, confidence intervals for out-of-sample predictions
will tend to be unrealistically narrow. Heteroscedasticity may also have the
effect of giving too much weight to a small subset of the data (namely the
subset where the error variance was largest) when estimating coefficients.
How to
diagnose:
look at a plot of residuals versus predicted values and, in the case of time series data, a
plot of residuals versus time. Be alert for evidence of residuals that
grow larger either as a function of time or as a function of the predicted
value. To be really thorough, you should also generate plots of residuals versus independent variables to look for consistency there
as well. Because of imprecision in the coefficient estimates, the errors may tend to be slightly larger for forecasts associated with predicted values or independent-variable values that are extreme in either direction, although the effect should not be too dramatic.
What you hope not to see are
errors that systematically get larger in one direction by a significant amount.
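Here is a minimal sketch of those two plots in Python with statsmodels and matplotlib; the data are simulated so that the error size grows with the level of the prediction:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Hypothetical data in which the error size grows with the level of the prediction.
rng = np.random.default_rng(0)
x = np.linspace(1, 100, 200)
y = 5 + 2 * x + rng.normal(scale=0.1 * x)        # errors roughly proportional to the level
fit = sm.OLS(y, sm.add_constant(x)).fit()

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(fit.fittedvalues, fit.resid, s=10)
axes[0].axhline(0, color="gray")
axes[0].set(xlabel="Predicted value", ylabel="Residual", title="Residuals vs. predicted")
axes[1].plot(fit.resid)                          # row order stands in for time here
axes[1].axhline(0, color="gray")
axes[1].set(xlabel="Row (time)", ylabel="Residual", title="Residuals vs. time")
plt.tight_layout()
plt.show()
```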
How to
fix: If the dependent variable is strictly positive and if the
residual-versus-predicted plot shows that the size of the errors is
proportional to the size of the predictions (i.e., if the errors seem
consistent in percentage rather than absolute terms), a log transformation
applied to the dependent variable may be appropriate. In time series models, heteroscedasticity often
arises due to the effects of inflation and/or real compound growth. Some
combination of logging and/or deflating will often stabilize the
variance in this case. Stock market data may show periods of increased or decreased
volatility over time. This is normal and is often modeled with so-called ARCH
(auto-regressive conditional heteroscedasticity) models in which the error
variance is fitted by an autoregressive model. Such models are beyond the scope
of this discussion, but a simple fix would be to work with shorter intervals of
data in which volatility is more nearly constant. Heteroscedasticity can also
be a byproduct of a significant violation of the linearity and/or independence
assumptions, in which case it may also be fixed as a byproduct of fixing those problems.
Seasonal patterns in the data are a
common source of heteroscedasticity in the errors: unexplained variations in the dependent
variable throughout the course of a season may be consistent in percentage
rather than absolute terms, in which case larger errors will be made in seasons
where activity is greater, which will show up as a seasonal pattern of changing
variance on the residual-vs-time plot.
A log transformation is often used to address this problem. For example, if the seasonal pattern is
being modeled through the use of dummy variables for months or quarters of the
year, a log transformation applied to the dependent variable will convert the
coefficients of the dummy variables to multiplicative adjustment factors rather
than additive adjustment factors, and the errors in predicting the logged
variable will be (roughly) interpretable as percentage errors in predicting the
original variable. Seasonal
adjustment of all the data prior to fitting the regression model might be
another option.
If a log
transformation has already been applied to a variable, then (as noted above) additive rather than multiplicative
seasonal adjustment should be used, if it is an option that your software
offers. Additive seasonal
adjustment is similar in principle to including dummy variables for seasons of
the year. Whether or not you should
perform the adjustment outside the model rather than with dummies depends on whether
you want to be able to study the seasonally adjusted data all by itself and on
whether there are unadjusted seasonal patterns in some of the independent
variables. (The dummy-variable
approach would address the latter problem.) (Return to top of page.)
Violations
of normality
create problems for determining whether model coefficients are significantly
different from zero and for calculating confidence intervals for forecasts.
Sometimes the error distribution is "skewed" by the presence of a few
large outliers. Since parameter estimation is based on the minimization of
squared error, a few extreme observations can exert a disproportionate
influence on parameter estimates. Calculation of confidence intervals and
various significance tests for coefficients are all based on the assumptions of
normally distributed errors. If the error distribution is significantly
non-normal, confidence intervals may be too wide or too narrow.
Technically,
the normal distribution assumption is not necessary if you are willing to
assume the model equation is correct and your only goal is to estimate its
coefficients and generate predictions in such a way as to minimize mean squared
error. The formulas for estimating
coefficients require no more than that, and some references on regression
analysis do not list normally distributed errors among the key
assumptions. But generally
we are interested in making inferences about the model and/or estimating the
probability that a given forecast error will exceed some threshold in a
particular direction, in which case distributional assumptions are important. Also, a significant violation of the
normal distribution assumption is often a "red flag" indicating that
there is some other problem with the model assumptions and/or that there are a
few unusual data points that should be studied closely and/or that a better model
is still waiting out there somewhere.
How to
diagnose:
the best test for normally distributed errors is a normal probability plot
or normal quantile plot of the residuals. These are plots of the fractiles of the error distribution versus the fractiles of a normal distribution
having the same mean and variance. If the distribution is normal, the points on
such a plot should fall close to the diagonal reference line. A bow-shaped
pattern of deviations from the diagonal indicates that the residuals have
excessive skewness (i.e., they are not symmetrically distributed, with
too many large errors in one
direction). An S-shaped pattern of deviations indicates that the residuals have
excessive kurtosis--i.e., there are either too many or too few large
errors in both directions. Sometimes the problem is revealed to be
that there are a few data points on one or both ends that deviate significantly
from the reference line ("outliers"), in which case they should get
close attention.
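A sketch of producing such a plot in Python with statsmodels (the residuals here are simulated with a long right tail; with a real model you would pass its residuals instead):

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Made-up residuals with a long right tail, to illustrate a bowed pattern.
rng = np.random.default_rng(0)
resid = rng.gamma(shape=2.0, scale=1.0, size=200) - 2.0

# Points should hug the reference line if the residuals are approximately normal;
# line="s" draws the line using the sample mean and standard deviation.
sm.qqplot(resid, line="s")
plt.title("Normal quantile plot of residuals")
plt.show()
```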
There are
also a variety of statistical tests for
normality, including the Kolmogorov-Smirnov
test, the Shapiro-Wilk
test, the Jarque-Bera
test, and the Anderson-Darling
test. The Anderson-Darling test
(which is the one used by RegressIt) is generally considered to be the best,
because it is specific to the normal distribution (unlike the K-S test) and it
looks at the whole distribution rather than just the skewness and kurtosis
(like the J-B test). But all of
these tests are excessively "picky" in this author’s
opinion. Real data rarely has
errors that are perfectly normally distributed, and it may not be possible to
fit your data with a model whose errors do not violate the normality assumption
at the 0.05 level of significance.
It is usually better to focus more on violations of the other
assumptions and/or the influence of a few outliers (which may be mainly
responsible for violations of normality anyway) and to look at a normal
probability plot or normal quantile plot and draw your own conclusions about whether
the problem is serious and whether it is systematic.
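For reference, here is a sketch of the Anderson-Darling and Shapiro-Wilk tests in Python with scipy, applied to simulated residuals; as noted above, treat the results as a prompt to look at the quantile plot rather than as a verdict:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
resid = rng.normal(size=200)            # substitute your model's residuals here

# Anderson-Darling test against the normal distribution: compare the statistic
# to the critical values at the listed significance levels.
ad = stats.anderson(resid, dist="norm")
print("A-D statistic:", round(ad.statistic, 3))
for crit, level in zip(ad.critical_values, ad.significance_level):
    print(f"  reject at the {level}% level if the statistic exceeds {crit}")

# Shapiro-Wilk test returns a p-value directly.
w, p = stats.shapiro(resid)
print("Shapiro-Wilk W:", round(w, 3), " p-value:", round(p, 3))
```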
Here is an example of a bad-looking normal quantile plot (an S-shaped pattern with P=0 for the A-D stat, indicating highly significant non-normality) from the beer sales analysis on this web site:

[normal quantile plot: S-shaped pattern of deviations from the reference line]

…and here is an example of a good-looking one (a linear pattern with P=0.5 for the A-D stat, indicating no significant departure from normality):

[normal quantile plot: points falling close to the reference line]
How to
fix: violations
of normality often arise either because (a) the distributions of the
dependent and/or independent variables are themselves significantly non-normal,
and/or (b) the linearity assumption is violated. In such cases, a
nonlinear transformation of variables might cure both problems. In the case of
the two normal quantile plots above, the second model was obtained by applying a
natural log transformation to the variables in the first one.
The
dependent and independent variables in a regression model do not need to be
normally distributed by themselves--only the prediction errors need to be
normally distributed. (In fact,
independent variables do not even need to be random, as in the case of trend or
dummy or treatment or pricing variables.)
But if the distributions of some of the variables that are random are extremely asymmetric or
long-tailed, it may be hard to fit them into a linear model whose errors will
be normally distributed, and explaining the shape of their distributions may be
an interesting topic all by itself.
Keep in mind that the normal error assumption is usually justified by
appeal to the central limit theorem, which holds in the case where many random
variations are added together. If
the underlying sources of randomness are not interacting additively, this
argument fails to hold.
Another
possibility is that there are two or more subsets
of the data having different statistical
properties, in which case separate models should be built, or else some
data should merely be excluded, provided that there is some a priori criterion
that can be applied to make this determination.
In some
cases, the problem with the error distribution is mainly due to one or two
very large errors. Such values should be scrutinized closely: are they genuine
(i.e., not the result of data entry errors), are they explainable, are similar
events likely to occur again in the future, and how influential are
they in your model-fitting results? If they are merely errors or if they can be
explained as unique events not likely to be repeated, then you may have cause
to remove them. In some cases, however, it may be that the extreme values in
the data provide the most useful information about values of some of the
coefficients and/or provide the most realistic guide to the magnitudes of
forecast errors. (Return
to top of page.)
Go on to next topic: Additional notes on regression analysis