Testing the
assumptions of linear regression

***Latest news:** If you are at least
a part-time user of Excel, you should check out the new release of RegressIt, a free add-in developed by
the author of this site. See it at http://regressit.com. The linear regression
version runs on both PCs and Macs and has a richer and easier-to-use interface
and much better designed output than other add-ins for statistical analysis. It
may make a good complement if not a substitute for whatever regression software
you are currently using, Excel-based or otherwise. (If you have been using Excel's
Analysis ToolPak for regression, this is the time to stop.) RegressIt now includes
a two-way interface with R
that allows you to run linear and logistic regression models in R without
writing any code whatsoever. It also includes extensive built-in documentation
and pop-up teaching notes. There is a separate logistic regression version with
interactive tables and charts that runs on PCs.*

Four assumptions of
regression

Testing for linearity and additivity
of predictive relationships

Testing for
independence (lack of correlation) of errors

Testing for
homoscedasticity (constant variance) of errors

Testing for normality of
the error distribution

There
are **four principal assumptions** which
justify the use of linear regression models for purposes of inference or
prediction:

**(i) linearity and additivity** of the relationship
between dependent and independent variables:

(a) The expected value of the dependent variable is a straight-line function of
each independent variable, holding the others fixed.

(b) The slope of that line does not depend on the values of the other
variables.

(c) The effects of different
independent variables on the expected value of the dependent variable are
additive.

**(ii) statistical independence** of the errors (in particular, no
correlation between consecutive errors in the case of time series data)

**(iii) homoscedasticity** (constant variance) of the errors

(a) versus time (in the case of time series data)

(b) versus the predictions

(c) versus
any independent variable

**(iv) normality** of the error distribution.

If any of
these assumptions is violated (i.e., if there are nonlinear relationships
between dependent and independent variables or the errors exhibit correlation,
heteroscedasticity, or non-normality), then the forecasts, confidence
intervals, and scientific insights yielded by a regression model may be (at
best) inefficient or (at worst) seriously biased or misleading. More details of these assumptions, and
the justification for them (or not) in particular cases, are given on the introduction to regression
page.

Ideally your
statistical software will automatically provide charts and statistics that test
whether these assumptions are satisfied for any given model. Unfortunately, many software packages do
not provide such output by default (additional menu commands must be executed
or code must be written) and some (such as Excel’s built-in regression
add-in) offer only limited options.
RegressIt does provide such output and in graphic detail. See this page
for an example of output from a model that violates all of the assumptions
above, yet is likely to be accepted by a naïve user on the basis of a
large value of R-squared, and see this page for an
example of a model that satisfies the assumptions reasonably well, which is
obtained from the first one by a nonlinear transformation of variables. The normal quantile plots from those
models are also shown at the bottom of this page.

You will
sometimes see additional (or different) assumptions listed, such as “the
variables are measured accurately” or “the sample is representative
of the population”, etc.
These are important considerations in any form of statistical modeling,
and they should be given due attention, although they do not refer to
properties of the linear regression equation per se.

**Violations
of linearity or additivity** are extremely serious: if you fit a linear model to data
which are nonlinearly or nonadditively related, your predictions are likely to
be seriously in error, especially when you extrapolate beyond the range of the
sample data.

**How to diagnose**: nonlinearity is
usually most evident in a plot of **observed versus predicted values**
or a plot of **residuals versus predicted values**, which are a part of
standard regression output. The points should be symmetrically distributed around
a diagonal line in the former plot or around a horizontal line in the latter
plot, with a roughly constant variance.
(The residual-versus-predicted plot is better than the
observed-versus-predicted plot for this purpose, because it eliminates the visual
distraction of a sloping pattern.)
Look carefully for evidence of a "bowed" pattern, indicating
that the model makes systematic errors whenever it is making unusually large or
small predictions. In multiple regression models, nonlinearity or nonadditivity
may also be revealed by systematic patterns in plots of the **residuals versus individual independent
variables**.
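
As a rough numerical companion to the residual plots described above, the following sketch fits a straight line to synthetic data that are actually quadratic and then averages the residuals within the low, middle, and high thirds of the predicted values. The data and variable names are assumptions made up for illustration. A "bowed" pattern shows up as averages that change sign from one third to the next.

```python
import numpy as np

# Synthetic data (an assumption for illustration): Y is truly quadratic in X.
x = np.linspace(0.0, 10.0, 50)
y = 2.0 + 0.5 * x**2

# Fit the straight-line model Y = b0 + b1*X by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
residuals = y - predicted

# Sort the rows by predicted value, split them into thirds,
# and average the residuals within each third.
order = np.argsort(predicted)
low, mid, high = np.array_split(residuals[order], 3)
print("mean residual, low predictions: ", low.mean())
print("mean residual, mid predictions: ", mid.mean())
print("mean residual, high predictions:", high.mean())
```

Here the model underpredicts at both extremes and overpredicts in the middle, which is the systematic error that a bowed residual plot reveals. With real data the same comparison would be noisier, so the plot itself remains the primary diagnostic.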

**How to
fix:**
consider applying a *nonlinear transformation* to the dependent and/or
independent variables *if* you can
think of a transformation that seems appropriate. (Don’t just make
something up!) For example, if the data are strictly positive, the log
transformation is an option. (The
logarithm base does not matter--all log functions are the same up to linear
scaling--although the natural log is usually preferred because small changes in
the natural log are approximately equal to percentage changes. See these notes for more
details.) If a log transformation
is applied to the dependent variable only, this is equivalent to assuming that
it grows (or decays) exponentially as a function of the independent
variables. If a log transformation
is applied to *both* the dependent
variable and the independent variables, this is equivalent to assuming that the
effects of the independent variables are *multiplicative*
rather than additive in their original units. This means that, on the margin, a
small *percentage* change in one of the
independent variables induces a proportional *percentage* change in the expected value of the dependent variable,
other things being equal. Models of
this kind are commonly used in modeling price-demand relationships, as
illustrated on the beer sales
example on this web site.
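
To make the log-log case concrete, here is a minimal sketch that regresses the log of a hypothetical sales variable on the log of a hypothetical price variable. The numbers are invented so that sales follow an exact power law with a price elasticity of -2; they are not the beer sales data from this site.

```python
import numpy as np

# Hypothetical data (an assumption): sales = 500 * price**(-2),
# i.e., a 1% increase in price goes with roughly a 2% drop in sales.
price = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
sales = 500.0 * price ** -2.0

# Regress log(sales) on log(price) by ordinary least squares.
X = np.column_stack([np.ones_like(price), np.log(price)])
(intercept, elasticity), *_ = np.linalg.lstsq(X, np.log(sales), rcond=None)

print("estimated elasticity:", elasticity)
print("multiplicative constant in original units:", np.exp(intercept))
```

The slope coefficient in the logged model is the elasticity, and exponentiating the intercept recovers the multiplicative constant in the original units.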

Another
possibility to consider is adding *another
regressor* that is a nonlinear function of one of the other variables. For
example, if you have regressed Y on X, and the graph of residuals versus
predicted values suggests a parabolic curve, then it may make sense to regress
Y on both X and X^2 (i.e., X-squared). The latter transformation is possible
even when X and/or Y have negative values, whereas logging is not. Higher-order terms of this kind (cubic,
etc.) might also be considered in some cases. But don’t get carried away! This sort of "polynomial curve
fitting" can be a nice way to draw a smooth curve through a wavy pattern
of points (in fact, it is a trend-line option on scatterplots in Excel), but it
is usually a terrible way to extrapolate outside the range of the sample
data.
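
The idea of adding X-squared as a regressor can be sketched as follows. The data are synthetic (an assumption for illustration) and include negative values of X, where logging would not be possible. The helper function `fit` is a made-up name, not part of any library.

```python
import numpy as np

# Synthetic data (an assumption): an exact quadratic relationship,
# with X taking both negative and positive values.
x = np.linspace(-5.0, 5.0, 41)
y = 1.0 + 2.0 * x - 0.3 * x**2

def fit(columns, target):
    """Least-squares fit with an intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones_like(target)] + columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, target - X @ coef

coef_lin, resid_lin = fit([x], y)           # Y on X only
coef_quad, resid_quad = fit([x, x**2], y)   # Y on X and X^2

print("largest |residual|, linear model:   ", np.abs(resid_lin).max())
print("largest |residual|, quadratic model:", np.abs(resid_quad).max())
print("quadratic coefficients:", coef_quad)
```

The squared term removes the parabolic pattern in the residuals within the sample range. It says nothing about how well the fitted curve extrapolates outside that range.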

Finally, it
may be that you have overlooked some *entirely
different independent variable* that explains or corrects for the nonlinear
pattern or interactions among variables that you are seeing in your residual
plots. In that case the shape of the pattern, together with economic or
physical reasoning, may suggest some likely suspects. For example, if the strength of the
linear relationship between Y and X_{1} depends on the level of some
other variable X_{2}, this could perhaps be addressed by creating a new
independent variable that is the product of X_{1} and X_{2}. In the case of time series data, if the
trend in Y is believed to have changed at a particular point in time, then the
addition of a *piecewise linear* trend
variable (one whose string of values looks like 0, 0, …, 0, 1, 2, 3,
… ) could be used to fit the kink in the data. Such a variable can be considered as the
product of a trend variable and a dummy variable. Again, though, you need to beware of
overfitting the sample data by throwing in artificially constructed variables
that are poorly motivated. At the
end of the day you need to be able to interpret the model and explain (or sell)
it to others.
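
The piecewise linear trend variable described above can be built as the product of a dummy variable and a trend measured from the break point. The sketch below uses a hypothetical series of 12 periods with a break after period 8; the break location and the coefficients are assumptions chosen for illustration.

```python
import numpy as np

n = 12
break_at = 8                          # hypothetical period after which the trend changes
t = np.arange(1, n + 1)               # ordinary trend variable: 1, 2, ..., n
after = (t > break_at).astype(int)    # dummy: 0 up to the break, 1 afterwards

# Piecewise linear trend = dummy * (trend counted from the break point).
kink = after * (t - break_at)
print(kink.tolist())   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, 4]

# Synthetic Y whose slope rises from 1.0 to 3.0 after the break.
y = 10.0 + 1.0 * t + 2.0 * kink

X = np.column_stack([np.ones(n), t, kink])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)   # intercept, base slope, change in slope after the break
```

The coefficient on the kink variable estimates the change in slope, so it is directly interpretable. An interaction between two ordinary predictors is constructed the same way, as the elementwise product of the two columns.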

**Violations
of independence**
are potentially very serious in *time series regression *models: serial
correlation in the errors (i.e., correlation between consecutive errors or
errors separated by some other number of periods) means that there is room for
improvement in the model, and extreme serial correlation is often a symptom of
a badly mis-specified model. Serial correlation (also known as "autocorrelation")
is sometimes a byproduct of a violation of the linearity assumption, as in the
case of a simple (i.e., straight) trend line fitted to data which are growing
exponentially over time.

Independence
can also be violated in non-time-series models if errors tend to always have
the same sign under particular conditions, i.e., if the model systematically
underpredicts or overpredicts what will happen when the independent variables
have a particular configuration.

**How to
diagnose:**
The best test for serial correlation is to look at a **residual time series plot** (residuals vs. row number) and a **table or plot of residual autocorrelations**.
(If your software does not provide these by default for time series data, you
should figure out where in the menu or code to find them.) Ideally, most of the
residual autocorrelations should fall within the 95% confidence bands around
zero, which are located at roughly plus-or-minus 2-over-the-square-root-of-n,
where n is the sample size. Thus, if the sample size is 50, the autocorrelations
should be within about +/- 0.28 (roughly 0.3) of zero. If the sample size is 100, they should be within
+/- 0.2. Pay especially close attention to significant correlations at the
first couple of lags and in the vicinity of the seasonal period, because these
are probably not due to mere chance and are also fixable. The *Durbin-Watson
statistic* provides a test for significant residual autocorrelation at lag
1: the DW stat is approximately equal to 2(1-a) where a is the lag-1 residual
autocorrelation, so ideally it should be close to 2.0--say, between 1.4 and 2.6
for a sample size of 50.
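
The quantities mentioned above are simple to compute by hand if your software does not report them. The following is a minimal sketch in plain Python; the function names and the example residuals (a slowly wandering series) are assumptions for illustration.

```python
import math

def autocorrelation(resid, lag):
    """Lag-k autocorrelation of a list of residuals."""
    n = len(resid)
    mean = sum(resid) / n
    dev = [r - mean for r in resid]
    num = sum(dev[i] * dev[i - lag] for i in range(lag, n))
    den = sum(d * d for d in dev)
    return num / den

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squares."""
    num = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, len(resid)))
    den = sum(r * r for r in resid)
    return num / den

def band(n):
    """Approximate 95% band for residual autocorrelations: 2 / sqrt(n)."""
    return 2.0 / math.sqrt(n)

print(round(band(50), 2), round(band(100), 2))   # 0.28 0.2

# Hypothetical residuals that drift slowly, hence strong positive autocorrelation.
resid = [math.sin(i / 4.0) for i in range(50)]
r1 = autocorrelation(resid, 1)
dw = durbin_watson(resid)
print("lag-1 autocorrelation:", r1)
print("Durbin-Watson:", dw, " vs. 2(1-a):", 2 * (1 - r1))
```

For these residuals the lag-1 autocorrelation is far outside the band and the Durbin-Watson statistic is far below 2, and the two agree with the approximation DW = 2(1-a) to within a small end-point correction.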

**How to
fix:**
Minor cases of *positive *serial correlation (say, lag-1 residual
autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between
1.2 and 1.6) indicate that there is some room for fine-tuning in the model.
Consider adding lags of the dependent variable and/or lags of some of the
independent variables. Or, if you have an ARIMA+regressor procedure available
in your statistical software, try adding an AR(1) or
MA(1) term to the regression model.
An AR(1) term adds a lag of the dependent
variable to the forecasting equation, whereas an MA(1) term adds a lag of the
forecast error. If there is significant correlation at lag 2, then a 2nd-order
lag may be appropriate.
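
The effect of adding a lag of the dependent variable can be illustrated with simulated data. In the sketch below the data-generating process, the seed, and the helper names are all assumptions: Y depends on X and on its own previous value, so a model that omits the lag leaves strongly autocorrelated residuals.

```python
import numpy as np

rng = np.random.default_rng(0)        # fixed seed so the run is repeatable
n = 500
x = rng.normal(size=n)
e = rng.normal(size=n)

# Simulated series (an assumption): Y depends on X and on last period's Y.
y = np.empty(n)
y[0] = 5.0 / (1.0 - 0.7)              # start at the long-run mean
for t in range(1, n):
    y[t] = 5.0 + 0.7 * y[t - 1] + 1.0 * x[t] + e[t]

def ols_residuals(columns, target):
    """Residuals from a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target))] + columns)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coef

def lag1_autocorr(resid):
    dev = resid - resid.mean()
    return float(np.sum(dev[1:] * dev[:-1]) / np.sum(dev**2))

# Drop the first row so both models use the same observations.
r_static = lag1_autocorr(ols_residuals([x[1:]], y[1:]))
r_lagged = lag1_autocorr(ols_residuals([x[1:], y[:-1]], y[1:]))

print("lag-1 residual autocorrelation, Y on X:            ", r_static)
print("lag-1 residual autocorrelation, Y on X and lagged Y:", r_lagged)
print("approximate 95% band:", 2 / np.sqrt(n - 1))
```

This is only a sketch of the lagged-variable approach fitted by ordinary least squares, not a substitute for an ARIMA+regressor procedure.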

If there is
significant *negative* correlation in the residuals (lag-1 autocorrelation
more negative than -0.3 or DW stat greater than 2.6), watch out for the
possibility that you may have *overdifferenced* some of your variables.
Differencing tends to drive autocorrelations in the negative direction, and too
much differencing may lead to artificial patterns of negative correlation that
lagged variables cannot correct for.

If there is
significant correlation at the *seasonal* period (e.g. at lag 4 for
quarterly data or lag 12 for monthly data), this indicates that seasonality has
not been properly accounted for in the model. Seasonality can be handled in a
regression model in one of the following ways: (i) *seasonally adjust* the
variables (if they are not already seasonally adjusted), or (ii) use *seasonal
lags and/or seasonally differenced variables* (caution: be careful not to
overdifference!), or (iii) add *seasonal dummy variables* to the model
(i.e., indicator variables for different seasons of the year, such as MONTH=1
or QUARTER=2, etc.). The dummy-variable approach enables *additive seasonal
adjustment* to be performed as part of the regression model: a different
additive constant can be estimated for each season of the year. If the
dependent variable has been logged, the seasonal adjustment is multiplicative.
(Something else to watch out for: it is possible that although your dependent
variable is already seasonally adjusted, some of your independent variables may
not be, causing their seasonal patterns to leak into the forecasts.)
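
The dummy-variable approach (iii) can be sketched as follows for quarterly data. The series is synthetic (an assumption for illustration): a linear trend plus a fixed additive effect for each quarter, with the first quarter serving as the baseline so that only three dummies are needed alongside the constant.

```python
import numpy as np

n_years = 5
quarter = np.tile([1, 2, 3, 4], n_years)       # 1, 2, 3, 4, 1, 2, ...
t = np.arange(1, 4 * n_years + 1)              # trend variable

# Assumed additive seasonal effects for quarters 1 through 4.
effects = np.array([0.0, 10.0, -5.0, 20.0])
y = 100.0 + 2.0 * t + effects[quarter - 1]

# Indicator variables for quarters 2, 3, and 4 (quarter 1 is the baseline).
dummies = [(quarter == q).astype(float) for q in (2, 3, 4)]
X = np.column_stack([np.ones(len(t)), t] + dummies)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

print("constant, trend, Q2, Q3, Q4 adjustments:", coef)
```

Each dummy coefficient is the additive adjustment for that quarter relative to the baseline quarter. Including a dummy for every quarter together with the constant would make the columns perfectly collinear, which is why one season is left out.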

*Major
cases*
of serial correlation (a Durbin-Watson statistic well below 1.0,
autocorrelations well above 0.5) usually indicate a fundamental structural
problem in the model. You may wish to reconsider the transformations (if any)
that have been applied to the dependent and independent variables. It may help
to stationarize all variables through appropriate combinations of differencing,
logging, and/or deflating.

To test for **non-time-series violations of independence**,
you can look at plots of the residuals versus independent variables or plots of
residuals versus row number in situations where the rows have been sorted or
grouped in some way that depends (only) on the values of the independent
variables. The residuals should be
randomly and symmetrically distributed around zero under all conditions, and in
particular **there should be no
correlation between consecutive errors no matter how the rows are sorted**,
as long as it is on some criterion that does not involve the dependent
variable. If this is not true, it
could be due to a violation of the linearity assumption or due to bias that is
explainable by omitted variables (say, interaction terms or dummies for
identifiable conditions).

**Violations of homoscedasticity** (which are called
"heteroscedasticity") make it difficult to gauge the true standard
deviation of the forecast errors, usually resulting in confidence intervals
that are too wide or too narrow. In particular, if the variance of the errors
is increasing over time, confidence intervals for out-of-sample predictions
will tend to be unrealistically narrow. Heteroscedasticity may also have the
effect of giving too much weight to a small subset of the data (namely the
subset where the error variance was largest) when estimating coefficients.

**How to
diagnose:**
look at a plot of **residuals versus predicted values** and, in the case of time series data, a
plot of **residuals versus time**. Be alert for evidence of residuals that
grow larger either as a function of time or as a function of the predicted
value. To be really thorough, you should also generate plots of **residuals versus independent variables** to look for consistency there
as well. Because of imprecision in
the coefficient estimates, the errors may tend to be *slightly* larger for forecasts associated with predictions or values
of independent variables that are extreme in either direction, although the
effect should not be too dramatic.
What you hope *not* to see are
errors that systematically get larger in one direction by a significant amount.
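
A crude numerical check to accompany the residual-versus-predicted plot is to compare the spread of the residuals for small and large predictions. The sketch below uses synthetic data in which the errors are 10% of the level of the series, with alternating signs so that the example is deterministic. The data and the two-way split are assumptions for illustration, not a formal test.

```python
import numpy as np

# Synthetic data (an assumption): errors are proportional to the level of Y.
x = np.linspace(1.0, 10.0, 60)
level = 10.0 + 5.0 * x
sign = np.where(np.arange(60) % 2 == 0, 1.0, -1.0)   # +1, -1, +1, ...
y = level + 0.1 * level * sign

# Fit Y = b0 + b1*X by ordinary least squares.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
residuals = y - predicted

# Compare residual spread in the lower and upper halves of the predictions.
order = np.argsort(predicted)
lower, upper = np.array_split(residuals[order], 2)
ratio = upper.std() / lower.std()
print("std of residuals, small predictions:", lower.std())
print("std of residuals, large predictions:", upper.std())
print("ratio:", ratio)
```

A ratio well above 1 means the errors grow with the predicted value, which is the pattern to watch for on the plot.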

**How to
fix:** If the dependent variable is strictly positive and if the
residual-versus-predicted plot shows that the size of the errors is
proportional to the size of the predictions (i.e., if the errors seem
consistent in percentage rather than absolute terms), a log transformation
applied to the dependent variable may be appropriate. In time series models, heteroscedasticity
often arises due to the effects of inflation and/or real compound growth. Some
combination of *logging and/or deflating* will often stabilize the variance
in this case. Stock market data may show periods of increased or decreased
volatility over time. This is normal and is often modeled with so-called ARCH
(auto-regressive conditional heteroscedasticity) models in which the error
variance is fitted by an autoregressive model. Such models are beyond the scope
of this discussion, but a simple fix would be to work with shorter intervals of
data in which volatility is more nearly constant. Heteroscedasticity can also
be a byproduct of a significant violation of the linearity and/or independence
assumptions, in which case it may also be fixed as a byproduct of fixing those
problems.
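
The variance-stabilizing effect of logging can be seen in a small example. The sketch below uses a synthetic series (an assumption for illustration) that grows 5% per period with errors of plus or minus 10% of its level. It fits a trend to the logged series and then compares the spread of the errors in the first and second halves of the sample, both in log units and in original units. The helper name `spread_ratio` is made up.

```python
import numpy as np

# Synthetic series (an assumption): compound growth with percentage errors.
t = np.arange(40)
sign = np.where(t % 2 == 0, 1.0, -1.0)
y = 100.0 * np.exp(0.05 * t) * (1.0 + 0.1 * sign)

# Fit a linear trend to log(Y).
X = np.column_stack([np.ones(len(t)), t])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
log_pred = X @ coef
log_resid = np.log(y) - log_pred      # errors in log units
unit_error = y - np.exp(log_pred)     # the same errors in original units

def spread_ratio(errors):
    """Std. dev. of the second half of the sample over that of the first half."""
    first, second = np.array_split(errors, 2)
    return second.std() / first.std()

print("spread ratio in original units:", spread_ratio(unit_error))
print("spread ratio in log units:     ", spread_ratio(log_resid))
```

In original units the errors in the second half are much larger than in the first, while in log units their spread is nearly constant.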

*Seasonal patterns* in the data are a
common source of heteroscedasticity in the errors: unexplained variations in the dependent
variable throughout the course of a season may be consistent in percentage
rather than absolute terms, in which case larger errors will be made in seasons
where activity is greater, which will show up as a seasonal pattern of changing
variance on the residual-vs-time plot.
A log transformation is often used to address this problem. For example, if the seasonal pattern is
being modeled through the use of dummy variables for months or quarters of the
year, a log transformation applied to the dependent variable will convert the
coefficients of the dummy variables to multiplicative adjustment factors rather
than additive adjustment factors, and the errors in predicting the logged
variable will be (roughly) interpretable as percentage errors in predicting the
original variable. Seasonal
adjustment of all the data prior to fitting the regression model might be
another option.
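
To illustrate how logging the dependent variable turns dummy coefficients into multiplicative factors, the sketch below builds a synthetic quarterly series (an assumption for illustration) with 2% growth per quarter and a fixed multiplicative factor for each quarter, then fits a trend plus quarterly dummies to the logged series.

```python
import numpy as np

n_years = 6
quarter = np.tile([1, 2, 3, 4], n_years)
t = np.arange(1, 4 * n_years + 1)

# Assumed multiplicative seasonal factors for quarters 1 through 4.
factors = np.array([1.0, 1.2, 0.8, 1.5])
y = 100.0 * 1.02**t * factors[quarter - 1]

# Regress log(Y) on a trend and dummies for quarters 2-4 (quarter 1 is the baseline).
dummies = [(quarter == q).astype(float) for q in (2, 3, 4)]
X = np.column_stack([np.ones(len(t)), t] + dummies)
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

print("growth factor per quarter:", np.exp(coef[1]))
print("seasonal factors for Q2, Q3, Q4 relative to Q1:", np.exp(coef[2:]))
```

Exponentiating a dummy coefficient from the logged model gives the factor by which that season's level is multiplied relative to the baseline season.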

If a log
transformation has already been applied to a variable, then (as noted above) *additive* rather than multiplicative
seasonal adjustment should be used, if it is an option that your software
offers. Additive seasonal
adjustment is similar in principle to including dummy variables for seasons of
the year. Whether or not you should
perform the adjustment outside the model rather than with dummies depends on
whether you want to be able to study the seasonally adjusted data all by itself
and on whether there are unadjusted seasonal patterns in some of the
independent variables. (The
dummy-variable approach would address the latter problem.)

**Violations
of normality**
create problems for determining whether model coefficients are significantly
different from zero and for calculating confidence intervals for forecasts.
Sometimes the error distribution is "skewed" by the presence of a few
large outliers. Since parameter estimation is based on the minimization of
*squared* error, a few extreme observations can exert a disproportionate
influence on parameter estimates. Calculation of confidence intervals and
various significance tests for coefficients are all based on the assumption of
normally distributed errors. If the error distribution is significantly
non-normal, confidence intervals may be too wide or too narrow.

Technically,
the normal distribution assumption is not necessary if you are willing to
assume the model equation is correct and your only goal is to estimate its
coefficients and generate predictions in such a way as to minimize mean squared
error. The formulas for estimating
coefficients require no more than that, and some references on regression
analysis do not list normally distributed errors among the key assumptions. But generally we are interested in
making inferences about the model and/or estimating the probability that a
given forecast error will exceed some threshold in a particular direction, in
which case distributional assumptions are important. Also, a significant violation of the
normal distribution assumption is often a "red flag" indicating that
there is some other problem with the model assumptions and/or that there are a
few unusual data points that should be studied closely and/or that a better
model is still waiting out there somewhere.

**How to
diagnose:**
the best test for normally distributed errors is a **normal probability plot**
or **normal quantile plot** of the residuals. These are plots of
the fractiles of the error distribution versus the fractiles of a normal
distribution having the same mean and variance. If the distribution is normal,
the points on such a plot should fall close to the diagonal reference line. A *bow-shaped*
pattern of deviations from the diagonal indicates that the residuals have
excessive *skewness* (i.e., they are not symmetrically distributed, with too
many large errors in *one* direction).
An S-shaped pattern of deviations indicates that the residuals have excessive *kurtosis*--i.e.,
there are either too many or too few large errors in *both* directions.
Sometimes the problem is revealed to be that there are a few data points
on one or both ends that deviate significantly from the reference line
("outliers"), in which case they should get close attention.
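
The points on a normal quantile plot can be computed with the Python standard library alone. The following is a minimal sketch: the function name, the plotting positions (i - 0.5)/n (one common convention among several), and the example residuals are assumptions for illustration.

```python
from statistics import NormalDist, mean, pstdev

def normal_quantile_points(residuals):
    """Return (theoretical, observed) pairs for a normal quantile plot.

    The theoretical values are fractiles of a normal distribution having
    the same mean and standard deviation as the residuals.
    """
    n = len(residuals)
    ordered = sorted(residuals)
    reference = NormalDist(mean(residuals), pstdev(residuals))
    # Plotting positions (i - 0.5)/n are an assumed convention.
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))

# Hypothetical residuals with one large positive outlier.
resid = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.7, 6.0]
points = normal_quantile_points(resid)
for theoretical, observed in points:
    print(f"{theoretical:8.3f} {observed:8.3f}")
```

If the residuals were normal, the two columns would be close to each other in every row, which corresponds to points lying near the diagonal reference line. Here the last row stands out because the observed value is far above its theoretical fractile, flagging the outlier.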

There are
also a variety of **statistical tests for
normality**, including the Kolmogorov-Smirnov
test, the Shapiro-Wilk
test, the Jarque-Bera
test, and the Anderson-Darling
test. The Anderson-Darling test
(which is the one used by RegressIt) is generally considered to be the best, because
it is specific to the normal distribution (unlike the K-S test) and it looks at
the whole distribution rather than just the skewness and kurtosis (like the J-B
test). But all of these tests are
excessively "picky" in this author’s opinion. Real data rarely has errors that are
perfectly normally distributed, and it may not be possible to fit your data
with a model whose errors do not violate the normality assumption at the 0.05
level of significance. It is
usually better to focus more on violations of the other assumptions and/or the
influence of a few outliers (which may be mainly responsible for violations of
normality anyway) and to look at a normal probability plot or normal quantile
plot and draw your own conclusions about whether the problem is serious and
whether it is systematic.

Here is an
example of a bad-looking normal quantile plot (an S-shaped pattern with P=0 for
the A-D stat, indicating highly significant non-normality) from the beer sales analysis on this web
site:

…and
here is an example of a good-looking one (a linear pattern with P=0.5 for the
A-D stat, indicating no significant departure from normality):

**How to
fix:** violations
of normality often arise either because (a) the *distributions of the
dependent and/or independent variables* are themselves significantly
non-normal, and/or (b) the *linearity assumption* is violated. In such
cases, a nonlinear transformation of variables might cure both problems. In the
case of the two normal quantile plots above, the second model was obtained by
applying a natural log transformation to the variables in the first one.

The
dependent and independent variables in a regression model do not need to be
normally distributed by themselves--only the prediction errors need to be
normally distributed. (In fact,
independent variables do not even need to be random, as in the case of trend or
dummy or treatment or pricing variables.)
But if the distributions of some of the variables that *are* random are extremely asymmetric or
long-tailed, it may be hard to fit them into a linear model whose errors will
be normally distributed, and explaining the shape of their distributions may be
an interesting topic all by itself.
Keep in mind that the normal error assumption is usually justified by
appeal to the central limit theorem, which holds in the case where many random
variations are added together. If
the underlying sources of randomness are not interacting additively, this
argument fails to hold.

Another
possibility is that there are two or more *subsets*
of the data having *different statistical
properties*, in which case separate models should be built, or else some
data should merely be excluded, provided that there is some a priori criterion
that can be applied to make this determination.

In some
cases, the problem with the error distribution is mainly due to *one or two
very large errors*. Such values should be scrutinized closely: are they *genuine*
(i.e., not the result of data entry errors), are they *explainable*, are *similar
events* likely to occur again in the future, and how *influential* are
they in your model-fitting results? If they are merely errors or if they can be
explained as unique events not likely to be repeated, then you may have cause
to remove them. In some cases, however, it may be that the extreme values in
the data provide the most useful information about values of some of the
coefficients and/or provide the most realistic guide to the magnitudes of
forecast errors.
