Notes on linear
regression analysis (pdf file)

Introduction
to linear regression analysis

Mathematics
of simple regression

Regression examples

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs.
price, part 2: fitting a simple model

·
Beer sales vs. price, part 3: transformations
of variables

·
Beer sales vs.
price, part 4: additional predictors

·
NC natural gas
consumption vs. temperature

What to look for in
regression output

What’s a good
value for R-squared?

What's the bottom line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression
analysis

Stepwise and all-possible-regressions

Excel file with
simple regression formulas

Excel file with regression
formulas in matrix form

**Latest news:** If you are at least
a part-time user of Excel, you should check out the new release of **RegressIt**, a free Excel add-in. See it
at regressit.com.
The linear regression version runs on both PC's and Macs and has a richer and
easier-to-use interface and much better designed output than other add-ins for
statistical analysis. It may make a good complement if not a substitute for
whatever regression software you are currently using, Excel-based or otherwise.
It also includes extensive built-in documentation and **pop-up teaching notes**. There is a separate logistic regression version with
highly interactive tables and charts that runs on PC's. RegressIt also now includes a two-way interface with R that
allows you to run linear and logistic regression models in R without writing
any code whatsoever.

*If you
have been using Excel's own Data Analysis add-in for regression (Analysis
Toolpak), this is the time to stop.** It has
not changed since it was first introduced in 1995, and it was a poor design
even then. It's a toy (a clumsy one
at that), not a tool for serious work.
Visit this page for a discussion: What's wrong with Excel's
Analysis Toolpak for regression*

Having already performed some descriptive data
analysis in which we learned quite a bit about relationships and time
patterns among the beer price and beer sales variables, let’s naively
proceed to **fit a simple regression model
to predict sales of 18-packs from price of 18-packs**. I say “naively” because,
although we know that there is a very strong relationship between price and
demand, the scatterplot indicated that there is a problem with one of the
assumptions of a regression model, namely that vertical deviations from the
regression line (prediction errors) ought to have roughly the same variance for
small and large predictions. Also,
there are some logical problems with a model that assumes the relation between
sales and price to be perfectly linear in situations where the smallest values
of sales are tiny in comparison to the largest ones. Putting those concerns aside for the
moment, here is the standard regression summary output (as formatted by
RegressIt) for a model in which SALES_18PK is the dependent variable and
PRICE_18PK is the independent variable:

The numbers are interpreted as follows:

a. The
estimated regression equation is printed out at the top: **Predicted
CASES_18PK = 1,812 - 93.007*PRICE_18PK**. It shows how the coefficients that
appear in the regression summary table below are to be used in predicting sales
from price. This is the equation of
a straight line whose *intercept* is
1812 (the height of the line at the point where PRICE_18PK is equal to zero,
which it will never come close to in practice) and which has a *slope* of -93.007, which means the model
predicts that 93 *fewer *cases worth of
18-packs will be sold per $1 *increase*
in the price per case.

The intercept in a regression model is rarely a number with any direct economic
or physical meaning. The important
thing to keep in mind about a regression model is that *the regression line always passes through the center of mass of the
data*, i.e., the point in coordinate space at which all variables are equal
to their mean values. The slope
coefficient(s) tell you how the expected value of the dependent variable will
move away from its mean value as the independent variables move away from their
own mean values. From the
desciptive statistics table, we know that the center of mass of the data for
this regression is the point at which price-per-case for 18-packs is $16.73 and
the number of cases sold is 257.
Just knowing these two numbers tells us that we should predict 257 cases
sold when the price-per-case is $16.73, regardless of the results of fitting
the regression model.

b. **The most important number in the output, besides the model
coefficients, is the standard error of the regression, which is 131 in this
example.** (To 3 decimal
places it is 130.529, but it should be rounded off to an integer value for
presentation. Don’t confuse
your audience with more decimal places than are significant. Many statistical programs routinely display
10 or more digits of precision in their output tables!) **The standard error of the regression is the estimated standard
deviation of the “noise” in the dependent variable that is
unexplainable by the independent variable(s), and it is a lower bound on the
standard deviation of any of the model’s forecast errors, under the
assumption that the model is correct. ** The estimated standard deviation of a
forecast error is actually slightly *larger*
than the standard error of the regression because it must also take into
account the errors in estimating the parameters of the model, but there is not
much difference if the sample size is reasonably large and the values of the
independent variables are not too extreme.
(For details of how the exact standard errors of forecasts are computed,
see the "Mathematics of simple
regression" page or “Notes on linear
regression” handout.)

From the usual 2-standard-error rule of thumb, it follows that **a 95% confidence interval for a forecast
from the model is approximately equal
to the point forecast plus or minus 2 times the standard error of the
regression**. The

So, in this model, the value of 131 for the standard error of the regression means (roughly) that a 95% confidence interval for any forecast will be equal to the point forecast plus-or-minus 262 cases. We begin to detect a problem here: the number of cases sold is typically much less than 100 in weeks when the price level is high. If the model is capable of giving an accurate point forecast in such a week, then it would be silly to compute a lower confidence limit for the forecast by subtracting 262 from it!

c. **Because this is a simple regression
model, R-squared is merely the square of the correlation between price and
sales: 0.751 = (-0.866) ^{2}.** This is the fraction of the
variance of the dependent variable that is “explained” by the
independent variable, i.e., the fractional amount by which the variance of the
errors is less than the variance of the dependent variable.

d. The
estimated intercept is 1812 (cases) with a standard error of 128. **The
standard error of a coefficient is the estimated
standard deviation of the error in estimating it. ** By the usual rule of thumb, an

e. **The cofficient of PRICE_18PK is -93 with
a standard error of 7.6, and its 95% confidence interval is [-108,-78]. ** This is not too wide an interval
(as these things go), so it appears that we have a reasonably precise estimate
of the strength of the price-demand relationship. Again, though, these numbers are
meaningful only if the model assumptions are approximately correct, in
particular the assumption that the same price-demand relationship holds across
the whole range of prices.

f. **The t-statistic of a coefficient
estimate is its point value divided by its standard error, i.e., its
“number of standard errors away from zero.” ** In general we do not care about the
t-stat of the *intercept* unless it is
possible for all the independent variables to simultaneously go to zero and we
are interested in whether the prediction should also be zero in such a
situation. (That is not the case
here.) The t-stat of the slope
coefficient is -93.007/7.581 = -12.269. By the usual rule of thumb, **a coefficient estimate is significantly
different from zero (at the 0.05 level of significance) if its t-stat is
greater than 2 in magnitude**, which is certainly true here. (Again, the exact critical t-value
is 2.009, not 2, for this model, but the difference is not important.)

The t-statistic of a coefficient estimate exceeds the specified critical value
if and only if the corresponding confidence interval for the coefficient does
not include zero: this is a
mathematical identity. So, it is
equivalent to check to see whether the magnitude of the t-statistic exceeds the
critical value for the given level of confidence or to check to see whether the
correponding confidence interval includes zero.

g. In
the **analysis-of-variance table**, the
only interesting numbers are **the
F-statistic and its P-value**.
The F-statistic tests whether all the independent variables in the model
are “jointly” significant, regardless of whether they are
individually significant. In a
simple regression model, there is only one independent variable, so the the
F-statistic tests its significance alone.
In fact, **in a simple regression
model, the F-statistic is simply the square of the t-statistic of the slope
coefficient, and their P-values are the same. ** In this case we have 150.527 = (-12.269)^{2}. The F-statistic is usually not of
interest unless you have a group of variables that should logically be taken
together as a unit (say, dummy variables for a set of mutually exclusive
conditions, as you might have in a designed experiment), and therefore the
analysis-of-variance table is minimized (hidden) by default in RegressIt.

Now, for a better perspective on how well the model fits the data,
let’s look at the graphical output.
First, here is the line fit plot, which shows the regression line
superimposed on the data. This is
exactly the same line that was superimposed on the scatterplot in the
descriptive data analysis, except that it also shows 95% confidence bands for
forecasts.

Here we see a couple of problems.
First, as already noted,** the
unexplained deviations from the regression line are much larger for low price
levels than for high price levels. ** Under the assumptions of the regression
model, they should be approximately the same size across the whole range, as
indicated by the fact that the 95% confidence bands have an almost constant
width over the whole range. In
theory, at every price level, roughly 95% of the data should fall within the
95% confidence bands, and the *vertical*
deviations from the regression line should be identically normally distributed.
Also note that there is a hole in the middle of the price range: no values in
between the low $15’s and the low $18’s were observed. This is not necessarily a problem as far
as the regression model is concerned:
there is no requirement that values of the *independent* variable(s) should have any particular sort of
distribution. They might even take on only integer values, perhaps just
0’s and 1’s.
What’s important is how the values of the *dependent* variable are distributed, for any given values of the
independent variable(s).

An even more serious logical problem is also observed in this plot: **the
model predicts negative values of sales for prices larger than $19.50 per case
(just barely outside the historical range), and the lower 95% confidence limit
for a prediction is negative for prices larger than about $16.50 per case**. So, the predictions and confidence
intervals for high price levels cannot be taken very seriously! One alternative modeling strategy that
suggests itself is that the sales data for 18-packs could be broken up into two
subsets, one for low prices and one for high prices, in light of the fact that
intermediate price levels are typically not used. However, let us suppose that we are
interested in predicting what *might *happen
if we were to set the price to some arbitrary value in the middle range, for
which it is necessary to fit one model to all the data.

Next let’s look at a **time
series plot of actual and predicted values**, to see how how the forecasts
and data lined up on a week-to-week basis.

Here we see that the model slightly overpredicted the moderate sales
spikes that occurred early in the year and significantly underpredicted the
large spikes that occurred later in the year. This plot should always be expected to
show some regression-to-the-mean on average, but clearly the model is making
errors for large predictions in a very systematic way. In general, this plot is most useful for
detecting significant *time patterns*
in the deviations between forecasts and actual values. Here the problem is not a time pattern
in the errors per se.

The remainder of the chart output for this model is shown below. The **residual-vs-observation-number
chart **(i.e., residuals versus
time, which is always important for time series data) shows some detail that
wasn’t quite as apparent on the previous chart, namely that the model
made a few serious *over*predictions
(errors that are negative in sign) along with the serious
underpredictions. This
plot is also somewhat unsatisfactory for the fact that it shows that nearly all
of the model’s largest errors occurred in the second half of the
year. The reason for this is
clear: most of the price
manipulation and most of the spikes in sales occurred in the second half of the
year. Still, the model assumes that
the errors should have the same variance at all points in time, regardless of the
values of the independent variables.

The** residual-vs-predicted plot**
is where you go to look for evidence of *nonlinearities
*(which would show up here as a *curved*
pattern) or *heteroscedasticity (*errors
that do not have the same variance for all levels of the predictions). Here we see very strong heteroscedasticity,
which was also apparent on the line fit plot: the model makes bigger errors when
making bigger predictions. ** For a simple regression model, the
residual-vs-predicted plot is just a “tilted” copy of the line fit
plot in which the regression line is superimposed on the X-axis, and it is also
flipped left-to-right if the slope coefficient is negative, as it is here.**

Lastly (at the bottom of this page), we have the **residual histogram plot** and **normal
quantile plot **which reveal the shape of the error distribution, along with
the **Anderson-Darling statistic** which
provides a numerical measure of the degree to which the error distribution is
non-normal. The importance of
testing for normality of the error distribution is that the formulas for calculating
confidence intervals for forecasts are based on properties of the normal
distribution. If the error
distribution is far from normal, then there will be some systematic over- or
under-estimation of “tail area probabilities”. Here, the outliers that are seen
in the histogram plot and the S-shaped pattern that is seen in the normal
quantile plot indicate a signficantly non-normal distribution, as does the A-D
stat, whose value of 2.591 is far above the 0.75 [1.04] critical value for
significance at the 0.05 [0.01] level.

The A-D stat, like other statistical tests for normality, should not be
given an exaggerated importance, particularly when sample sizes are very small
or very large. With very small
samples it is hard to determine much at all about the shape of the error
distribution and with very large samples even a tiny departure from normality
will show up as statistically significant.
(My own preference is to focus on shape of the pattern, if any, that is
seen on the normal quantile plot, and also to look closely at the most extreme
errors to try to figure out what went wrong there.) And like other diagnostic test
statistics for the model assumptions,* the
A-D stat is not the bottom line*, just a little red flag that may or may not
wave. Still, it provides some
sort of numerical benchmark to go along with the visual impressions of
non-normality that are obtained from the histogram and quantile plots.

**What’s the bottom
line? **Despite
its high R-squared value, this is not a good model. Its forecasts and their confidence
limits are illogical at high price levels. The relationship between beer
sales and beer price is evidently not linear over such a wide range in prices,
and the model also badly violates the homoscedasticity and normal-distribution
assumptions for the errors.
Let this be a warning: **not all significant relationships are
linear, not all random variables are normally distributed, and not all models
with high R-squared values are good ones.**