Notes on linear
regression analysis (pdf file)

Introduction to linear
regression analysis

Mathematics
of simple regression

Regression examples

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs. price, part 2: fitting a simple
model

·
Beer sales vs.
price, part 3: transformations of variables

·
Beer sales vs.
price, part 4: additional predictors

·
NC natural gas
consumption vs. temperature

What to
look for in regression output

What’s a good value for R-squared?

What's the bottom line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression analysis

Stepwise and all-possible-regressions

Excel file with
simple regression formulas

Excel file with regression
formulas in matrix form

**Latest
news:** If you are at least a part-time
user of Excel, you should check out the new release of **RegressIt**, a free Excel add-in. See it at __regressit.com__.
The linear regression version runs on both PC's and Macs and has a richer and
easier-to-use interface and much better designed output than other add-ins for
statistical analysis. It may make a good complement if not a substitute for
whatever regression software you are currently using, Excel-based or otherwise.
It also includes extensive built-in documentation and **pop-up teaching notes**. There is a separate logistic regression version with
highly interactive tables and charts that runs on PC's. RegressIt also now
includes a two-way interface
with R that allows you to run linear and logistic regression models in R
without writing any code whatsoever.

*If you have been using Excel's own Data Analysis add-in for
regression (Analysis Toolpak), this is the time to stop.** It has not
changed since it was first introduced in 1995, and it was a poor design even
then. It's a toy (a clumsy one at that), not a tool for serious work. Visit
this page for a discussion: What's wrong with
Excel's Analysis Toolpak for regression*

In the beer sales example, a simple regression fitted to the original variables
(price-per-case and
cases-sold for 18-packs) yields poor results because it makes wrong assumptions
about the nature of the patterns in the data. The relationship between the two
variables is not linear, and if a linear model is fitted anyway, the errors do
not have the distributional properties that a regression model assumes, and
forecasts and lower confidence limits at the upper end of the price range have
negative values. What to do in such
a case? In some situations there
may be omitted variables which, if they could be identified and added to the
model, would correct the problems.
In other situations it could be that breaking the data set up into
subsets, on the basis of ranges of the independent variables, would allow
linear models to fit reasonably well.
And there are more complex model types that could be tried--linear regression
models are merely the simplest place to start. But an often-used and often-successful
strategy is to look for *transformations *of
the original variables that straighten out the curves, normalize the errors,
and/or exploit the time dimension.

In modeling consumer demand, a standard
approach is to apply a** natural log
transformation **to both prices and quantities before fitting a regression
model. As was discussed on the log transformation page in these notes, when a simple linear regression model
is fitted to logged variables, the slope coefficient represents the predicted *percent *change in the dependent variable
per* percent* change in the independent
variable, regardless of their current levels. Larger changes in the independent
variable are predicted to result in a *compounding
*of the marginal effects rather than a linear extrapolation of them. Let’s see how this approach works
on the beer sales data.

Suppose that we apply a natural log
transformation to all 6 of the price and sales variables in the data set, and
let the names of the logged variables be the original variables with
“_LN” appended to them.
(This is the naming convention used by the variable-transformation tool
in RegressIt.) The
correlation matrix and scatterplot matrix of of the logged variables look like
this:

The correlations are slightly stronger
among the logged variables than among the original variables, and the variance
of the vertical deviations from the regression lines in the scatterplots is now
very similar for both large and small values of the horizontal-axis
variable. So, let us try fitting a
simple regression model to the logged 18-pack variables. The summary table for the model is shown
below. The slope coefficient of
-6.705 means that on the margin **a 1%
change in price is predicted to lead to a 6.7% change in sales, in the opposite
direction**, with a compounding of this effect for larger percentage price
changes. The standard error of the
regression of 0.356 is not directly comparable to that of the original model,
because it is measured in log units.
(Return to top of page.)

If the standard error of the regression
in a model fitted to logged data is on the order of 0.1 or less, it can be
roughly interpreted as the standard error of a forecast measured in percentage
terms. For example, if the standard
error of the regression was 0.05, it would be OK to say that the standard error
of a forecast is about 5% of the forecast value. (An increase of 0.05 in the
natural log of variable corresponds to a proportional change of LN(1.05) ≈
0.049, or 4.9%, and a decrease of 0.05 in the natural log corresponds to a
proportional change of LN(0.95) ≈ -0.051, or -5.1%, so +/- 0.05 in log
units is about the same as +/-5% in percentage terms. For a change of +/- 0.1 in the natural
log, the corresponding proportional changes are LN(1.1) ≈ .105 and
LN(0.9) ≈ 0.095, i.e., +10.5% or -9.5%, and so on.) The regression
standard error of 0.356 observed here is too large for that approximation to
apply, i.e., it would not be NOT be OK to say that the standard error of a
forecast is 36% of the forecast value, because the confidence limits are not
sufficiently symmetric around the point forecast. However, this does indicate that the
unexplained variations in demand are fairly large in percentage terms.

The residual statistics table shows that
the distribution of the errors is very close to a normal distribution (the A-D*
stat is very small, with a P-value on the order of 0.5), and the
autocorrelations of the errors are insignificant at the first couple of lags. (Unless we are looking for seasonal
patterns, we usually are only concerned with the first couple of lags as far as
autocorrelations are concerned.)

The line fit plot also looks very
good: the vertical deviations from
the regression line are approximately the same size for large and small
predictions, and they nicely fill the space between the 95% confidence limits.

The rest of the chart output of the
model is shown farther down on this page, and it all looks reasonably good, so
I will not discuss it further. **What is of special interest is the
appearance of the forecasts that are generated by this model when they are
translated from log units back into real units of cases.** This requires applying the EXP
function to the forecasts and their lower and upper confidence limits generated
by the log-log model. (Return to top of page.)

All regression software has the capability
to generate forecasts for additional values of the independent variables
provided by the user. Often the
convention is for the program to *automatically*
generate forecasts for any rows of data where the independent variables are all
present and the dependent variable is missing. That convention is followed in
RegressIt. Here, forecasts for log
sales have been automatically generated for integer-valued prices in the range
from $13 to $20 (numbers that are just outside the historical minimum and
maximum), which were included in the original data file.

Here are the results of applying the EXP
function to the numbers in the table above to convert them back to real units:

Note that all the numbers are positive,
and the widths of the confidence intervals are proportional to the size of the
forecasts. They are not symmetric,
though: they are wider on the high
side than the low side, which is logical.
Here’s a chart generated from the last table, which tells the
story: the nonlinear forecast curve
captures the steeper slope of the pattern in the data at low price levels, and
the confidence limits fit the magnitudes of random variations at both low and
high price levels.

For purposes of comparison, here is a
similarly formatted chart of forecasts produced by the original regression
model for unlogged data. Which
looks more reasonable?

The rest of the chart output from the
log-log model is shown farther down on this page, and it looks fine as
regression models go. The
take-aways from this step of the analysis are the following:

·
The log-log model is well
supported by economic theory and it does a very plausible job of fitting the
price-demand pattern in the beer sales data.

·
The nonlinear curve in the
(unlogged) forecasts adapts to the steeper slope in the demand function at low
price levels.

·
The way in which the width
of the confidence intervals scales in proportion to the forecasts provides a
good fit to the vertical distribution of sales values at different price
levels.

·
It is impossible for the
log-log model’s forecasts or confidence limits for real sales to be
negative.

There is more to be done with this
analysis, namely considering the effects of other variables. Click here to proceed to that step.