Notes on linear
regression analysis (pdf file)

Introduction
to linear regression analysis

Regression examples

Mathematics
of simple regression

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs. price, part 2: fitting a simple
model

·
Beer sales vs. price, part 3: transformations
of variables

·
Beer sales vs.
price, part 4: additional predictors

·
NC natural gas
consumption vs. temperature

What to look for in
regression output

What’s a good value for R-squared?

What's the bottom line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression
analysis

Stepwise and all-possible-regressions

Excel file with
simple regression formulas

Excel file with regression
formulas in matrix form

**Latest
news:** If you are at least a part-time
user of Excel, you should check out the new release of **RegressIt**, a free Excel add-in. See it at __regressit.com__.
The linear regression version runs on both PC's and Macs and has a richer and
easier-to-use interface and much better designed output than other add-ins for
statistical analysis. It may make a good complement if not a substitute for
whatever regression software you are currently using, Excel-based or otherwise.
It also includes extensive built-in documentation and **pop-up teaching notes**. There is a separate logistic regression version with
highly interactive tables and charts that runs on PC's. RegressIt also now
includes a two-way interface
with R that allows you to run linear and logistic regression models in R
without writing any code whatsoever.

*If you have been using Excel's own Data Analysis add-in for
regression (Analysis Toolpak), this is the time to stop.** It has not
changed since it was first introduced in 1995, and it was a poor design even
then. It's a toy (a clumsy one at that), not a tool for serious work. Visit
this page for a discussion: What's wrong with
Excel's Analysis Toolpak for regression*

The log-log regression
model for predicting sales of 18-packs from price of 18-packs gave much better
results than the original model fitted to the unlogged variables, and it
yielded an estimated of the elasticity of demand for 18-packs with respect to
their own price. Specifically, the
slope coefficient of 6.705 in that model implied that, on the margin, a 1
percent change in price should be predicted to yield a 6.7% change in demand in
the other direction. But what about
prices of the other carton sizes?
It is reasonable to expect that there will also be substitution effects,
i.e. “cross-price elasticities.” Consumers ought to buy relatively more
[fewer] 18-packs when the price of another carton size is raised
[lowered]. Indeed, the correlations
and scatterplots for sales of 18-packs vs. prices of 12-packs and 30-packs
(taken from the original scatterplot matrix) suggest that there *are* such effects:

To test this hypothesis, let’s add these two variables to the
previous model. Here is the model
summary report produced by RegressIt:

The
coefficients of the two additional variables are significantly different from
zero, and most importantly there has been a significant reduction in the
standard error of the regression, from 0.356 to 0.281. Recall that errors measured in log units
can be *roughly* interpreted as
percentages, if they are on the order of 0.2 or less. These numbers are a bit too large for
that approximation to fit very well, but still, *very* roughly speaking, the standard deviation of the errors in percentage
terms has been reduced from around 36% to around 28%, which is a meaningful
improvement. (Return
to top of page.)

The
estimated coefficient of PRICE_18PK_PN (-6.331) is very close to its value in
the previous model, which is good:
it means the own-price effect is fairly robust to the manipulation of
the other prices. The estimated
coefficient of 0.2016 for PRICE_12PK_LN (its cross-price elasticity with
respect to sales of 18-packs) means roughly that a 1% increase [decrease] in
the price of 12-packs should be predicted to yield a 2% increase [decrease] in
sales of 18-packs, other things being equal. For 30-packs, whose coefficient is 2.47,
the effect is a little stronger.

Now
for the plots. The
actual-and-predicted-vs-obs# plots appears to show a very nice fit:

However,
the residual-vs-obs# plot indicates a bit of a problem: *there
is a noticeable time pattern in the errors, namely an upward trend*. (Errors in the first half of the year
are nearly all negative, while those in the second half are mostly
positive.) This pattern is also
reflected in the residual autocorrelations, which are positive at lags 1 and 2:
0.18 and 0.278 respectively. These
values are actually not quite large enough in magnitude to be significant at
the 0.05 level, although that is not a magical standard. (The critical value for this test is
roughly 2/SQRT(n), where n is the sample size, which is 0.28 in this case.) The appearance of the residual plot
rather than the values of the test statistics is the best diagnostic of the
time pattern here. Always look at
the pictures! (Return
to top of page.)

A *linear trend in the errors* of a
regression model can be addressed by including *a time index variable* (for example, the week-number variable in
this file) as another regressor. If
we do that here, we obtain the following 4-variable multiple regression model:

The
standard error of the regression has been further reduced by a meaningful
amount, from 0.281 to 0.244. The
coefficients of the original variables have wiggled around a bit, but are in
the same ballpark as before. The
coefficient of Week is significantly different from zero (t=4.079), and its
value of 0.011 appears to suggest that the average trend in sales of 18-packs
is about 1.1% per week. But
appearances can sometimes be deceiving:
*in the context of a multiple
regression model, a trend variable adjusts for differences in trends among all
the variables*, rather than measuring the trend in the dependent variable
alone. (Return
to top of page.)

The
updated plots are shown below. The
residual-vs-obs# plot now does not show any systematic time pattern, and the
autocorrelations that were previously seen in the errors have essentially
vanished.

Overall,
this appears to be a good model, and it has a lot to say about how sales of
18-packs are affected by price. The
additional take-aways from this step of the analysis are that:

·
There is significant cross-price elasticity in the data for
sales of 18-packs vs. prices of 12-packs and 30-packs.

·
The trend in sales of 18-packs is not entirely explained by
trends in price

·
This exercise has illustrated the construction of a linear
regression model by a series of logical steps, beginning with descriptive
analysis and proceeding through the fitting of a sequence of models that are
motivated by economic theory and by interpretation of patterns seen in the
statistical and graphical output.

This
is not quite the end of the story:
some final comments and another picture are given below.

Postscript: this is real data, and one thing that is
a bit curious about it is that there is no evidence of seasonality. In general, beer sales in the U.S.
exhibit very significant seasonality, being higher in the hotter months and
having large spikes in certain holiday weeks. Here’s a picture of total weekly
sales of the top-selling brands at a much larger sample of stores: