Notes on linear
regression analysis (pdf file)

Introduction
to linear regression analysis

Regression examples

Mathematics
of simple regression

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs. price, part 2: fitting a simple
model

·
Beer sales vs. price, part 3: transformations
of variables

·
Beer sales vs.
price, part 4: additional predictors

·
NC natural gas consumption vs. temperature

What to look for in
regression output

What’s a good value for R-squared?

What's the bottom line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression
analysis

Stepwise and all-possible-regressions

Excel file with
simple regression formulas

Excel file with regression formulas
in matrix form

If you are a PC Excel user, you *must* check this out:

RegressIt: free Excel add-in for
linear regression and multivariate data analysis

The log-log regression
model for predicting sales of 18-packs from price of 18-packs gave much better
results than the original model fitted to the unlogged variables, and it
yielded an estimated of the elasticity of demand for 18-packs with respect to
their own price. Specifically, the
slope coefficient of 6.705 in that model implied that, on the margin, a 1
percent change in price should be predicted to yield a 6.7% change in demand in
the other direction. But what about
prices of the other carton sizes?
It is reasonable to expect that there will also be substitution effects,
i.e. “cross-price elasticities.” Consumers ought to buy relatively more
[fewer] 18-packs when the price of another carton size is raised
[lowered]. Indeed, the correlations
and scatterplots for sales of 18-packs vs. prices of 12-packs and 30-packs
(taken from the original scatterplot matrix) suggest that there *are* such effects:

To test this hypothesis, let’s add these two variables to the
previous model. Here is the model
summary report produced by RegressIt:

The
coefficients of the two additional variables are significantly different from
zero, and most importantly there has been a significant reduction in the
standard error of the regression, from 0.356 to 0.281. Recall that errors measured in log units
can be *roughly* interpreted as
percentages, if they are on the order of 0.2 or less. These numbers are a bit too large for
that approximation to fit very well, but still, *very* roughly speaking, the standard deviation of the errors in
percentage terms has been reduced from around 36% to around 28%, which is a
meaningful improvement. (Return to top of page.)

The
estimated coefficient of PRICE_18PK_PN (-6.331) is very close to its value in
the previous model, which is good:
it means the own-price effect is fairly robust to the manipulation of
the other prices. The estimated
coefficient of 0.2016 for PRICE_12PK_LN (its cross-price elasticity with
respect to sales of 18-packs) means roughly that a 1% increase [decrease] in
the price of 12-packs should be predicted to yield a 2% increase [decrease] in
sales of 18-packs, other things being equal. For 30-packs, whose coefficient is 2.47,
the effect is a little stronger.

Now
for the plots. The
actual-and-predicted-vs-obs# plots appears to show a very nice fit:

However,
the residual-vs-obs# plot indicates a bit of a problem: *there
is a noticeable time pattern in the errors, namely an upward trend*. (Errors in the first half of the year
are nearly all negative, while those in the second half are mostly
positive.) This pattern is also
reflected in the residual autocorrelations, which are positive at lags 1 and 2:
0.18 and 0.278 respectively. These
values are actually not quite large enough in magnitude to be significant at
the 0.05 level, although that is not a magical standard. (The critical value for this test is
roughly 2/SQRT(n), where n is the sample size, which is 0.28 in this
case.) The appearance of the
residual plot rather than the values of the test statistics is the best
diagnostic of the time pattern here.
Always look at the pictures!
(Return to top of page.)

A *linear trend in the errors* of a
regression model can be addressed by including *a time index variable* (for example, the week-number variable in
this file) as another regressor. If
we do that here, we obtain the following 4-variable multiple regression model:

The
standard error of the regression has been further reduced by a meaningful
amount, from 0.281 to 0.244. The coefficients
of the original variables have wiggled around a bit, but are in the same
ballpark as before. The coefficient
of Week is significantly different from zero (t=4.079), and its value of 0.011
appears to suggest that the average trend in sales of 18-packs is about 1.1%
per week. But appearances can
sometimes be deceiving: *in the context of a multiple regression
model, a trend variable adjusts for differences in trends among all the
variables*, rather than measuring the trend in the dependent variable
alone. (Return
to top of page.)

The
updated plots are shown below. The
residual-vs-obs# plot now does not show any systematic time pattern, and the
autocorrelations that were previously seen in the errors have essentially vanished.

Overall,
this appears to be a good model, and it has a lot to say about how sales of
18-packs are affected by price. The
additional take-aways from this step of the analysis are that:

·
There is significant cross-price elasticity in the data for
sales of 18-packs vs. prices of 12-packs and 30-packs.

·
The trend in sales of 18-packs is not entirely explained by
trends in price

·
This exercise has illustrated the construction of a linear regression
model by a series of logical steps, beginning with descriptive analysis and
proceeding through the fitting of a sequence of models that are motivated by
economic theory and by interpretation of patterns seen in the statistical and
graphical output.

This
is not quite the end of the story:
some final comments and another picture are given below.

Postscript: this is real data, and one thing that is
a bit curious about it is that there is no evidence of seasonality. In general, beer sales in the U.S.
exhibit very significant seasonality, being higher in the hotter months and
having large spikes in certain holiday weeks. Here’s a picture of total weekly
sales of the top-selling brands at a much larger sample of stores: