Regression example, part 3:  transformations of variables

In the beer sales example, a simple regression fitted to the original variables (price-per-case and cases-sold for 18-packs) yields poor results because it makes wrong assumptions about the nature of the patterns in the data.  The relationship between the two variables is not linear, and if a linear model is fitted anyway, the errors do not have the distributional properties that a regression model assumes, and forecasts and lower confidence limits at the upper end of the price range have negative values.  What to do in such a case?  In some situations there may be omitted variables which, if they could be identified and added to the model, would correct the problems.  In other situations it could be that breaking the data set up into subsets, on the basis of ranges of the independent variables, would allow linear models to fit reasonably well.  And there are more complex model types that could be tried--linear regression models are merely the simplest place to start.  But an often-used and often-successful strategy is to look for transformations of the original variables that straighten out the curves, normalize the errors, and/or exploit the time dimension.

In modeling consumer demand, a standard approach is to apply a natural log transformation to both prices and quantities before fitting a regression model.  As was discussed on the log transformation page in these notes, when a simple linear regression model is fitted to logged variables, the slope coefficient represents the predicted percent change in the dependent variable per percent change in the independent variable, regardless of their current levels.  Larger changes in the independent variable are predicted to result in a compounding of the marginal effects rather than a linear extrapolation of them.  Let’s see how this approach works on the beer sales data.
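As a minimal sketch of this idea (not the RegressIt workflow), the following Python snippet fits a simple regression to the logged values of synthetic data generated from a constant-elasticity demand curve.  The price grid, the scale constant of 50000, and the elasticity of -2 are made-up values chosen only for illustration, not numbers from the beer sales data, and the function name fit_simple_regression is hypothetical.

```python
import math

def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic, noise-free demand curve (assumed values, not the beer data):
# quantity = 50000 * price ** -2
prices = [13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0]
quantities = [50000.0 * p ** -2.0 for p in prices]

# Log both variables, then fit a straight line in log-log space.
log_prices = [math.log(p) for p in prices]
log_quantities = [math.log(q) for q in quantities]

intercept, slope = fit_simple_regression(log_prices, log_quantities)
print(f"slope (price elasticity) = {slope:.3f}")
print(f"intercept = {intercept:.3f}  (log of the scale constant)")
```

Because the synthetic data follow a power law exactly, the fitted slope recovers the assumed elasticity and the exponentiated intercept recovers the scale constant.  With real data the fit would of course include error.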

Suppose that we apply a natural log transformation to all 6 of the price and sales variables in the data set, and let the names of the logged variables be the original variables with “_LN” appended to them.  (This is the naming convention used by the variable-transformation tool in RegressIt.)   The correlation matrix and scatterplot matrix of the logged variables look like this:

The correlations are slightly stronger among the logged variables than among the original variables, and the variance of the vertical deviations from the regression lines in the scatterplots is now very similar for both large and small values of the horizontal-axis variable.  So, let us try fitting a simple regression model to the logged 18-pack variables.  The summary table for the model is shown below.  The slope coefficient of -6.705 means that on the margin a 1% change in price is predicted to lead to a 6.7% change in sales, in the opposite direction, with a compounding of this effect for larger percentage price changes.  The standard error of the regression of 0.356 is not directly comparable to that of the original model, because it is measured in log units.
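The compounding effect can be made concrete with a few lines of Python.  This is a sketch that takes the slope of -6.705 from the text as given and assumes the fitted relationship is an exact power law in price; the function name predicted_sales_change is hypothetical.

```python
SLOPE = -6.705  # slope coefficient reported for the logged 18-pack model

def predicted_sales_change(price_change_pct, slope=SLOPE):
    """Predicted percent change in sales for a given percent change in price.

    In a log-log model, sales scale by (price ratio) ** slope.
    """
    price_ratio = 1.0 + price_change_pct / 100.0
    sales_ratio = price_ratio ** slope
    return 100.0 * (sales_ratio - 1.0)

for pct in (1, 5, 10, -10):
    print(f"{pct:+d}% price -> {predicted_sales_change(pct):+.1f}% sales")
```

For a 1% price change the result is close to the slope itself, but for a 10% price increase the predicted drop in sales is roughly 47% rather than 67%, and for a 10% price cut the predicted sales roughly double.  That is the sense in which the marginal effects compound rather than extrapolate linearly.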

If the standard error of the regression in a model fitted to logged data is on the order of 0.1 or less, it can be roughly interpreted as the standard error of a forecast measured in percentage terms.  For example, if the standard error of the regression was 0.05, it would be OK to say that the standard error of a forecast is about 5% of the forecast value. (An increase of 0.05 in the natural log of a variable corresponds to a proportional change of EXP(0.05) - 1 ≈ 0.051, or +5.1%, and a decrease of 0.05 in the natural log corresponds to a proportional change of EXP(-0.05) - 1 ≈ -0.049, or -4.9%, so +/- 0.05 in log units is about the same as +/-5% in percentage terms.  For a change of +/- 0.1 in the natural log, the corresponding proportional changes are EXP(0.1) - 1 ≈ 0.105 and EXP(-0.1) - 1 ≈ -0.095, i.e., +10.5% or -9.5%, and so on.) The regression standard error of 0.356 observed here is too large for that approximation to apply, i.e., it would NOT be OK to say that the standard error of a forecast is 36% of the forecast value, because the confidence limits are not sufficiently symmetric around the point forecast.  However, this does indicate that the unexplained variations in demand are fairly large in percentage terms.
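The conversion between log units and percentage terms can be checked numerically.  The sketch below uses only the standard identity that a change of d in the natural log multiplies the original variable by EXP(d); the function name log_change_to_percent is hypothetical.

```python
import math

def log_change_to_percent(delta_log):
    """Convert a change in natural-log units to a percent change in original units."""
    return 100.0 * (math.exp(delta_log) - 1.0)

# Compare small log changes with the regression standard error of 0.356.
for s in (0.05, 0.1, 0.356):
    up = log_change_to_percent(s)
    down = log_change_to_percent(-s)
    print(f"+/-{s} log units -> {up:+.1f}% / {down:+.1f}%")
```

For 0.05 and 0.1 the upward and downward percentages are nearly mirror images, but for 0.356 they are roughly +43% and -30%, which is why the percentage interpretation breaks down for a standard error of that size.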

The residual statistics table shows that the distribution of the errors is very close to a normal distribution (the A-D* stat is very small, with a P-value on the order of 0.5), and the autocorrelations of the errors are insignificant at the first couple of lags.  (Unless we are looking for seasonal patterns, we usually are only concerned with the first couple of lags as far as autocorrelations are concerned.)

The line fit plot also looks very good:  the vertical deviations from the regression line are approximately the same size for large and small predictions, and they nicely fill the space between the 95% confidence limits.

The rest of the chart output of the model is shown farther down on this page, and it all looks reasonably good, so I will not discuss it further.  What is of special interest is the appearance of the forecasts that are generated by this model when they are translated from log units back into real units of cases.   This requires applying the EXP function to the forecasts and their lower and upper confidence limits generated by the log-log model.

All regression software has the capability to generate forecasts for additional values of the independent variables provided by the user.  Often the convention is for the program to automatically generate forecasts for any rows of data where the independent variables are all present and the dependent variable is missing.  That convention is followed in RegressIt.  Here, forecasts for log sales have been automatically generated for integer-valued prices in the range from $13 to $20 (numbers that are just outside the historical minimum and maximum), which were included in the original data file.

Here are the results of applying the EXP function to the numbers in the table above to convert them back to real units:

Note that all the numbers are positive, and the widths of the confidence intervals are proportional to the size of the forecasts.  They are not symmetric, though:  they are wider on the high side than the low side, which is logical.  Here’s a chart generated from the last table, which tells the story:  the nonlinear forecast curve captures the steeper slope of the pattern in the data at low price levels, and the confidence limits fit the magnitudes of random variations at both low and high price levels.
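These properties follow directly from the EXP transformation, as the following sketch illustrates.  The log-unit forecasts and the constant half-width of 0.75 are hypothetical values chosen for illustration, not the actual model output (in a real regression the half-width also varies somewhat with distance from the center of the data), and the function name back_transform is an assumption.

```python
import math

def back_transform(log_forecast, log_lower, log_upper):
    """Apply EXP to a forecast and its confidence limits made in log units."""
    return math.exp(log_forecast), math.exp(log_lower), math.exp(log_upper)

# Hypothetical log-unit forecasts with a constant interval half-width
# (illustrative numbers only, not taken from the beer sales model).
HALF_WIDTH = 0.75
log_forecasts = [6.5, 5.5, 4.5]

for lf in log_forecasts:
    forecast, lower, upper = back_transform(lf, lf - HALF_WIDTH, lf + HALF_WIDTH)
    print(f"forecast={forecast:8.1f}  lower={lower:8.1f}  upper={upper:8.1f}  "
          f"upper/forecast={upper / forecast:.3f}  lower/forecast={lower / forecast:.3f}")
```

An interval that is symmetric in log units becomes a pair of fixed multiplicative factors in real units, so the limits are always positive, the interval width is proportional to the forecast, and the upper side is wider than the lower side.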

For purposes of comparison, here is a similarly formatted chart of forecasts produced by the original regression model for unlogged data.  Which looks more reasonable?

The rest of the chart output from the log-log model is shown farther down on this page, and it looks fine as regression models go.  The take-aways from this step of the analysis are the following:

·       The log-log model is well supported by economic theory and it does a very plausible job of fitting the price-demand pattern in the beer sales data.

·       The nonlinear curve in the (unlogged) forecasts adapts to the steeper slope in the demand function at low price levels.

·       The way in which the width of the confidence intervals scales in proportion to the forecasts provides a good fit to the vertical distribution of sales values at different price levels.

·       It is impossible for the log-log model’s forecasts or confidence limits for real sales to be negative.

There is more to be done with this analysis, namely considering the effects of other variables.  Click here to proceed to that step.
