Notes
on linear regression analysis (pdf)

Introduction to linear regression analysis

Mathematics
of simple regression

Regression examples

·
Beer sales vs. price, part 1: descriptive
analysis

·
Beer sales vs. price, part 2: fitting a simple
model

·
Beer sales vs. price, part 3: transformations
of variables

·
Beer sales vs.
price, part 4: additional predictors

·
NC natural gas consumption vs. temperature

What to look for in
regression output

What’s a good
value for R-squared?

What's the bottom
line? How to compare models

Testing the assumptions of linear regression

Additional notes on regression analysis

Stepwise and all-possible-regressions

Excel file with
simple regression formulas

Excel file with regression formulas
in matrix form

If you are a PC Excel user, you *must* check this out:

RegressIt: free Excel add-in for linear
regression and multivariate data analysis

**A simple regression example: predicting baseball batting
averages**

Background and data
description

Larger
data file with many more variables (2.5M)

**Background and data description**

Sports
analytics is a booming field. Owners, coaches, and fans are using statistical
measures and models of all kinds to study the performance of players and
teams. A very simple example
is provided by the study of yearly data on batting averages for individual
players in the sport of baseball.
(If you are not familiar with the sport, stop here and watch this video which explains
it all. Another very highly
recommended reference is this
movie.) The file used here
contains 588 rows of data for a select group of players during the years 1960-2004,
and it was obtained from the Lahman Baseball
Database. **The objective of this
exercise will be to predict a player’s batting average in a given year
from his batting average in the previous year and/or his cumulative batting
average over all previous years for which data is available. **A much larger file with 4535 rows
and 82 columns--more players and more statistical measures of performance--is
also included among the links above, in case you would like to do more analysis
of your own.

A
player’s batting average is the ratio of his number of hits to his number
of opportunities-to-hit (so-called “at-bats”). There are 162 games in the season,
and a regular position player (a non-pitcher who is a starter at his position)
typically has 4 or 5 at-bats per game and accumulates 600 or more in a season
if he does not miss many games due to injuries or being benched or
suspended. Most players have
batting averages somewhere between 0.250 and 0.300. Because a player’s batting average
in a given year of his career is an average of a very large number of (almost)
statistically independent random variables, it might be expected to be normally
distributed around its hypothetical true value that is determined by his innate
hitting ability. Also, it is
reasonable to expect that these hypothetical true values are themselves
approximately normally distributed in the population and that they are to some
extent “inherited” from one year of a player’s career to the
next (like inheritability of size in Galton’s sweet
peas). Hence we should not be
surprised to find that the empirical distribution of batting averages of all
players across all years is very close to a normal distribution and that a
player’s performance in a given year is positively correlated with his
batting average in prior years and hence predictable by linear regression. It is not actually necessary for individual
variables in a regression model to be normally distributed—only the
prediction errors need to be normally distributed—but the case in which
all the variables *are* normally
distributed is the best-case scenario and yields the prettiest pictures. (The same type of analysis
could be done with respect to statistical measures of performance in other
sports—say, scoring averages or free-throw percentages in
basketball—and qualitatively similar results would be obtained. Regression-to-the-mean is found everywhere.)

Each row
in the data file contains statistics for a single player for a single year *in which the player had at least 400 at-bats
and also at least 400 at-bats in the previous year*. The latter constraint was imposed to
ensure that only regular players (the best on their teams at their respective
positions) were included and also so that the sample size of at-bats for each
player was large. The statistics
for the analysis consist of batting average, batting average lagged by one
year, and cumulative batting average lagged by one year. The first few rows look like this:

The term
“lagged” means “lagging behind” by a specified number
of periods, i.e., an observation of the same variable in an earlier
period. For example, Hank
Aaron’s value of 0.292 for BattingAverageLAG1 in 1961 is by definition
the same as the value of BattingAverage for him in 1960. In general in this file,
BattingAverageLAG1 in a given row is equal to BattingAverage in the previous
row *if the previous row corresponds to
the previous year for the same player*. (Return to
top of page.)

Let’s
start by looking at descriptive statistics of the 3 variables. Here are the specifications of the
analysis in RegressIt:

The
results are shown below. The mean
value of batting average is 0.277 (“two seventy seven” in baseball
language), and the means of the lagged averages and lagged cumulative averages
are only slightly different. The
correlation between batting average and lagged batting average (0.481) is a little
smaller than the correlation between batting average and lagged *cumulative *batting average (0.538),
i.e., **the player’s batting average
in a given year is better predicted by his cumulative history of performance
than by his performance in the immediately preceding year**.

Note that
the correlation between lagged average and lagged cumulative average is 0.754.
which is much higher than the others.
This is an artifact of the way the set of years to include for a given
player was determined. In the
larger database from which this file was extracted, there are many players with
little or no history prior to their first 400-at-bat year, in which case their
lagged batting average and cumulative lagged batting average are very similar
if not identical in their first record in this file. This is true for Hank Aaron, as seen
above.

Now
let’s look at the scatterplots of batting average versus the two lagged
variables. They show almost perfect
bivariate normal distributions, as we might have expected for the reasons
mentioned above:

The
regression lines and squared correlations are included on the scatterplots (a
feature of RegressIt), so we have a good idea what to expect if simple
regression models are fitted.
(Return to top of page.)

Because
CumulativeAverageLAG1 is slightly more predictive than BattingAverageLAG1,
let’s go ahead and fit the model in which it is the independent variable
from which BattingAverage is predicted.
Here are the model specifications in RegressIt:

The
summary output and line fit plot are shown below. **R-squared **is 0.290, which we already knew, and **adjusted R-squared** is only slightly different (0.288), given the
large sample size. The **standard error of the regression** (which
is approximately the standard deviation of a forecast error) is 0.026, meaning
that a 95% confidence interval around a forecast is roughly the point forecast
plus-or-minus 0.052: quite a lot of
uncertainty!

The
estimated **slope coefficient** is
0.698, which means that a player whose prior cumulative average deviated from
the mean by the amount x is predicted to have a batting average that deviates
from the mean by about 0.7x in the current year, i.e., his batting average is
predicted to regress-to-the-mean by 30% relative to his prior cumulative
batting average. Recall that **the way to think of a simple regression
model is that it predicts the dependent variable to deviate from its mean by an
amount equal to the slope coefficient times the deviation of the independent
variable from its mean.*** *If you think of it that way, you
don’t need to pay attention to the intercept. *The
precise value of the intercept is not of interest unless it is possible for the
independent variable to approach zero,* which it can’t in this case,
otherwise the player would not be given hundreds of chances to bat in a given
year.

The **standard error of the coefficient **is
the (estimated) standard deviation of the error in estimating it, and the **t-statistic** is the coefficient estimate
divided by its own standard error, i.e., its “number of standard errors
from zero.” A t-statistic
whose value is 2 or larger in magnitude is generally considered to indicate
that the coefficient estimate is “significantly” different from
zero and hence that the variable has some genuine predictive value in the
context of the model. Here the
standard error of the slope coefficient (0.045) is very small in comparison to
the point estimate of the coefficient (0.698), so the corresponding t-statistic
is huge (greater than 15).

For the
record, the rest of the model’s output is shown below. The interpretation of such statistics
and plots is discussed in depth in other sections of these notes, but they look
very good here: there is no sign of any problem with the model’s
assumptions that the relationship between the variables is linear and that the
errors are independently and identically normally distributed. (Jump to
next section: multiple regression.) (Return to
top of page.)

From the
pairwise correlations, we know that CumulativeAverageLAG1 is better than BattingAverageLAG1 for
purposes of predicting BattingAverage via simple regression, but it could be
true that including* both *variables in
the model is better than including only one of them. To test this hypothesis, let’s add
BattingAverageLAG1 as a second regressor.
The summary output is shown below.
Both variables are highly significant, although R-squared rises only
from 0.290 to 0.303 and the standard error of the regression is unchanged (to 3
digits of precision). The
estimated coefficients of BattingAverageLAG1 and CumulativeAverageLAG1 are
0.180 and 0.528, respectively, and both are significantly different from zero
as indicated their very large t-statistics (greater than3.3 and 7.7,
respectively)

On the
basis of the error stats alone, it would be hard to argue that the increase in
model complexity is worth it, except that the pattern of the coefficients is in
line with intuition. The
coefficient of lagged cumulative average was 0.698 in the first model. ** In the second model, both coefficients are
positive and their sum is 0.708**, which is essentially the same as the
coefficient in the first model.
This means that** the second model
merely reallocates some of the weight on prior performance to place a little
more on the most recent year’s average.** So, even though the second model does
not noticeably reduce the standard error of the regression with the addition of
another variable, it can still be defended on the basis that it makes sense to
place a little more weight on the most recently observed batting average value,
rather than use a somewhat arbitrary equally weighted average of all available
past data. This example illustrates
that in fitting a model, there is more to be done than to look at error statistics
and significance tests. Interpreting
the values and meaning of the coefficients is also a part of the analysis. (Return
to top of page.)

Go on to example #2: predicting weekly beer sales