What's the bottom line?
How to compare models
Testing the assumptions of linear regression
Additional notes on regression analysis
Stepwise and all-possible-regressions
Excel file with simple regression formulas
NEWS (12/14/2017): The 2018 version of RegressIt, a free Excel add-in for regression analysis which runs on both PCs and Macs, is now available. It has many innovative and unique features to support and teach thoughtful data analysis, including a 40-button navigation ribbon, multidimensional audit trail, built-in teaching notes, and an interface with R. Check it out!
Background and data description
Sports analytics is a booming field. Owners, coaches, and fans are using statistical measures and models of all kinds to study the performance of players and teams. A very simple example is provided by the study of yearly data on batting averages for individual players in the sport of baseball. (If you are not familiar with the sport, stop here and watch this video which explains it all. Another very highly recommended reference is this movie.) The file used here contains 588 rows of data for a select group of players during the years 1960-2004, and it was obtained from the Lahman Baseball Database. The objective of this exercise will be to predict a player’s batting average in a given year from his batting average in the previous year and/or his cumulative batting average over all previous years for which data is available. A much larger file with 4535 rows and 82 columns--more players and more statistical measures of performance--is also included among the links above, in case you would like to do more analysis of your own.
A player’s batting average is the ratio of his number of hits to his number of opportunities-to-hit (so-called “at-bats”). There are 162 games in the season, and a regular position player (a non-pitcher who is a starter at his position) typically has 4 or 5 at-bats per game and accumulates 600 or more in a season if he does not miss many games due to injuries or being benched or suspended. Most players have batting averages somewhere between 0.250 and 0.300. Because a player’s batting average in a given year of his career is an average of a very large number of (almost) statistically independent random variables, it might be expected to be normally distributed around its hypothetical true value that is determined by his innate hitting ability. Also, it is reasonable to expect that these hypothetical true values are themselves approximately normally distributed in the population and that they are to some extent “inherited” from one year of a player’s career to the next (like inheritability of size in Galton’s sweet peas). Hence we should not be surprised to find that the empirical distribution of batting averages of all players across all years is very close to a normal distribution and that a player’s performance in a given year is positively correlated with his batting average in prior years and hence predictable by linear regression. It is not actually necessary for individual variables in a regression model to be normally distributed—only the prediction errors need to be normally distributed—but the case in which all the variables are normally distributed is the best-case scenario and yields the prettiest pictures. (The same type of analysis could be done with respect to statistical measures of performance in other sports—say, scoring averages or free-throw percentages in basketball—and qualitatively similar results would be obtained. Regression-to-the-mean is found everywhere.)
Each row in the data file contains statistics for a single player for a single year in which the player had at least 400 at-bats and also at least 400 at-bats in the previous year. The latter constraint was imposed to ensure that only regular players (the best on their teams at their respective positions) were included and also so that the sample size of at-bats for each player was large. The statistics for the analysis consist of batting average, batting average lagged by one year, and cumulative batting average lagged by one year. The first few rows look like this:
The term “lagged” means “lagging behind” by a specified number of periods, i.e., an observation of the same variable in an earlier period. For example, Hank Aaron’s value of 0.292 for BattingAverageLAG1 in 1961 is by definition the same as the value of BattingAverage for him in 1960. In general in this file, BattingAverageLAG1 in a given row is equal to BattingAverage in the previous row if the previous row corresponds to the previous year for the same player. (Return to top of page.)
Let’s start by looking at descriptive statistics of the 3 variables. Here are the specifications of the analysis in RegressIt:
The results are shown below. The mean value of batting average is 0.277 (“two seventy seven” in baseball language), and the means of the lagged averages and lagged cumulative averages are only slightly different. The correlation between batting average and lagged batting average (0.481) is a little smaller than the correlation between batting average and lagged cumulative batting average (0.538), i.e., the player’s batting average in a given year is better predicted by his cumulative history of performance than by his performance in the immediately preceding year.
Note that the correlation between lagged average and lagged cumulative average is 0.754. which is much higher than the others. This is an artifact of the way the set of years to include for a given player was determined. In the larger database from which this file was extracted, there are many players with little or no history prior to their first 400-at-bat year, in which case their lagged batting average and cumulative lagged batting average are very similar if not identical in their first record in this file. This is true for Hank Aaron, as seen above.
Now let’s look at the scatterplots of batting average versus the two lagged variables. They show almost perfect bivariate normal distributions, as we might have expected for the reasons mentioned above:
The regression lines and squared correlations are included on the scatterplots (a feature of RegressIt), so we have a good idea what to expect if simple regression models are fitted. (Return to top of page.)
Because CumulativeAverageLAG1 is slightly more predictive than BattingAverageLAG1, let’s go ahead and fit the model in which it is the independent variable from which BattingAverage is predicted. Here are the model specifications in RegressIt:
The summary output and line fit plot are shown below. R-squared is 0.290, which we already knew, and adjusted R-squared is only slightly different (0.288), given the large sample size. The standard error of the regression (which is approximately the standard deviation of a forecast error) is 0.026, meaning that a 95% confidence interval around a forecast is roughly the point forecast plus-or-minus 0.052: quite a lot of uncertainty!
The estimated slope coefficient is 0.698, which means that a player whose prior cumulative average deviated from the mean by the amount x is predicted to have a batting average that deviates from the mean by about 0.7x in the current year, i.e., his batting average is predicted to regress-to-the-mean by 30% relative to his prior cumulative batting average. Recall that the way to think of a simple regression model is that it predicts the dependent variable to deviate from its mean by an amount equal to the slope coefficient times the deviation of the independent variable from its mean. If you think of it that way, you don’t need to pay attention to the intercept. The precise value of the intercept is not of interest unless it is possible for the independent variable to approach zero, which it can’t in this case, otherwise the player would not be given hundreds of chances to bat in a given year.
The standard error of the coefficient is the (estimated) standard deviation of the error in estimating it, and the t-statistic is the coefficient estimate divided by its own standard error, i.e., its “number of standard errors from zero.” A t-statistic whose value is 2 or larger in magnitude is generally considered to indicate that the coefficient estimate is “significantly” different from zero and hence that the variable has some genuine predictive value in the context of the model. Here the standard error of the slope coefficient (0.045) is very small in comparison to the point estimate of the coefficient (0.698), so the corresponding t-statistic is huge (greater than 15).
For the record, the rest of the model’s output is shown below. The interpretation of such statistics and plots is discussed in depth in other sections of these notes, but they look very good here: there is no sign of any problem with the model’s assumptions that the relationship between the variables is linear and that the errors are independently and identically normally distributed. (Jump to next section: multiple regression.) (Return to top of page.)
From the pairwise correlations, we know that CumulativeAverageLAG1 is better than BattingAverageLAG1 for purposes of predicting BattingAverage via simple regression, but it could be true that including both variables in the model is better than including only one of them. To test this hypothesis, let’s add BattingAverageLAG1 as a second regressor. The summary output is shown below. Both variables are highly significant, although R-squared rises only from 0.290 to 0.303 and the standard error of the regression is unchanged (to 3 digits of precision). The estimated coefficients of BattingAverageLAG1 and CumulativeAverageLAG1 are 0.180 and 0.528, respectively, and both are significantly different from zero as indicated their very large t-statistics (greater than3.3 and 7.7, respectively)
On the basis of the error stats alone, it would be hard to argue that the increase in model complexity is worth it, except that the pattern of the coefficients is in line with intuition. The coefficient of lagged cumulative average was 0.698 in the first model. In the second model, both coefficients are positive and their sum is 0.708, which is essentially the same as the coefficient in the first model. This means that the second model merely reallocates some of the weight on prior performance to place a little more on the most recent year’s average. So, even though the second model does not noticeably reduce the standard error of the regression with the addition of another variable, it can still be defended on the basis that it makes sense to place a little more weight on the most recently observed batting average value, rather than use a somewhat arbitrary equally weighted average of all available past data. This example illustrates that in fitting a model, there is more to be done than to look at error statistics and significance tests. Interpreting the values and meaning of the coefficients is also a part of the analysis. (Return to top of page.)