Linear regression models
Notes on linear regression analysis (pdf file)
Introduction to linear regression analysis
Mathematics of simple regression
· Baseball batting averages
· Beer sales vs. price, part 1: descriptive analysis
· Beer sales vs. price, part 2: fitting a simple model
· Beer sales vs. price, part 3: transformations of variables
· Beer sales vs. price, part 4: additional predictors
· NC natural gas consumption vs. temperature
· More regression datasets at regressit.com
What to look for in regression output
What’s a good value for R-squared?
What's the bottom line? How to compare models
Testing the assumptions of linear regression
Additional notes on regression analysis
Stepwise and all-possible-regressions
Excel file with simple regression formulas
Excel file with regression formulas in matrix form
Notes on logistic regression (new!)
If you use Excel in your work or in your teaching to any extent, you should check out the latest release of RegressIt, a free Excel add-in for linear and logistic regression. See it at regressit.com. The linear regression version runs on both PC's and Macs and has a richer and easier-to-use interface and much better designed output than other add-ins for statistical analysis. It may make a good complement if not a substitute for whatever regression software you are currently using, Excel-based or otherwise. RegressIt is an excellent tool for interactive presentations, online teaching of regression, and development of videos of examples of regression modeling. It includes extensive built-in documentation and pop-up teaching notes as well as some novel features to support systematic grading and auditing of student work on a large scale. There is a separate logistic regression version with highly interactive tables and charts that runs on PC's. RegressIt also now includes a two-way interface with R that allows you to run linear and logistic regression models in R without writing any code whatsoever.
If you have been using Excel's own Data Analysis add-in for regression (Analysis Toolpak), this is the time to stop. It has not changed since it was first introduced in 1993, and it was a poor design even then. It's a toy (a clumsy one at that), not a tool for serious work. Visit this page for a discussion: What's wrong with Excel's Analysis Toolpak for regression
Additional notes on linear regression analysis
To include or not to include the CONSTANT?
Interpreting STANDARD ERRORS, "t" STATISTICS, and SIGNIFICANCE LEVELS of coefficients
Interpreting the F-RATIO
Interpreting measures of multicollinearity: CORRELATIONS AMONG COEFFICIENT ESTIMATES and VARIANCE INFLATION FACTORS
Interpreting CONFIDENCE INTERVALS
TYPES of confidence intervals
Dealing with OUTLIERS
Caution: MISSING VALUES may cause variations in SAMPLE SIZE
MULTIPLICATIVE regression models and the LOGARITHM transformation
To include or not to include the CONSTANT?
Most multiple regression models include a constant term (i.e., an "intercept"), since this ensures that the model will be unbiased--i.e., the mean of the residuals will be exactly zero. (The coefficients in a regression model are estimated by least squares--i.e., minimizing the mean squared error. Now, the mean squared error is equal to the variance of the errors plus the square of their mean: this is a mathematical identity. Changing the value of the constant in the model changes the mean of the errors but doesn't affect the variance. Hence, if the sum of squared errors is to be minimized, the constant must be chosen such that the mean of the errors is zero.) In a simple regression model, the constant represents the Y-intercept of the regression line, in unstandardized form. In a multiple regression model, the constant represents the value that would be predicted for the dependent variable if all the independent variables were simultaneously equal to zero--a situation which may not be physically or economically meaningful. If you are not particularly interested in what would happen if all the independent variables were simultaneously zero, then you normally leave the constant in the model regardless of its statistical significance. In addition to ensuring that the in-sample errors are unbiased, the presence of the constant allows the regression line to "seek its own level" and provide the best fit to data which may only be locally linear.
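The identity in the parenthetical above (mean squared error = variance of the errors + square of their mean) can be checked numerically. The following is a minimal sketch using made-up data and a hand-coded simple regression; the data values and the helper name error_summary are illustrative assumptions, not part of the original notes.

```python
# Hypothetical data, chosen only to illustrate the identity.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]


def mean(values):
    return sum(values) / len(values)


# Least-squares slope and constant for a simple regression of y on x.
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
constant = y_bar - slope * x_bar


def error_summary(c):
    """Return (MSE, variance of errors, mean error) when the constant is set to c."""
    errors = [yi - (c + slope * xi) for xi, yi in zip(x, y)]
    m = mean(errors)
    variance = mean([(e - m) ** 2 for e in errors])  # population-style variance
    mse = mean([e ** 2 for e in errors])
    return mse, variance, m


# Shift the constant away from its least-squares value and back again.
for c in (constant - 0.5, constant, constant + 0.5):
    mse, variance, m = error_summary(c)
    # The identity MSE = variance + mean^2 holds for every choice of constant.
    assert abs(mse - (variance + m ** 2)) < 1e-12
    print(f"constant={c:.3f}  mean error={m:+.3f}  variance={variance:.4f}  MSE={mse:.4f}")
```

In this sketch the variance column stays the same for all three constants, while the mean error and the MSE change, so the MSE is smallest at the least-squares constant, where the mean error is zero.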
However, in rare cases you may wish to exclude the constant from the model. This is a model-fitting option in the regression procedure in any software package, and it is sometimes referred to as regression through the origin, or RTO for short. Usually, this will be done only if (i) it is possible to imagine the independent variables all assuming the value zero simultaneously, and you feel that in this case it should logically follow that the dependent variable will also be equal to zero; or else (ii) the constant is redundant with the set of independent variables you wish to use. An example of case (i) would be a model in which all variables--dependent and independent--represented first differences of other time series. If you are regressing the first difference of Y on the first difference of X, you are directly predicting changes in Y as a linear function of changes in X, without reference to the current levels of the variables. In this case it might be reasonable (although not required) to assume that Y should be unchanged, on the average, whenever X is unchanged--i.e., that Y should not have an upward or downward trend in the absence of any change in the level of X. An example of case (ii) would be a situation in which you wish to use a full set of seasonal indicator variables--e.g., you are using quarterly data, and you wish to include variables Q1, Q2, Q3, and Q4 representing additive seasonal effects. Thus, Q1 might look like 1 0 0 0 1 0 0 0 ..., Q2 would look like 0 1 0 0 0 1 0 0 ..., and so on. You could not use all four of these and a constant in the same model, since Q1+Q2+Q3+Q4 = 1 1 1 1 1 1 1 1 . . . . , which is the same as a constant term. I.e., the five variables Q1, Q2, Q3, Q4, and CONSTANT are not linearly independent: any one of them can be expressed as a linear combination of the other four. 
A technical prerequisite for fitting a linear regression model is that the independent variables must be linearly independent; otherwise the least-squares coefficients cannot be determined uniquely, and we say the regression "fails."
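The quarterly-dummy example can be made concrete by counting how many of the columns are linearly independent. The sketch below is an illustration only: the helper rank is a hypothetical name, written here with exact fractions and Gaussian elimination so that it needs nothing beyond the Python standard library.

```python
from fractions import Fraction


def rank(columns):
    """Rank of the matrix whose columns are given, by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in zip(*columns)]
    rank_found = 0
    for col in range(len(columns)):
        # Look for a row at or below the pivot position with a nonzero entry.
        pivot = next(
            (r for r in range(rank_found, len(rows)) if rows[r][col] != 0), None
        )
        if pivot is None:
            continue  # this column depends on the columns already processed
        rows[rank_found], rows[pivot] = rows[pivot], rows[rank_found]
        for r in range(len(rows)):
            if r != rank_found and rows[r][col] != 0:
                factor = rows[r][col] / rows[rank_found][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[rank_found])]
        rank_found += 1
    return rank_found


periods = 8  # two years of quarterly data
# q[0] is Q1 = 1 0 0 0 1 0 0 0, q[1] is Q2 = 0 1 0 0 0 1 0 0, and so on.
q = [[1 if t % 4 == k else 0 for t in range(periods)] for k in range(4)]
constant = [1] * periods

print(rank(q + [constant]))      # five columns, but only 4 are independent
print(rank(q[1:] + [constant]))  # drop Q1: four columns, all 4 independent
```

Dropping any one of the five variables (one seasonal indicator or the constant) restores linear independence, which is why either choice lets the coefficients be determined uniquely.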
A word of warning: R-squared and the F statistic do not have the same meaning in an RTO model as they do in an ordinary regression model, and they are not calculated in the same way by all software. See page 77 of this article for the formulas and some caveats about RTO in general. You should not try to compare R-squared between models that do and do not include a constant term, although it is OK to compare the standard error of the regression.
Note that the term "independent" is used in (at least) three different ways in regression jargon: any single variable may be called an independent variable if it is being used as a predictor, rather than as the predictee. A group of variables is linearly independent if no one of them can be expressed exactly as a linear combination of the others. A pair of variables is said to be statistically independent if they are not only linearly independent but also utterly uninformative with respect to each other. In a regression model, you want your dependent variable to be statistically dependent on the independent variables, which must be linearly (but not necessarily statistically) independent among themselves. Got it?
Interpreting STANDARD ERRORS, t-STATISTICS, AND SIGNIFICANCE LEVELS OF COEFFICIENTS
Your regression output not only gives point estimates of the coefficients of the variables in the regression equation, it also gives information about the precision of these estimates. Under the assumption that your regression model is correct--i.e., that the dependent variable really is a linear function of the independent variables, with independent and identically normally distributed errors--the coefficient estimates are expected to be unbiased and their errors are normally distributed. The standard errors of the coefficients are the (estimated) standard deviations of the errors in estimating them. In general, the standard error of the coefficient for variable X is equal to the standard error of the regression times a factor that depends only on the values of X and the other independent variables (not on Y), and which is roughly inversely proportional to the standard deviation of X. Now, the standard error of the regression may be considered to measure the overall amount of "noise" in the data, whereas the standard deviation of X measures the strength of the "signal" in X. Hence, you can think of the standard error of the estimated coefficient of X as the reciprocal of the signal-to-noise ratio for observing the effect of X on Y. The larger the standard error of the coefficient estimate, the worse the signal-to-noise ratio--i.e., the less precise the measurement of the coefficient.
The t-statistics for the independent variables are equal to their coefficient estimates divided by their respective standard errors. In theory, the t-statistic of any one variable may be used to test the hypothesis that the true value of the coefficient is zero (which is to say, the variable should not be included in the model). If the regression model is correct (i.e., satisfies the "four assumptions"), then the estimated values of the coefficients should be normally distributed around the true values. In particular, if the true value of a coefficient is zero, then its estimated coefficient should be normally distributed with mean zero. If the standard deviation of this normal distribution were exactly known, then the coefficient estimate divided by the (known) standard deviation would have a standard normal distribution, with a mean of 0 and a standard deviation of 1. But the standard deviation is not exactly known; instead, we have only an estimate of it, namely the standard error of the coefficient estimate. Now, the coefficient estimate divided by its standard error does not have the standard normal distribution, but instead something closely related: the "Student's t" distribution with n - p degrees of freedom, where n is the number of observations fitted and p is the number of coefficients estimated, including the constant. The t distribution resembles the standard normal distribution, but has somewhat fatter tails--i.e., relatively more extreme values. However, the difference between the t and the standard normal is negligible if the number of degrees of freedom is more than about 30.
In a standard normal distribution, only 5% of the values fall outside the range plus-or-minus 2. Hence, as a rough rule of thumb, a t-statistic larger than 2 in absolute value would have a 5% or smaller probability of occurring by chance if the true coefficient were zero. Most stat packages will compute for you the exact probability of exceeding the observed t-value by chance if the true coefficient were zero. This is labeled as the "P-value" or "significance level" in the table of model coefficients. A low value for this probability indicates that the coefficient is significantly different from zero, i.e., it seems to contribute something to the model.
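For a simple regression, the quantities described above can be computed by hand. The sketch below uses made-up data and assumes the standard simple-regression result that the standard error of the slope equals the standard error of the regression divided by (standard deviation of X times the square root of n - 1); the variable names are illustrative.

```python
import math

# Hypothetical data for a simple regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 2.3, 2.8, 4.5, 4.9, 6.4, 6.8, 8.1]
n, p = len(x), 2  # p counts the slope and the constant

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
constant = y_bar - slope * x_bar

residuals = [yi - (constant + slope * xi) for xi, yi in zip(x, y)]
# Standard error of the regression: the "noise", with n - p degrees of freedom.
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - p))

# Standard deviation of X: the strength of the "signal" in X.
sd_x = math.sqrt(sxx / (n - 1))

# Standard error of the slope is proportional to s and inversely proportional to sd_x.
se_slope = s / (sd_x * math.sqrt(n - 1))
t_slope = slope / se_slope

print(f"slope={slope:.4f}  se={se_slope:.4f}  t={t_slope:.2f}  df={n - p}")
print("passes the |t| > 2 rule of thumb:", abs(t_slope) > 2)
```

The exact P-value would come from the t distribution with n - p degrees of freedom, which this sketch leaves to a stat package.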
Usually you are on the lookout for variables that could be removed without seriously affecting the standard error of the regression. A low t-statistic (or equivalently, a moderate-to-large exceedance probability) for a variable suggests that the standard error of the regression would not be adversely affected by its removal. The commonest rule-of-thumb in this regard is to remove the least important variable if its t-statistic is less than 2 in absolute value, and/or the exceedance probability is greater than .05. Of course, the proof of the pudding is still in the eating: if you remove a variable with a low t-statistic and this leads to an undesirable increase in the standard error of the regression (or deterioration of some other statistics, such as residual autocorrelations), then you should probably put it back in.
Generally you should only add or remove variables one at a time, in a stepwise fashion, since when one variable is added or removed, the other variables may increase or decrease in significance. For example, if X1 is the least significant variable in the original regression, but X2 is almost equally insignificant, then you should try removing X1 first and see what happens to the estimated coefficient of X2: the latter may remain insignificant after X1 is removed, in which case you might try removing X2 as well, or it may rise in significance (with a very different estimated value), in which case you may wish to leave it in.
Note: the t-statistic is usually not used as a basis for deciding whether or not to include the constant term. Usually the decision to include or exclude the constant is based on a priori reasoning, as noted above. If it is included, it may not have direct economic significance, and you generally don't scrutinize its t-statistic too closely.
Interpreting the F-RATIO
The F-ratio and its exceedance probability provide a test of the significance of all the independent variables (other than the constant term) taken together. The variance of the dependent variable may be considered to initially have n-1 degrees of freedom, since n observations are initially available (each including an error component that is "free" from all the others in the sense of statistical independence); but one degree of freedom is used up in computing the sample mean around which to measure the variance--i.e., in estimating the constant term alone. As noted above, the effect of fitting a regression model with p coefficients including the constant is to decompose this variance into an "explained" part and an "unexplained" part. The explained part may be considered to have used up p-1 degrees of freedom (since this is the number of coefficients estimated besides the constant), and the unexplained part has the remaining unused n - p degrees of freedom.
The F-ratio is the ratio of the explained-variance-per-degree-of-freedom-used to the unexplained-variance-per-degree-of-freedom-unused, i.e.:
F = ((Explained variance)/(p-1) )/((Unexplained variance)/(n - p))
Now, a set of n observations could in principle be perfectly fitted by a model with a constant and any n - 1 linearly independent other variables--i.e., n total variables--even if the independent variables had no predictive power in a statistical sense. This suggests that any irrelevant variable added to the model will, on the average, account for a fraction 1/(n-1) of the original variance. Thus, if the true values of the coefficients are all equal to zero (i.e., if all the independent variables are in fact irrelevant), then each coefficient estimated might be expected to merely soak up a fraction 1/(n - 1) of the original variance. In this case, the numerator and the denominator of the F-ratio should both have approximately the same expected value; i.e., the F-ratio should be roughly equal to 1. On the other hand, if the coefficients are really not all zero, then they should soak up more than their share of the variance, in which case the F-ratio should be significantly larger than 1. Standard regression output includes the F-ratio and also its exceedance probability--i.e., the probability of getting as large or larger a value merely by chance if the true coefficients were all zero. (In Statgraphics this is shown in the ANOVA table obtained by selecting "ANOVA" from the tabular options menu that appears after fitting the model. The ANOVA table is also hidden by default in RegressIt output but can be displayed by clicking the "+" symbol next to its title.) As with the exceedance probabilities for the t-statistics, smaller is better. A low exceedance probability (say, less than .05) for the F-ratio suggests that at least some of the variables are significant.
In a simple regression model, the F-ratio is simply the square of the t-statistic of the (single) independent variable, and the exceedance probability for F is the same as that for t. In a multiple regression model, the exceedance probability for F will generally be smaller than the lowest exceedance probability of the t-statistics of the independent variables (other than the constant). Hence, if at least one variable is known to be significant in the model, as judged by its t-statistic, then there is really no need to look at the F-ratio. The F-ratio is useful primarily in cases where each of the independent variables is only marginally significant by itself but there are a priori grounds for believing that they are significant when taken as a group, in the context of a model where there is a logical way to group them. For example, the independent variables might be dummy variables for treatment levels in a designed experiment, and the question might be whether there is evidence for an overall effect, even if its fine detail cannot be quantified precisely with the given sample size.
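The F-ratio formula and its relation to the t-statistic in a simple regression can be verified on a small example. This is a sketch with made-up data; it uses sums of squares in place of the "explained" and "unexplained" variances, which gives the same ratio because both are divided by their degrees of freedom in the same way.

```python
import math

# Hypothetical data for a simple regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 2.3, 2.8, 4.5, 4.9, 6.4, 6.8, 8.1]
n, p = len(x), 2  # p coefficients, including the constant

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
constant = y_bar - slope * x_bar

# Decompose the total variation around the mean into explained and unexplained parts.
total_ss = sum((yi - y_bar) ** 2 for yi in y)
residual_ss = sum((yi - (constant + slope * xi)) ** 2 for xi, yi in zip(x, y))
explained_ss = total_ss - residual_ss

# F = (explained per degree of freedom used) / (unexplained per degree of freedom unused).
f_ratio = (explained_ss / (p - 1)) / (residual_ss / (n - p))

# t-statistic of the single independent variable.
s = math.sqrt(residual_ss / (n - p))
t_slope = slope / (s / math.sqrt(sxx))

print(f"F = {f_ratio:.3f}   t^2 = {t_slope ** 2:.3f}")
```

With more than one independent variable the two statistics no longer coincide, since F then tests all of the coefficients (other than the constant) jointly.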
Interpreting measures of multicollinearity: CORRELATIONS AMONG COEFFICIENT ESTIMATES and VARIANCE INFLATION FACTORS
It often happens that there are several available independent variables that are plausibly linearly related to the dependent variable but also strongly linearly related to each other. That is to say, their information value is not really independent with respect to prediction of the dependent variable in the context of a linear model. (Such a situation is often observed, for example, when the independent variables are a collection of economic indicators that are computed from some of the same underlying data via weighted averages.) This condition is referred to as multicollinearity. In the most extreme cases of multicollinearity--e.g., when one of the independent variables is an exact linear combination of some of the others--the regression calculation will fail, and you will need to look closely at the definitions of your variables to determine which ones are the culprits. Sometimes one variable is merely a rescaled copy of another variable or a sum or difference of other variables, and sometimes a set of dummy variables adds up to a constant variable.
The correlation matrix of the estimated coefficients (if your software includes it) is one diagnostic tool for detecting relative degrees of multicollinearity. It shows the extent to which particular pairs of variables provide independent information for purposes of predicting the dependent variable, given the presence of other variables in the model. Extremely high values here (say, much above 0.9 in absolute value) suggest that some pairs of variables are not providing independent information. In this case, either (i) both variables are providing the same information--i.e., they are redundant; or (ii) there is some linear function of the two variables (e.g., their sum or difference) that summarizes the information they carry.
In case (i)--i.e., redundancy--the estimated coefficients of the two variables are often large in magnitude, with standard errors that are also large, and they are not economically meaningful. When this happens, it is usually desirable to try removing one of them, usually the one whose coefficient has the higher P-value. In case (ii), it may be possible to replace the two variables by the appropriate linear function (e.g., their sum or difference) if you can identify it, but this is not strictly necessary. This situation often arises when two or more different lags of the same variable are used as independent variables in a time series regression model. (Coefficient estimates for different lags of the dependent variable are often highly correlated, but this may or may not be a problem unless the pattern of their coefficients is such that there is a “unit root” in the resulting equation. See the mathematics-of-ARIMA-models notes for more discussion of unit roots.)
Many statistical analysis programs report variance inflation factors (VIF’s), which are another measure of multicollinearity, in addition to or instead of the correlation matrix of coefficient estimates. The VIF of an independent variable is the value of 1 divided by 1-minus-R-squared in a regression of itself on the other independent variables. The rule of thumb here is that a VIF larger than 10 is an indicator of potentially significant multicollinearity between that variable and one or more others. (Note that a VIF larger than 10 means that the regression of that independent variable on the others has an R-squared of greater than 90%.) If this is observed, it means that the variable in question does not contain much independent information in the presence of all the other variables, taken as a group. When this happens, it often happens for many variables at once, and it may take some trial and error to figure out which one(s) ought to be removed. However, like most other diagnostic tests, the VIF-greater-than-10 test is not a hard-and-fast rule, just an arbitrary threshold that indicates the possibility of a problem. In this case it indicates a possibility that the model could be simplified, perhaps by deleting variables or perhaps by redefining them in a way that better separates their contributions.
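The VIF definition can be sketched for the simplest case of two independent variables, where the R-squared from regressing one on the other (with a constant) is just their squared correlation. The data and function names below are illustrative assumptions; with more than two independent variables the R-squared would come from a multiple regression instead.

```python
def r_squared_simple(u, v):
    """R-squared from regressing u on v with a constant (the squared correlation)."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv ** 2 / (suu * svv)


def vif(r_squared):
    """Variance inflation factor: 1 divided by (1 minus R-squared)."""
    return 1.0 / (1.0 - r_squared)


# Hypothetical predictors.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9]  # nearly a copy of x1
x3 = [5.0, 1.0, 4.0, 8.0, 2.0, 7.0, 3.0, 6.0]  # only weakly related to x1

for name, other in (("x2", x2), ("x3", x3)):
    r2 = r_squared_simple(x1, other)
    print(f"x1 paired with {name}: R-squared={r2:.4f}  VIF={vif(r2):.1f}")

# The rule-of-thumb threshold: R-squared of 90% corresponds to a VIF of 10.
print(f"VIF at R-squared 0.90: {vif(0.90):.1f}")
```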
Interpreting CONFIDENCE INTERVALS
Suppose that you fit a regression model to a certain time series--say, some sales data--and the fitted model predicts that sales in the next period will be $83.421M. Does this mean you should expect sales to be exactly $83.421M? Of course not. This is merely what we would call a "point estimate" or "point prediction." It should really be considered as an average taken over some range of likely values. For a point estimate to be really useful, it should be accompanied by information concerning its degree of precision--i.e., the width of the range of likely values. We would like to be able to state how confident we are that actual sales will fall within a given distance--say, $5M or $10M--of the predicted value of $83.421M.
In "classical" statistical methods such as linear regression, information about the precision of point estimates is usually expressed in the form of confidence intervals. For example, the regression model above might yield the additional information that "the 95% confidence interval for next period's sales is $75.910M to $90.932M." Does this mean that, based on all the available data, we should conclude that there is a 95% probability of next period's sales falling in the interval from $75.910M to $90.932M? That is, should we consider it a "19-to-1 long shot" that sales would fall outside this interval, for purposes of betting? The answer to this is: no, not strictly speaking.
Rather, a 95% confidence interval is an interval calculated by a formula having the property that, in the long run, it will cover the true value 95% of the time in situations in which the correct model has been fitted. In other words, if everybody all over the world used this formula on correct models fitted to his or her data, year in and year out, then you would expect an overall average "hit rate" of 95%. Alas, you never know for sure whether you have identified the correct model for your data, although residual diagnostics help you rule out obviously incorrect ones. So, on your data today there is no guarantee that 95% of the computed confidence intervals will cover the true values, nor that a single confidence interval has, based on the available data, a 95% chance of covering the true value. This is not to say that a confidence interval cannot be meaningfully interpreted, but merely that it shouldn't be taken too literally in any single case, especially if there is any evidence that some of the model assumptions are not correct.
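The long-run "hit rate" interpretation can be illustrated by simulation. The sketch below assumes the simplest possible correct model, a regression with only a constant term fitted to independent normal data, and repeatedly computes a 95% confidence interval for the true mean. All parameter values are made up, and 2.228 is the standard 95% critical t-value for 10 degrees of freedom.

```python
import math
import random

random.seed(12345)  # fixed seed so the run is repeatable

TRUE_MEAN, TRUE_SD = 50.0, 10.0  # hypothetical "correct model": i.i.d. normal data
N = 11                           # sample size, so n - 1 = 10 degrees of freedom
T_CRIT = 2.228                   # 95% critical t-value for 10 degrees of freedom
TRIALS = 4000                    # number of independent data sets

hits = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = sum(sample) / N
    sd = math.sqrt(sum((v - m) ** 2 for v in sample) / (N - 1))
    half_width = T_CRIT * sd / math.sqrt(N)
    # Count the interval as a hit if it covers the true mean.
    if m - half_width <= TRUE_MEAN <= m + half_width:
        hits += 1

hit_rate = hits / TRIALS
print(f"hit rate over {TRIALS} intervals: {hit_rate:.3f}")
```

The hit rate comes out close to 95% here only because the data really are generated by the model being fitted; the simulation says nothing about what happens when the model is wrong.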
In fitting a model to a given data set, you are often simultaneously estimating many things: e.g., coefficients of different variables, predictions for different future observations, etc. Thus, a model for a given data set may yield many different sets of confidence intervals. You may wonder whether it is valid to take the long-run view here: e.g., if I calculate 95% confidence intervals for "enough different things" from the same data, can I expect about 95% of them to cover the true values? The answer to this is: not necessarily.
This is another issue that depends on the correctness of the model and the representativeness of the data set, particularly in the case of time series data. If the model is not correct or there are unusual patterns in the data, then if the confidence interval for one period's forecast fails to cover the true value, it is relatively more likely that the confidence interval for a neighboring period's forecast will also fail to cover the true value, because the model may have a tendency to make the same error for several periods in a row.
Ideally, you would like your confidence intervals to be as narrow as possible: more precision is preferred to less. Does this mean that, when comparing alternative forecasting models for the same time series, you should always pick the one that yields the narrowest confidence intervals around forecasts? That is, should narrow confidence intervals for forecasts be considered as a sign of a "good fit?" The answer, alas, is: no, not necessarily.
If the model's assumptions are correct, the confidence intervals it yields will be realistic guides to the precision with which future observations can be predicted. If the assumptions are not correct, it may yield confidence intervals that are all unrealistically wide or all unrealistically narrow. That is to say, a bad model does not necessarily know it is a bad model, and warn you by giving extra-wide confidence intervals. (This is especially true of trend-line models, which often yield overoptimistically narrow confidence intervals for forecasts.) You need to judge whether the model is good or bad by looking at the rest of the output.
Notwithstanding these caveats, confidence intervals are indispensable, since they are usually the only estimates of the degree of precision in your coefficient estimates and forecasts that are provided by most stat packages. And, if (i) your data set is sufficiently large, and your model passes the diagnostic tests concerning the "4 assumptions of regression analysis," and (ii) you don't have strong prior feelings about what the coefficients of the variables in the model should be, then you can treat a 95% confidence interval as an approximate 95% probability interval. (In the long run.)
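TYPES of confidence intervals
In regression forecasting, you may be concerned with point estimates and confidence intervals for some or all of the following:
· The coefficients of the independent variables
· The mean of the dependent variable (i.e., the true height of the regression line) for given values of the independent variables
· A prediction of the dependent variable for given values of the independent variables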
In all cases, there is a simple relationship between the point estimate and its surrounding confidence interval:
(Confidence Interval) = (Point Estimate) ± (Critical t-value) x (Standard Deviation or Standard Error)
For a 95% confidence interval, the "critical t value" is the value that is exceeded with probability 0.025 (one-tailed) in a t distribution with n-p degrees of freedom, where p is the number of coefficients in the model--including the constant term if any. (In general, for a 100*(1-x) percent confidence interval, you would use the t value exceeded with probability x/2.) If the number of degrees of freedom is large--say, more than 30--the t distribution closely resembles the standard normal distribution, and the relevant critical t value for a 95% confidence interval is approximately equal to 2. (More precisely, it is 1.96.) In this case, therefore, the 95% confidence interval is roughly equal to the point estimate "plus or minus two standard deviations." Here is a selection of critical t values to use for different confidence intervals and different numbers of degrees of freedom, taken from a standard table of the t distribution:
Degrees of          t-value for confidence interval
freedom (n-p)       50%      80%      90%      95%
--------------    ------   ------   ------   ------
10                 0.700    1.372    1.812    2.228
20                 0.687    1.325    1.725    2.086
30                 0.683    1.310    1.697    2.042
60                 0.679    1.296    1.671    2.000
Infinite           0.674    1.282    1.645    1.960
A t-distribution with "infinite" degrees of freedom is a standard normal distribution.
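The relationship between a point estimate and its confidence interval can be sketched as a small lookup-and-compute function. The critical values below are copied from the table above; the function name confidence_interval and the example coefficient and standard error are hypothetical.

```python
import math

# Critical t-values from the table: {degrees of freedom: {confidence level in %: t}}.
T_TABLE = {
    10: {50: 0.700, 80: 1.372, 90: 1.812, 95: 2.228},
    20: {50: 0.687, 80: 1.325, 90: 1.725, 95: 2.086},
    30: {50: 0.683, 80: 1.310, 90: 1.697, 95: 2.042},
    60: {50: 0.679, 80: 1.296, 90: 1.671, 95: 2.000},
    math.inf: {50: 0.674, 80: 1.282, 90: 1.645, 95: 1.960},
}


def confidence_interval(point_estimate, standard_error, df, level=95):
    """Point estimate plus or minus (critical t-value) x (standard error)."""
    half_width = T_TABLE[df][level] * standard_error
    return point_estimate - half_width, point_estimate + half_width


# Hypothetical coefficient estimate of 1.50 with standard error 0.40
# from a model with n - p = 20 degrees of freedom.
for level in (50, 80, 90, 95):
    low, high = confidence_interval(1.50, 0.40, df=20, level=level)
    print(f"{level}% interval: {low:.3f} to {high:.3f}")
```

This sketch only handles the degrees of freedom and confidence levels that appear in the table; a stat package would compute the critical value for any combination.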
The "standard error" or "standard deviation" in the above equation depends on the nature of the thing for which you are computing the confidence interval. For the confidence interval around a coefficient estimate, this is simply the "standard error of the coefficient estimate" that appears beside the point estimate in the coefficient table. (Recall that this is proportional to the standard error of the regression, and inversely proportional to the standard deviation of the independent variable.)
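For a confidence interval for the mean (i.e., the true height of the regression line), the relevant standard deviation is referred to as the "standard deviation of the mean" at that point. This quantity depends on the following factors:
· The standard errors of the coefficient estimates
· The correlations among the coefficient estimates
· The values of the independent variables at that point, in particular their distances from their respective means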
Other things being equal, the standard deviation of the mean--and hence the width of the confidence interval around the regression line--increases with the standard errors of the coefficient estimates, increases with the distances of the independent variables from their respective means, and decreases with the degree of correlation between the coefficient estimates. However, in a model characterized by "multicollinearity", the standard errors of the coefficients and
For a confidence interval around a prediction based on the regression line at some point, the relevant standard deviation is called the "standard deviation of the prediction." It reflects the error in the estimated height of the regression line plus the true error, or "noise," that is hypothesized in the basic model:
DATA = SIGNAL + NOISE
In this case, the regression line represents your best estimate of the true signal, and the standard error of the regression is your best estimate of the standard deviation of the true noise. Now (trust me), for essentially the same reason that the fitted values are uncorrelated with the residuals, it is also true that the errors in estimating the height of the regression line are uncorrelated with the true errors. Therefore, the variances of these two components of error in each prediction are additive. Since variances are the squares of standard deviations, this means:
(Standard deviation of prediction)^2 = (Standard deviation of mean)^2 + (Standard error of regression)^2
Note that, whereas the standard error of the regression is a fixed number, the standard deviations of the predictions and the standard deviations of the means will usually vary from point to point in time, since they depend on the values of the independent variables. However, the standard error of the regression is typically much larger than the standard errors of the means at most points, hence the standard deviations of the predictions will often not vary by much from point to point, and will be only slightly larger than the standard error of the regression.
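The additive relationship between the two variances can be worked through for a simple regression. The sketch below uses made-up data and assumes the standard simple-regression formula for the standard deviation of the mean at a point x0, namely s times the square root of 1/n + (x0 - mean of X)^2 / (sum of squared deviations of X); that formula is not given in the notes above.

```python
import math

# Hypothetical data for a simple regression of y on x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 2.3, 2.8, 4.5, 4.9, 6.4, 6.8, 8.1]
n, p = len(x), 2

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
constant = y_bar - slope * x_bar

# Standard error of the regression: one fixed number for the whole model.
s = math.sqrt(
    sum((yi - (constant + slope * xi)) ** 2 for xi, yi in zip(x, y)) / (n - p)
)


def sd_mean(x0):
    """Standard deviation of the mean (estimated height of the line) at x0."""
    return s * math.sqrt(1.0 / n + (x0 - x_bar) ** 2 / sxx)


def sd_prediction(x0):
    """Variances add: (sd of prediction)^2 = (sd of mean)^2 + (std. error of regression)^2."""
    return math.sqrt(sd_mean(x0) ** 2 + s ** 2)


print(f"standard error of the regression: {s:.4f}")
for x0 in (x_bar, 8.0, 12.0):
    print(f"x0={x0:5.2f}  sd of mean={sd_mean(x0):.4f}  sd of prediction={sd_prediction(x0):.4f}")
```

In this example the standard deviation of the mean grows as x0 moves away from the mean of X, while the standard deviation of the prediction stays at or above the standard error of the regression everywhere.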
It is possible to compute confidence intervals for either means or predictions around the fitted values and/or around any true forecasts which may have been generated. Statgraphics and RegressIt will automatically generate forecasts rather than fitted values wherever the dependent variable is "missing" but the independent variables are not. Confidence intervals for the forecasts are also reported. Here is an example of a plot of forecasts with confidence limits for means and forecasts produced by RegressIt for the regression model fitted to the natural log of cases of 18-packs sold. If you look closely, you will see that the confidence intervals for means (represented by the inner set of bars around the point forecasts) are noticeably wider for extremely high or low values of price, while the confidence intervals for forecasts are not.
One of the underlying assumptions of linear regression analysis is that the distribution of the errors is approximately normal with a mean of zero. A normal distribution has the property that about 68% of the values will fall within 1 standard deviation from the mean (plus-or-minus), 95% will fall within 2 standard deviations, and 99.7% will fall within 3 standard deviations. Hence, a value more than 3 standard deviations from the mean will occur only rarely: less than one out of 300 observations on the average. Now, the residuals from fitting a model may be considered as estimates of the true errors that occurred at different points in time, and the standard error of the regression is the estimated standard deviation of their distribution. Hence, if the normality assumption is satisfied, you should rarely encounter a residual whose absolute value is greater than 3 times the standard error of the regression. An observation whose residual is much greater in absolute value than 3 times the standard error of the regression is therefore usually called an "outlier." In the "Reports" option in the Statgraphics regression procedure, residuals greater than 3 times the standard error of the regression are marked with an asterisk (*). In the residual table in RegressIt, residuals with absolute values larger than 2.5 times the standard error of the regression are highlighted in boldface and those whose absolute values are larger than 3.5 times the standard error of the regression are further highlighted in red font. Outliers are also readily spotted on time-plots and normal probability plots of the residuals.
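As a rough illustration of the 3-standard-error rule, here is a minimal sketch that fits a simple regression and flags residuals whose absolute value exceeds a chosen multiple of the standard error of the regression. The data are made up (a linear trend with one deliberately displaced observation), and the helper name `flag_outliers` is hypothetical; it is not a function from Statgraphics or RegressIt.

```python
import math

# Illustrative data (made up): a linear trend with small alternating noise,
# plus one observation (index 14) displaced upward by 10 units.
n = 30
x = [float(i + 1) for i in range(n)]
y = [2.0 + 0.5 * xi + (0.5 if i % 2 == 0 else -0.5) for i, xi in enumerate(x)]
y[14] += 10.0

# Fit the simple regression by least squares.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Residuals and the standard error of the regression (n - 2 degrees of freedom).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))


def flag_outliers(residuals, s, threshold=3.0):
    """Return indices of residuals larger in absolute value than threshold * s."""
    return [i for i, e in enumerate(residuals) if abs(e) > threshold * s]


flagged = flag_outliers(residuals, s)
for i in flagged:
    print(f"observation {i}: residual = {residuals[i]:.2f}, "
          f"{abs(residuals[i]) / s:.2f} standard errors")
```

Note that the outlier itself inflates the standard error of the regression, so in a small sample a bad observation can partly mask itself. Passing `threshold=2.5` or `threshold=3.5` would mimic the two highlighting levels described above.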
If your data set contains hundreds of observations, an outlier or two may not be cause for alarm. But outliers can spell trouble for models fitted to small data sets: since the sum of squares of the residuals is the basis for estimating parameters and calculating error statistics and confidence intervals, one or two bad outliers in a small data set can badly skew the results. When outliers are found, two questions should be asked: (i) are they merely "flukes" of some kind (e.g., data entry errors, or the result of exceptional conditions that are not expected to recur), or do they represent real and potentially repeatable events whose effects ought to be measured (either by keeping them in the model or investigating separately); and (ii) how much have the coefficients, error statistics, and predictions, etc., been affected?
An outlier may or may not have a dramatic effect on a model, depending on the amount of "leverage" that it has. Its leverage depends on the values of the independent variables at the point where it occurred: if the independent variables were all relatively close to their mean values, then the outlier has little leverage and will mainly affect the value of the estimated CONSTANT term and the standard error of the regression. However, if one or more of the independent variables had relatively extreme values at that point, the outlier may have a large influence on the estimates of the corresponding coefficients: e.g., it may cause an otherwise insignificant variable to appear significant, or vice versa.
The best way to determine how much leverage an outlier (or group of outliers) has is to exclude it from fitting the model and compare the results with those originally obtained. You can do this in Statgraphics by using the WEIGHTS option: e.g., if outliers occur at observations 23 and 59, and you have already created a time-index variable called INDEX, you could type:
INDEX <> 23 & INDEX <> 59
in the WEIGHTS field on the input panel, and then re-fit the model. In RegressIt you can just delete the values of the dependent variable in those rows. (Be sure to keep a copy of them, though! In this sort of exercise, it is best to copy all the values of the dependent variable to a new column, assign it a new variable name, then delete the desired values in the new column and use it as the new dependent variable.) Forecasts will automatically be generated for the excluded or missing values of the dependent variable in either program. The discrepancies between the forecasts and the actual values, measured in terms of the corresponding standard-deviations-of-predictions, provide a guide to how "surprising" these observations really were.
An alternative method, which is often used in stat packages lacking a WEIGHTS option, is to "dummy out" the outliers: i.e., add a dummy variable for each outlier to the set of independent variables. These observations will then be fitted with zero error independently of everything else, and the same coefficient estimates, predictions, and confidence intervals will be obtained as if they had been excluded outright. (However, statistics such as R-squared and MAE will be somewhat different, since they depend on the sum-of-squares of the original observations as well as the sum of squared residuals, and/or they fail to correct for the number of coefficients estimated.) In Statgraphics, to dummy-out the observations at periods 23 and 59, you could add the two variables:
INDEX = 23
INDEX = 59
to the set of independent variables on the model-definition panel. In RegressIt you could create these variables by filling two new columns with 0’s and then entering 1’s in rows 23 and 59 and assigning variable names to those columns. The estimated coefficients for the two dummy variables would exactly equal the difference between the offending observations and the predictions generated for them by the model.
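The equivalence between excluding an outlier and dummying it out can be verified directly. The sketch below uses NumPy's least-squares solver on made-up data with one displaced observation; the variable names and data values are assumptions chosen for illustration, and the fit is ordinary least squares rather than the output of either program mentioned above.

```python
import numpy as np

# Illustrative data (made up): 12 observations on a linear trend,
# with the observation at index 7 displaced upward by 9 units.
x = np.arange(1.0, 13.0)
noise = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.0, 0.3, -0.3, 0.2, -0.1])
y = 4.0 + 1.5 * x + noise
outlier = 7
y[outlier] += 9.0

ones = np.ones_like(x)

# (a) Exclude the outlier and fit constant + slope to the remaining rows.
keep = np.arange(len(x)) != outlier
X_excl = np.column_stack([ones[keep], x[keep]])
b_excl, *_ = np.linalg.lstsq(X_excl, y[keep], rcond=None)

# (b) Keep all rows but add a dummy variable equal to 1 only at the outlier.
dummy = (np.arange(len(x)) == outlier).astype(float)
X_dummy = np.column_stack([ones, x, dummy])
b_dummy, *_ = np.linalg.lstsq(X_dummy, y, rcond=None)

# Forecast for the excluded observation from model (a).
forecast = b_excl[0] + b_excl[1] * x[outlier]

print("constant and slope, outlier excluded:", b_excl)
print("constant and slope, outlier dummied :", b_dummy[:2])
print("dummy coefficient                   :", b_dummy[2])
print("actual minus forecast               :", y[outlier] - forecast)
```

The constant and slope agree between the two fits, and the dummy coefficient matches the gap between the actual value and the forecast from the model fitted without it.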
If it turns out the outlier (or group thereof) does have a significant effect on the model, then you must ask whether there is justification for throwing it out. Go back and look at your original data and see if you can think of any explanations for outliers occurring where they did. Sometimes you will discover data entry errors: e.g., "2138" might have been punched instead of "3128." You may discover some other reason: e.g., a strike or stock split occurred, a regulation or accounting method was changed, the company treasurer ran off to Panama, etc. In this case, you must use your own judgment as to whether to merely throw the observations out, or leave them in, or perhaps alter the model to account for additional effects.
CAUTION: MISSING VALUES MAY CAUSE VARIATIONS IN SAMPLE SIZE
When dealing with many variables, particularly ones that may have been obtained from different sources, it is not uncommon for some of them to have missing values, often at the beginning or end (due to different amounts of history and/or the use of time transformations such as lagging and differencing), but sometimes in the middle as well. This may create a situation in which the size of the sample to which the model is fitted may vary from model to model, sometimes by a lot, as different variables are added or removed. (In general the estimation procedure will use all rows of data in which none of the currently selected variables has missing values.) You should always keep your eye on the sample size that is reported in your output, to make sure there are no surprises. Small differences in sample sizes are not necessarily a problem if the data set is large, but you should be alert for situations in which relatively many rows of data suddenly go missing when more variables are added to the model. If this does occur, then you may have to choose between (a) not using the variables that have significant numbers of missing values, or (b) deleting all rows of data in which any of the variables have missing values, so that the sample will be the same for any model that is fitted.
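To see how the usable sample can shrink as variables are added, here is a minimal sketch that counts the rows in which none of the selected variables is missing. The data set, the variable names, and the helper `usable_rows` are all hypothetical; missing values are represented by `None`.

```python
# Illustrative data (made up). X2 has a shorter history and a gap in the
# middle; X1_LAG1 loses its first row because of the lag transformation.
data = {
    "Y":       [5.0, 6.1, 5.8, 7.2, 6.9, 7.5, 8.1, 8.0],
    "X1":      [1.0, 1.2, 1.1, 1.5, 1.4, 1.6, 1.8, 1.7],
    "X2":      [None, None, 3.0, 3.2, 3.1, None, 3.6, 3.5],
    "X1_LAG1": [None, 1.0, 1.2, 1.1, 1.5, 1.4, 1.6, 1.8],
}


def usable_rows(variables):
    """Indices of rows in which none of the selected variables is missing."""
    columns = [data[name] for name in variables]
    return [i for i, row in enumerate(zip(*columns))
            if all(value is not None for value in row)]


models = [
    ["Y", "X1"],
    ["Y", "X1", "X1_LAG1"],
    ["Y", "X1", "X2"],
    ["Y", "X1", "X2", "X1_LAG1"],
]
for variables in models:
    rows = usable_rows(variables)
    print(f"{', '.join(variables):<24} n = {len(rows)}  rows = {rows}")

# Option (b) from the text: restrict every model to the rows that are
# complete for all candidate variables, so the sample is the same throughout.
common_sample = usable_rows(list(data))
```

Comparing the printed sample sizes across models is the same check as watching the sample size reported in the regression output.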
Another thing to be aware of in regard to missing values is that automated model selection methods such as stepwise regression base their calculations on a covariance matrix computed in advance from rows of data where all of the candidate variables have non-missing values, hence the variable selection process will overlook the fact that different sample sizes are available for different models. For this reason, the value of R-squared that is reported for a given model in the stepwise regression output may not be the same as you would get if you fitted that model by itself.
MULTIPLICATIVE REGRESSION MODELS AND THE LOGARITHM TRANSFORMATION
The basic linear regression model assumes that the contributions of the different independent variables to the prediction of the dependent variable are additive. For example, if X1 and X2 are assumed to contribute additively to Y, the prediction equation of the regression model is:
Ŷt = b0 + b1X1t + b2X2t
Here, if X1 increases by one unit, other things being equal, then Y is expected to increase by b1 units. That is, the absolute change in Y is proportional to the absolute change in X1, with the coefficient b1 representing the constant of proportionality. Similarly, if X2 increases by 1 unit, other things equal, Y is expected to increase by b2 units. And if both X1 and X2 increase by 1 unit, then Y is expected to change by b1 + b2 units. That is, the total expected change in Y is determined by adding the effects of the separate changes in X1 and X2.
In some situations, though, it may be felt that the dependent variable is affected multiplicatively by the independent variables. This means that on the margin (i.e., for small variations) the expected percentage change in Y should be proportional to the percentage change in X1, and similarly for X2. And further, if X1 and X2 both change, then on the margin the expected total percentage change in Y should be the sum of the percentage changes that would have resulted separately. (For large variations, the percentages would be compounded, not added.) The appropriate model for this situation is the multiplicative regression model:
Ŷt = b0 (X1t ^ b1)(X2t ^ b2)
Here, Y is proportional to the product of X1 and X2, each raised to some power, whose value we can try to estimate from the data. (I am using Excel notation here, in which “^” stands for “raised to the power of.”) The coefficients b1 and b2 are referred to as the elasticities of Y with respect to X1 and X2, respectively. If either of them is equal to 1, we say that the response of Y to that variable has unitary elasticity--i.e., the expected marginal percentage change in Y is exactly the same as the percentage change in the independent variable. If the coefficient is less than 1, the response is said to be inelastic--i.e., the expected percentage change in Y will be somewhat less than the percentage change in the independent variable.
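A short calculation may help show what "on the margin" means here. In the multiplicative model the constant b0 cancels when taking ratios, so the exact percentage change in Y depends only on the percentage change in X and the elasticity. The sketch below compares that exact change with the marginal approximation (elasticity times percentage change in X); the function name `pct_change_in_y` and the elasticity of 0.5 are assumptions for illustration.

```python
def pct_change_in_y(pct_change_in_x, elasticity):
    """Exact percentage change in Y when X changes by the given percentage,
    under Y = b0 * X ** elasticity (b0 cancels out of the ratio)."""
    return ((1.0 + pct_change_in_x / 100.0) ** elasticity - 1.0) * 100.0


elasticity = 0.5  # hypothetical inelastic response
for pct_x in (1.0, 10.0, 50.0):
    exact = pct_change_in_y(pct_x, elasticity)
    marginal = elasticity * pct_x
    print(f"X up {pct_x:4.0f}%: exact change in Y = {exact:6.3f}%, "
          f"marginal approximation = {marginal:6.3f}%")
```

For a 1% change in X the two figures nearly coincide, while for a 50% change the exact figure falls noticeably short of the approximation, which is the compounding effect for large variations.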
The multiplicative model, in its raw form above, cannot be fitted using linear regression techniques. However, it can be converted into an equivalent linear model via the logarithm transformation. The natural logarithm function (LOG in Statgraphics, LN in Excel and RegressIt and most other mathematical software) has the property that it converts products into sums: LOG(X1X2) = LOG(X1)+LOG(X2), for any positive X1 and X2. Also, it converts powers into multipliers: LOG(X1^b1) = b1(LOG(X1)). Using these rules, we can apply the logarithm transformation to both sides of the above equation:
LOG(Ŷt) = LOG(b0 (X1t ^ b1)(X2t ^ b2))
= LOG(b0) + b1LOG(X1t) + b2LOG(X2t)
Thus, LOG(Y) is a linear function of LOG(X1) and LOG(X2). (See the log transformation page for a more detailed discussion of properties and uses of the log transformation.) This model can be fitted in the Statgraphics multiple-regression procedure by specifying LOG(Y) as the dependent variable and LOG(X1) and LOG(X2) as the independent variables. The estimated coefficients of LOG(X1) and LOG(X2) will represent estimates of the powers of X1 and X2 in the original multiplicative form of the model, i.e., the estimated elasticities of Y with respect to X1 and X2. The estimated CONSTANT term will represent the logarithm of the multiplicative constant b0 in the original multiplicative model. In RegressIt, the variable-transformation procedure can be used to create new variables that are the natural logs of the original variables, which can be used to fit the new model. In this case, if the variables were originally named Y, X1 and X2, they would automatically be assigned the names Y_LN, X1_LN and X2_LN.
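The log-log fitting procedure can be sketched in a few lines. The example below generates noise-free data from a multiplicative model with made-up parameter values (b0 = 3, b1 = 0.8, b2 = -1.5), regresses the natural log of Y on the natural logs of X1 and X2 using NumPy's least-squares solver, and recovers the elasticities and the multiplicative constant. The data and parameter values are assumptions for illustration; real data would include noise, so the estimates would only approximate the true values.

```python
import numpy as np

# Hypothetical "true" parameters of the multiplicative model
# Y = b0 * X1**b1 * X2**b2 (values chosen only for illustration).
b0_true, b1_true, b2_true = 3.0, 0.8, -1.5

# Illustrative positive-valued predictors (made up).
X1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
X2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
Y = b0_true * X1 ** b1_true * X2 ** b2_true  # noise-free for clarity

# Regress LN(Y) on a constant, LN(X1), and LN(X2).
design = np.column_stack([np.ones_like(X1), np.log(X1), np.log(X2)])
coef, *_ = np.linalg.lstsq(design, np.log(Y), rcond=None)
constant, elasticity_x1, elasticity_x2 = coef

# The fitted constant is the log of b0, so exponentiate to recover b0.
b0_hat = np.exp(constant)

print("estimated elasticity of Y w.r.t. X1:", elasticity_x1)
print("estimated elasticity of Y w.r.t. X2:", elasticity_x2)
print("estimated multiplicative constant b0:", b0_hat)
```

Note that `np.log` is the natural logarithm, corresponding to LN in Excel and RegressIt and LOG in Statgraphics.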
Another situation in which the logarithm transformation may be used is in "normalizing" the distribution of one or more of the variables, even if a priori the relationships are not known to be multiplicative. It is technically not necessary for the dependent or independent variables to be normally distributed--only the errors in the predictions are assumed to be normal. However, when the dependent and independent variables are all continuously distributed, the assumption of normally distributed errors is often more plausible when those distributions are approximately normal. If some of the variables have highly skewed distributions (e.g., runs of small positive values with occasional large positive spikes), it may be difficult to fit them into a linear model yielding normally distributed errors. Scatterplots involving such variables will be very strange looking: the points will be bunched up at the bottom and/or the left (although strictly positive). And, if a regression model is fitted using the skewed variables in their raw form, the distribution of the predictions and/or the dependent variable will also be skewed, which may yield non-normal errors. In this case it may be possible to make their distributions more normal-looking by applying the logarithm transformation to them.
The log transformation is also commonly used in modeling price-demand relationships. See the beer sales model on this web site for an example.