Standard error of the regression and other measures of error
Adjusted R-squared (not the bottom line!)
Significance of the estimated coefficients
Values of the estimated coefficients
Plots of forecasts and residuals (important!)
Out-of-sample validation
For a sample of output that illustrates the various topics discussed here, see the “Regression Example, part 2” page.
(i) Standard error of the regression (root-mean-squared error adjusted for degrees of freedom): Does the current regression model yield smaller errors, on average, than the best model previously fitted, and is the improvement significant in practical terms? In regression modeling, the best single error statistic to look at is the standard error of the regression, which is the estimated standard deviation of the unexplainable variations in the dependent variable. (It is approximately the standard deviation of the errors, apart from the degrees-of-freedom adjustment.) This is what your software is trying to minimize when estimating coefficients, and it is a sufficient statistic for describing properties of the errors if the model’s assumptions are all correct.
Furthermore, the standard error of the regression is a lower bound on the standard error of any forecast generated from the model. In general the forecast standard error will be a little larger because it also takes into account the errors in estimating the coefficients and the relative extremeness of the values of the independent variables for which the forecast is being computed. If the sample size is large and the values of the independent variables are not extreme, the forecast standard error will be only slightly larger than the standard error of the regression. (See page 14 of these notes for more details.)
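As a rough illustration of the relationship described above, the following sketch fits a one-variable regression in plain Python and compares the standard error of the regression with the forecast standard error at a central and an extreme value of X. The data and the helper names (`fit_simple_regression`, `forecast_standard_error`) are made up for illustration, and the forecast formula is the textbook one for a simple regression with a constant, not the output of any particular package.

```python
import math

def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x (illustrative helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    # Standard error of the regression: RMSE adjusted for the
    # two estimated coefficients (intercept and slope).
    s = math.sqrt(sse / (n - 2))
    return intercept, slope, s, x_bar, sxx

def forecast_standard_error(x_new, n, s, x_bar, sxx):
    """Forecast standard error at x_new for a simple regression."""
    # Standard error of the estimated mean response at x_new: it grows
    # as x_new moves away from the sample mean of x.
    se_mean = s * math.sqrt(1.0 / n + (x_new - x_bar) ** 2 / sxx)
    # Combine the coefficient-estimation error with the intrinsic error s.
    return math.sqrt(s ** 2 + se_mean ** 2)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope, s, x_bar, sxx = fit_simple_regression(x, y)
se_center = forecast_standard_error(3, len(x), s, x_bar, sxx)    # at the mean of x
se_extreme = forecast_standard_error(10, len(x), s, x_bar, sxx)  # far outside the sample

print(round(s, 4), round(se_center, 4), round(se_extreme, 4))
```

With only five observations the gap between the two statistics is exaggerated; with a large sample and a non-extreme X the forecast standard error would be much closer to the standard error of the regression.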
You should directly compare the standard error of the regression between models only if their units are the same and they are fitted to the same (or almost the same) sample of the same dependent variable. RegressIt provides a Model Summary Report that shows side-by-side comparisons of error measures and coefficient estimates for models fitted to the same dependent variable, in order to make such comparisons easy, although sample sizes may vary if there are missing values in any independent variables that are not included in all models.
In time series forecasting, it is common to look not only at root-mean-squared error but also the mean absolute error (MAE) and, for positive data, the mean absolute percentage error (MAPE) in evaluating and comparing model performance. The latter measures are easier for non-specialists to understand and they are less sensitive to extreme errors, if the occasional big mistake is not a serious concern. Also, it is sometimes appropriate to compare MAPE between models fitted to different samples of data, because it is a relative rather than absolute measure. These two statistics are not routinely reported by most regression software, however.
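Because these statistics are often not reported, it can be handy to compute them yourself. The sketch below uses their usual definitions on made-up numbers; the function names are illustrative, and the RMSE shown is the plain version without a degrees-of-freedom adjustment.

```python
import math

def rmse(actual, forecast):
    """Root-mean-squared error (no degrees-of-freedom adjustment)."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def mae(actual, forecast):
    """Mean absolute error, in the units of the data."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)

def mape(actual, forecast):
    """Mean absolute percentage error; requires strictly positive actual values."""
    return 100.0 * sum(abs(a - f) / a for a, f in zip(actual, forecast)) / len(actual)

# Made-up data for illustration only.
actual = [100, 200, 300, 400]
forecast = [110, 190, 330, 400]

print(rmse(actual, forecast), mae(actual, forecast), mape(actual, forecast))
```

In this example the single error of 30 pulls RMSE well above MAE, which is the sense in which the absolute-error measures are less sensitive to the occasional big mistake.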
Whenever you are working with time series data, you should also ask: does the current regression model improve on the best naive (random walk or random trend) model, according to these error measures? The mean absolute scaled error statistic measures improvement in mean absolute error relative to a random-walk-without-drift model. And how has the model been doing lately? Are its most recent errors typical in size and random-looking, or are they getting bigger or more biased?
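The text does not give a formula for the mean absolute scaled error, so the sketch below assumes one common definition: the model's MAE divided by the MAE of one-step random-walk forecasts, in which each value is predicted by the one before it. The data and the function name are made up for illustration.

```python
def mean_absolute_scaled_error(actual, forecast):
    """MAE of the model divided by the MAE of a random-walk-without-drift forecast."""
    model_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # Random-walk forecast: the prediction for period t is the actual value at t-1.
    naive_errors = [abs(actual[t] - actual[t - 1]) for t in range(1, len(actual))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return model_mae / naive_mae

# Made-up data for illustration only.
actual = [10.0, 12.0, 11.0, 13.0, 14.0]
fitted = [10.5, 11.5, 11.5, 12.5, 14.0]

mase = mean_absolute_scaled_error(actual, fitted)
print(mase)  # a value below 1 means the model beats the random walk on MAE
```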
(ii) Adjusted R-squared: This is R-squared (the fraction by which the variance of the errors is less than the variance of the dependent variable) adjusted for the number of coefficients in the model relative to the sample size in order to correct it for bias (the same adjustment used in computing the standard error of the regression). That is, adjusted R-squared is the fraction by which the square of the standard error of the regression is less than the variance of the dependent variable. It is the most over-used and abused of all statistics--don't get obsessed with it. R-squared is not the bottom line. All it measures is the percentage reduction in mean-squared-error that the regression model achieves relative to the naive model "Y=constant", which may or may not be the appropriate naive model for purposes of comparison. Better to determine the best naive model first, and then compare the various error measures of your regression model (both in the estimation and validation periods) against that naive model.
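The two descriptions of adjusted R-squared given above can be checked against each other numerically. The sketch below is only an illustration: the data and fitted values are made up, `num_coefficients` is assumed to count the constant as well as the slope coefficients, and the function name is hypothetical.

```python
def r_squared_stats(y, fitted, num_coefficients):
    """Return (R-squared, adjusted R-squared) for given fitted values."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # sum of squared errors
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    r2 = 1 - sse / sst
    # Square of the standard error of the regression (degrees-of-freedom adjusted).
    s_squared = sse / (n - num_coefficients)
    # Sample variance of the dependent variable.
    var_y = sst / (n - 1)
    adj_r2 = 1 - s_squared / var_y
    return r2, adj_r2

# Made-up data: fitted values from a line with intercept 0.05 and slope 1.99.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]

r2, adj_r2 = r_squared_stats(y, fitted, num_coefficients=2)

# Equivalent textbook form: adjust R-squared directly by the ratio of degrees of freedom.
n, p = len(y), 2
adj_r2_alt = 1 - (1 - r2) * (n - 1) / (n - p)

print(r2, adj_r2, adj_r2_alt)
```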
Despite the fact that adjusted R-squared is a unitless statistic, there is no absolute standard for what is a "good" value. A regression model fitted to non-stationary time series data can have an adjusted R-squared of 99% and yet be inferior to a simple random walk model. On the other hand, a regression model fitted to stationarized time series data might have an adjusted R-squared of 10%-20% and still be considered useful (although out-of-sample validation would be advisable--see section (vi) below). A designed experiment looking for small but statistically significant effects in a very large sample might accept even lower values. See this page for more about these issues.
(iii) Significance of the estimated coefficients: Are the t-statistics greater than 2 in magnitude, corresponding to p-values less than 0.05? If they are not, you should probably try to refit the model with the least significant variable excluded, which is the "backward stepwise" approach to model refinement.
Remember that the t-statistic is just the estimated coefficient divided by its own standard error. Thus, it measures "how many standard deviations from zero" the estimated coefficient is, and it is used to test the null hypothesis that the true value of the coefficient is zero, in order to judge whether the independent variable really belongs in the model.
The p-value is the probability of observing a t-statistic that large or larger in magnitude given the null hypothesis that the true coefficient value is zero. If the p-value is greater than 0.05--which occurs roughly when the t-statistic is less than 2 in absolute value--this means that the estimated coefficient may differ from zero only by accident, i.e., it is not significant at the 0.05 level.
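The link between a t-statistic of about 2 and a p-value of about 0.05 can be checked with a few lines of Python. This is only a sketch: an exact p-value uses the t distribution with the model's residual degrees of freedom, which the standard library does not provide, so the code below substitutes a normal approximation that is reasonable only for large samples. The coefficient, standard error, and function names are made up.

```python
from statistics import NormalDist

def t_statistic(coefficient, standard_error):
    """Estimated coefficient divided by its own standard error."""
    return coefficient / standard_error

def approx_two_sided_p_value(t_stat):
    """Two-sided p-value using a normal approximation to the t distribution.

    Assumption: the sample is large enough that the t distribution is
    close to the standard normal. For small samples this understates p.
    """
    return 2 * (1 - NormalDist().cdf(abs(t_stat)))

# Made-up estimate for illustration only.
t = t_statistic(coefficient=0.8, standard_error=0.4)
p = approx_two_sided_p_value(t)

print(t, round(p, 4))
```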
There's nothing magical about the 0.05 criterion, but in practice it usually turns out that a variable whose estimated coefficient has a p-value of greater than 0.05 can be dropped from the model without affecting the error measures very much--try it and see!
(iv) Values of the estimated coefficients: In general you are interested not only in the statistical significance of an independent variable but also in its practical significance. What does it imply in real terms? What have you learned, and how should you spend your time or money? In theory, the coefficient of a given independent variable is its proportional effect on the average value of the dependent variable, other things being equal. In business and weapons-making, this is often called "bang for the buck". Such information can be very useful for decision-making if some of the independent variables are under your control, for example, the amount of a drug administered to a patient, the price of a product, or the amount of money spent on promoting it. (See this page for an example involving the effects of several prices.) Keep in mind that when sample sizes are very large, an effect that is really quite tiny (say, the marginal benefit of an expensive new medical treatment) could appear to be quite large if all you look at is its t-statistic!
In some cases the interesting hypothesis is not whether the value of a certain coefficient is equal to zero, but whether it is equal to some other value. For example, if one of the independent variables is merely the dependent variable lagged by one period (i.e., an autoregressive term), then the interesting question is whether its coefficient is equal to one. If so, then the model is effectively predicting the difference in the dependent variable, rather than predicting its level, in which case you can simplify the model by differencing the dependent variable and deleting the lagged version of itself from the list of independent variables.
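Regression output normally reports the t-statistic for the hypothesis that a coefficient is zero, so a test against some other value has to be computed by hand from the coefficient and its standard error. A minimal sketch, with made-up numbers and a hypothetical helper name:

```python
def t_stat_against(coefficient, standard_error, null_value=0.0):
    """t-statistic for the hypothesis that the true coefficient equals null_value."""
    return (coefficient - null_value) / standard_error

# Made-up estimate for the coefficient of the lagged dependent variable.
b_lag, se_lag = 0.97, 0.02

t_vs_zero = t_stat_against(b_lag, se_lag)      # the test software usually reports
t_vs_one = t_stat_against(b_lag, se_lag, 1.0)  # the test of interest here

print(t_vs_zero, t_vs_one)
```

In this example the coefficient is many standard errors away from zero but within about two standard errors of one, so the data would not rule out a coefficient of one.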
Sometimes patterns in the magnitudes and signs of lagged variables are of interest. For example if both X and LAG(X,1) are included in the model, and their estimated coefficients turn out to have similar magnitudes but opposite signs, this suggests that they could both be replaced by a single DIFF(X) term.
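The reason is simple algebra: b*X - b*LAG(X,1) = b*DIFF(X). The sketch below confirms this on made-up numbers. The helpers `lag` and `diff` are hypothetical stand-ins that mimic the LAG(X,1) and DIFF(X) notation, using `None` for the values that are undefined at the start of the series.

```python
def lag(series, periods=1):
    """Shift a series back by `periods` (must be >= 1), padding the start with None."""
    return [None] * periods + series[:-periods]

def diff(series):
    """First difference: each value minus the preceding value."""
    lagged = lag(series, 1)
    return [None if prev is None else cur - prev for cur, prev in zip(series, lagged)]

# Made-up data and coefficient for illustration only.
x = [5.0, 7.0, 6.0, 9.0]
b = 2.0

x_lag = lag(x)
# Contribution of X and LAG(X,1) with coefficients +b and -b ...
two_term = [None if xl is None else b * xi - b * xl for xi, xl in zip(x, x_lag)]
# ... versus a single DIFF(X) term with coefficient b.
one_term = [None if d is None else b * d for d in diff(x)]

print(two_term)
print(one_term)
```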
(v) Plots of forecasts and residuals: DO NOT FAIL TO LOOK AT PLOTS OF THE FORECASTS AND ERRORS. (Some software makes this hard: it may be necessary to execute a separate procedure or write additional code in order to produce a single plot, and even a small amount of extra work is sometimes a barrier to careful analysis.) Do the forecasts "track" the data in a satisfactory way, apart from the inevitable regression to the mean? (In the case of time series data, you are especially concerned with how the model fits the data at the "business end", i.e., the most recent values. An example of a very bad fit is given here.) Do the residuals appear random, or do you see some systematic patterns in their signs or magnitudes? Are they free from trends, autocorrelation, and heteroscedasticity? Are they normally distributed? There are a variety of statistical tests for these sorts of problems, but the best way to determine whether they are present and whether they are serious is to look at the pictures.
If heteroscedasticity and/or non-normality is a problem, you may wish to consider a nonlinear transformation of the dependent variable, such as logging or deflating, if such transformations are appropriate for your data. (Remember that logging converts multiplicative relationships to additive relationships, so that when you log the dependent variable, you are implicitly assuming that the relationships among the original variables are multiplicative.)
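To see why logging can help, consider a made-up series that grows by a constant 10% per period, which is a multiplicative pattern. In original units the period-to-period changes keep growing, while in logged units they are constant. This is only a sketch of the idea, not a test for whether logging is appropriate for your data.

```python
import math

# Made-up series with constant 10% growth per period (multiplicative pattern).
y = [100 * 1.1 ** t for t in range(6)]

# Changes in original units grow along with the level of the series ...
changes = [y[t] - y[t - 1] for t in range(1, len(y))]

# ... but changes in logged units are constant (additive pattern).
log_changes = [math.log(y[t]) - math.log(y[t - 1]) for t in range(1, len(y))]

print([round(c, 2) for c in changes])
print([round(c, 4) for c in log_changes])
```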
If autocorrelation is a problem, you should probably consider changing the model so as to implicitly or explicitly include lagged variables--e.g., try stationarizing the dependent and independent variables via differencing, or add lags of the dependent and/or independent variables to the regression equation, or introduce an autoregressive error correction. In Statgraphics, you can just enter DIFF(X) or LAG(X,1) as the variable name if you want to use the first difference or 1-period-lagged value of X in the input to a procedure. In RegressIt, lagging and differencing are options on the Variable Transformation menu. Of course, when working in Excel, it is possible to use formulas to create transformed variables of any kind, although there are advantages to letting the software do it for you: it makes the process more user-friendly, it reduces the possibility for error, and it makes the output self-documenting in terms of how transformed variables were created.
You do not usually rank (i.e., choose among) models on the basis of their residual diagnostic tests, but bad residual diagnostics indicate that the model's error measures may be unreliable and that there are probably better models out there somewhere.
(vi) Out-of-sample validation: If you have enough data to hold out a sizable portion for validation and if your software offers this feature, you should compare the performance of the models in the validation as well as estimation periods. (See this page for an example of out-of-sample validation.) A good model should have small error measures in both the estimation and validation periods, compared to other models, and its validation period statistics should be similar to its own estimation period statistics. Regression models with many independent variables are especially susceptible to overfitting the data in the estimation period, so watch out for models that have suspiciously low error measures in the estimation period and disappointingly high error measures in the validation period.
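If your software does not offer a hold-out option, the comparison can be done by hand. The following is a minimal sketch, assuming a one-variable model, made-up data, and a hold-out of the last 4 of 12 observations. The helper names are hypothetical, and the RMSE is computed without a degrees-of-freedom adjustment so that the two periods are measured the same way.

```python
import math

def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def rmse(x, y, intercept, slope):
    """Root-mean-squared error of the fitted line on the given data."""
    errors = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

# Made-up data: a linear trend plus an alternating disturbance.
x = list(range(1, 13))
noise = [0.5 if xi % 2 == 1 else -0.5 for xi in x]
y = [1 + 2 * xi + e for xi, e in zip(x, noise)]

# Estimate on the first 8 observations, validate on the last 4.
n_est = 8
x_est, y_est = x[:n_est], y[:n_est]
x_val, y_val = x[n_est:], y[n_est:]

intercept, slope = fit_line(x_est, y_est)
rmse_est = rmse(x_est, y_est, intercept, slope)
rmse_val = rmse(x_val, y_val, intercept, slope)

print(round(rmse_est, 4), round(rmse_val, 4))
```

Here the validation-period error is somewhat larger than the estimation-period error, which is typical; a much larger gap would be a warning sign of overfitting.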
If the variance of the errors in original, untransformed units is growing over time due to inflation or compound growth, then the best statistic to use for comparisons between the estimation and validation period is mean absolute percentage error, rather than mean squared error or mean absolute error.
Although the model's performance in the validation period is theoretically the best indicator of its forecasting accuracy, especially for time series data, you should be aware that the hold-out sample may not always be highly representative, especially if it is small--say, fewer than 20 observations. If your validation period statistics appear strange or contradictory, you may wish to experiment by changing the number of observations held out. Sometimes the inclusion or exclusion of a few unusual observations can make a big difference in the comparative statistics of different models.
Also, be aware that if you test a large number of models and rigorously rank them on the basis of their validation period statistics, you may end up with just as much "data snooping bias" as if you had only looked at estimation-period statistics--i.e., you may end up picking a model that is more lucky than good! The best defense against this is to choose the simplest and most intuitively plausible model that gives comparatively good results.