BA 411 homework assignment #1

Homework assignment #1 (due Tuesday, September 10)

The objectives of this homework assignment are:

To get some practice in retrieving data from on-line sources and transferring it between different programs.
To become familiar with basic data exploration tools in Statgraphics
To learn the appropriate uses of data transformations such as deflation and logging
To learn to apply simple nonseasonal forecasting models and compare the performance of different models

Before beginning, you should familiarize yourself with the basic Statgraphics operations covered in "Statgraphics for Windows: Tutorial Introduction" Also, pick up the "Datadisk Quick Guide" handout for instructions on using Datadisk and refer to the notes on "Obtaining Economic Data Files from Datadisk", "Tips on Using Statgraphics for Windows," and "Time Series Forecasting in SGWIN" that were posted on the web site and handed out in class.

YOUR MISSION: First, obtain the following series from Datadisk:

Economic Series #85 (consumer price index)
Economic Series #65 (S&P 500 Index)
Economic Series #129 (housing starts)
Retail Series #66 (sales at household appliance/radio/TV/computer stores, seasonally adjusted)

Then for each of the three latter series (S&P500, housing starts, and appliance-store sales) find the best non-seasonal forecasting model from among the following three choices:

Linear trend
Random walk without growth
Random walk with growth

...perhaps in conjunction with an appropriate nonlinear transformation (e.g., logging, deflating with the CPI, or deflating at a fixed rate). You may also wish to consider, for each series, whether it is best to use all the available data (back to 1967) or instead to use only the more recent data (the last 10 or 20 years, or whatever). In all cases, hold out the last (i.e., most recent) 24 months of data to validate your models, and also generate and plot forecasts 24 months into the future.

Step 1: Download data from Datadisk. On the PC Network, Datadisk must be run from the DOS network interface. You can select the DOS network interface upon bootup, or if the computer has already been booted up, you can exit from Windows 3.1 or Windows 95, go the to MS-DOS prompt, then type RESET and choose the DOS network interface. When the dust settles, type DATADISK at the "F>"prompt to run Datadisk. Choose option 3 from the Datadisk main menu to change from one database to another; and choose option 2 to retrieve a series from the current database by number. (Alternatively, you can hit 1 to browse the table of contents, then type in the number of the desired series when you see it, and hit Enter followed by the Space bar).

As you load each series into memory, jot down its series number and its exact description, including the units in which it is measured. You may need to refer to these notes later when assigning names to the variables and writing up your analysis.

After all the series have been loaded into memory, choose option 9 to write them to a disk file. You will first be prompted to enter the letter codes of the series you wish to write to disk--hit A, B, C, etc. until all series have been selected in order. When prompted for the name of the file to write to, don't use the default "TO DATABASE" option. Instead type a descriptive file name such as BA411HW1. (You don't need to type the extension PRN--Datadisk will add this automatically.) After typing the file name and hitting Enter, just hit Enter in response to the next 5 prompts--i.e., use all the default options for file type. After the file is written to disk, you can exit from Datadisk and then either switch to the Windows Network Interface for further analysis or else take your data diskette home to complete the analysis there. If you continue your analysis in the PC lab, I suggest that you reboot the PC rather than using the RESET command to switch to the Windows 3.1 interface.

Step 2: Use Excel to clean up the data file. Follow the instructions in the notes on "How to shovel data around" to import your data file into Excel and clean it up for later use in Statgraphics. First, you will want to edit the variable names in the first row of the file so that they will be valid variable names in Statgraphics--I suggest you use descriptive names such as CPI, SP500, HOUSTART, and APPSALES.

Note: the first line of the file created by Datadisk is a long string of text containing all the variable descriptions, truncated to 8 letters and individually enclosed in quotes. Here's a tricky way to get the variable descriptions aligned with the right columns of data. When you first open the file in Excel, specify that it is delimited with (only) the space as the delimiting character. (Do not specify yet that the quote mark is also a delimiter, contrary to what is recommended in the other Datadisk handout.) When the file is first imported, the long string with the variable names will be in cell A1. Now use the "cut" and "copy" operations to move the contents of cell A1 to cell B1. (Column A will now be empty, and you can delete it.) Next highlight cell B1 by itself, choose the Data/Text-to-Columns command, and specify that the text in cell B1 is delimited (only) with the quote mark (") as the delimiter. The truncated variable names will then be parsed out into the rest of the cells in row 1, above their respective columns of data. You may still need to edit the names to remove embedded spaces or otherwise make them valid and descriptive variable names--and your handwritten notes from step 1 may come in handy here--but this task should now be easier than it would have been otherwise.

Next you should delete all rows in which any variables have missing values--in particular, delete all data rows at the beginning of the file prior to January 1967 and delete all rows at the end where the data runs out, which will be either in mid-1995 or mid-1996, depending on whether the database has been updated by the time you do this. (Note that the missing-value code in Datadisk is -9999.) Save the cleaned-up file as a tab-delimited text file (if you are using SGWIN version 1.4 in the lab) or as an Excel XLS file (if you are using SGWIN version 2.1 at home).

Step 3: Import the data into Statgraphics. Within Statgraphics, simply use the File/Open command to open the file you saved in Excel. Specify the file type (tab-delimited or Excel) and specify that the missing-value code is -9999. To make sure that the file was imported successfully, you should immediately take a look at the Statgraphics data spreadsheet by double-clicking the spreadsheet icon at the bottom of the screen--scroll all the way to the bottom of the spreadsheet to make sure everything looks OK. If all is well, you should then immediately save the file again as a Statgraphics data file (i.e., a *.SF file).

Note: if you see any -9999's on the Statgraphics data spreadsheet, or if your graphs subsequently come out looking like horizontal and vertical lines, then you may have forgotten to specify that -9999 was the missing-value code when you opened the file in Statgraphics. If so, you should try opening the original file again.

Step 4: Exploratory analysis. In Statgraphics, go to the Special/Time_Series menu and begin by using the Descriptive Methods procedure to draw some plots of each series. What kind of trends or growth patterns do you observe? Experiment with transformations of the series such as logging and deflating in order to determine which transformation, if any, seems to be most appropriate for linearizing the growth pattern and stabilizing the variance. You should generate time series plots of your series and transformations thereof, and you may also wish to plot first difference of the (transformed) series and its autocorrelations.

Note: on the input screen for the Descriptive Methods procedure, you are asked to specify the sampling interval, the start date, and the seasonality (number of periods per season). If you have followed the instructions above to obtain data (only) from 1967 onward, you should specify that the interval is monthly, that the start date is 1/67, and the seasonality is 12. To deflate by the CPI, you must divide the input variable by the CPI variable--e.g., you would type Y/CPI as the input variable if Y is the name of the series you want to analyze and CPI is the name of the variable containing the Consumer Price Index. If, on the other hand, you wish to use a log transformation or deflate at a fixed rate, you can do this by using the buttons on the "Analysis Options" panel for the Descriptive Methods procedure. (To get the Analysis Options panel for the Descriptive Methods procedure, click on the Analysis Summary report for this procedure and then hit the right mouse button. Differencing transformations can also be specified on the Analysis Options panel.)

Also, note that when you first enter the Descriptive Methods procedure, as with any Statgraphics procedure, you initially see only a relatively unhelpful Analysis Summary report on the screen. You must use the buttons on the Analysis Window toolbar to select the other reports and graphs that can be displayed. Remember that the first button on the toolbar takes you back to the input screen where you specify the variable to be analyzed, the second button gives you a selection of text reports, and the third button gives you a selection of graphs. In this case, you should display the horizontal and vertical time series plots and the autocorrelation plot. Don't bother with the partial autocorrelation plot or the periodogram plots. (If you are using version 2.1, you can ask the "StatAdvisor" to explain what the other plots do. Single-click on a graph and then click the StatAdvisor icon--the mortarboard in the upper right of the screen. You will then see an explanation of what is shown in the graph.)

One more thing; when the horizontal time series plot is first displayed, it may plot the data with points instead of lines. Click the right mouse button and choose "Pane Options," then specify "Lines" rather than "Points."

Step 5: Comparison of forecasting models. After you have explored the data and reached a tentative decision about what kind of transformation (if any) ought to be used and what kind of forecasting models seem plausible, you should go to the Forecasting procedure on the Special/Time_Series menu. The input screen for this procedure is similar to the input screen in Descriptive methods: you must specify the variable name (divided by CPI, if necessary), sampling interval, and seasonality. Also, here you must specify how many observations at the end of the series are to be held out for validation, and how many forecasts should be extrapolated into the future. For the purposes of this assignment, you should hold out 24 observations for validation and also generate 24 forecasts for the future. (See the notes on "Three kinds of forecasts" for more details on the validation process.)

When you first enter the Forecasting procedure, it fits an assortment of 5 default models: random walk (without growth), mean, linear trend, simple exponential smoothing, and simple moving average with 5 terms. To view the specification of the default models, click the right mouse button and choose "Analysis Options." You will then see the "Model Specification" panel, which includes the same transformations that were available in Descriptive Methods plus many additional fields for specifying elements of a forecasting model. In the upper left are five "radio buttons" (labelled A through E) that control the selection of five models that may be compared side-by-side. By clicking on different radio buttons you can view and change the specifications of the different models. Two of the three models you are supposed to analyze in this assignment are already specified (as models A and C, respectively: the random walk and linear trend models). You should also analyze a random walk model with growth, which you may wish to define as model E. To do this, click the E button and then click the ARIMA model option and the specify that the order of nonseasonal differencing is 1, the orders of all the AR, MA, SAR, and SMA terms are zero (the default setting is AR=1, which you must change to AR=0), and the model does include a constant. Once all of your models have been specified, you should look at the Analysis Summary report for each model, plots of the forecasts and residuals of each model, and the Model Comparison report that compares the estimation-period and validation-period statistics of all models. (Most of the reports and graphs refer to the model whose radio button was pushed last. To change from one model to another, just click the right mouse button, choose "Analysis Options," hit the radio button for the desired model, and click OK. All the reports and graphs will then be redrawn for the chosen model.)

Note: if your model's forecasts into the future do not show up on the forecast plot at first, it may be because the default scaling of the time axis on the graph is wrong. For example, the time axis may only extend to 1994, whereas the data and forecasts may extend farther into the future. (This often happens, alas.) To rectify the situation, click on one of the scale numbers on the time axis of the graph and then hit the right mouse button. A dialog box will then pop up with options for graph scaling, in which you can change the ending year for the time scale.

What to look for in the results: For each of your three time series, your objective is to find the forecasting model which is "best" in the sense of being more accurate and also (hopefully) more intuitively reasonable than the other models. The ideal properties of a "best" model are the following:

It should have the smallest--or nearly the smallest--average errors in the estimation period, as measured by RMSE, MAE, and MAPE. (Don't split hairs, though: a reduction in MAPE from 20 to 15 is a significant improvement in accuracy, but a reduction from 15.0 to 14.9 probably is not, particularly if it comes at the expense of increased model complexity.)

It should also have the smallest errors in the validation period, and ideally the errors in the validation period should be similar in magnitude to those in the estimation period. (This suggests that you have not overfitted the data.) The model's performance in the validation period is the real proof of the pudding.

The plot of its forecasts (extrapolated into the future) should look intuitively reasonable--i.e., it should agree with your "eyeball" extrapolation of the time series.

The plot of the residuals (errors) should look like stationary noise--i.e., no trend , no significant increase in variance from beginning to end, no horrendous outliers, and no significant autocorrelations at any lags

The residuals should pass most or all of the tests for randomness and goodness of fit (these are summarized in the Model Comparison report and more details are provided in the Residual Diagnostics report).

You should be able to explain in plain English how your model works--i.e., what assumptions is it making about patterns and trends in the past data that are expected to reoccur in the future?

What to hand in: Your writeup should include a concise executive summary which explains (a) what you learned about each of the time series from your analysis, (b) why you chose the forecasting models that you did; and (c) what kind of growth your model predicts for each series in the next two years, and the degree of uncertainty surrounding this growth projection, in terms that a layman would understand. Feel free to comment on what you feel might be the underlying causes for the patterns you have observed. Attach or insert the relevant reports and graphs produced by SGWIN. For each of the three time series, you should definitely include the Model Comparison report that compares your various models for it, and as supporting evidence you should also include for each model:

A copy of the Analysis Summary report
A plot of the forecasts superimposed on the original series (in particular, you should make a detailed plot of the most recent five years of data and the extrapolation two years into the future by clicking on the time axis scale and resetting the starting and ending years appropriately)
A plot of the residuals versus time
A plot of the residual autocorrelations

Altogether you should report results for a total of nine models: three fitted to each time series. (Presumably all three models fitted to the same time series will use the same nonlinear transformation--such as logging or deflating--if any was deemed appropriate at the exploratory phase of the analysis. It is not necessary to fit a complete battery of forecasting models to every possible nonlinear transformation of the original data. Try to decide early in your analysis which transformation, if any, is most appropriate, and stick with it.)