Baseline Modeling

Baseline Modeling of Scanner Data

Professor John M. McCann
Fuqua School of Business
Duke University

March 20, 1995

Marketing can be defined as understanding and influencing markets. Marketers use several means to understand markets: personal observations, discussions with participants in the markets, and the analysis and interpretation of information collected from the markets. I have coined the term marketmetrics to describe the latter: understanding markets via measurements and analysis. This book provides an introduction to marketmetrics, with emphasis on the measurements themselves and a visual form of analysis.

Data that flow from the normal operation of a UPC scanner provide a portion of the data that we refer to as retail market data. The scanner gives us information on unit sales and dollar sales during a fixed time period. For instance, the scanner data may be available for weekly intervals because the data in each store were collected and organized on a weekly basis.

Scanner Operation

This section contains a brief description of the operation of a typical scanner installation in a retail store.

Each checkout register is connected to a database that contains information about each UPC-coded item in the store. This information is of two types: static and dynamic. Static information refers to facts about each item carried in the store that do not change, such as its manufacturer and size. Dynamic information refers to facts about the item that vary over time, such as its current price and the number of units that have been sold since the database was last initialized. It is common for a chain to transfer data from each store to a host computer on a regular basis, such as weekly. In this case, we refer to the week as the data reporting interval, or simply as the data interval. When a week's worth of data are transferred, the database is made ready for the next week, i.e., it is initialized.

When a clerk passes an item over a UPC scanner, the item's code is read and passed to the database, where two things happen. First, the item's price is retrieved from the database and passed to the register, where it is used to calculate the shopper's bill. Second, the fact that the item was sold is entered into the database by adding "1" to the total unit sales of the item. The database also holds the total dollar sales of the item, and this total is updated at the same time, so that it too reflects all purchases since the beginning of the data reporting period. For a weekly interval, the unit sales entry in the database would refer to the amount of the item that had been sold since the beginning of the week.
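The scanner operation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any actual scanner system: the names Item, StoreDatabase, scan, and transfer_and_initialize are hypothetical, prices are held in cents to avoid rounding, and we assume that dollar sales grow by the item's current price on each scan.

```python
from dataclasses import dataclass


@dataclass
class Item:
    # Static information: facts that do not change from week to week.
    upc: str
    manufacturer: str
    size: str
    # Dynamic information: facts that vary over time.
    price_cents: int
    units_sold: int = 0          # units since the last initialization
    dollar_sales_cents: int = 0  # dollar sales since the last initialization


class StoreDatabase:
    """Hypothetical in-store database shared by the checkout registers."""

    def __init__(self, items):
        self.items = {item.upc: item for item in items}

    def scan(self, upc):
        """Record one scanned unit and return the price for the register."""
        item = self.items[upc]
        item.units_sold += 1                         # add "1" to unit sales
        item.dollar_sales_cents += item.price_cents  # assumed dollar update
        return item.price_cents

    def transfer_and_initialize(self):
        """Report the interval's totals, then reset them for the next one."""
        report = {
            upc: (item.units_sold, item.dollar_sales_cents)
            for upc, item in self.items.items()
        }
        for item in self.items.values():
            item.units_sold = 0
            item.dollar_sales_cents = 0
        return report
```

Calling scan once per item passed over the scanner, and transfer_and_initialize once at the end of each week, yields one pair of unit sales and dollar sales per item per data interval. Note that if the price changes during the week, the weekly dollar total mixes sales made at different prices.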

Time-Variation of Retail Sales

Understanding the time-variation of retail sales data is an area of concern for almost all marketing managers. What is the item doing over time, and what seems to be driving it? These are the types of questions that frequently occur. The following is a graph of volume versus time (in weeks) for an item in a market. For example, the data could measure the weekly retail unit volume of an item such as 32 ounce Gatorade in a market such as Chicago.

We can see that there is considerable fluctuation in the data, with several spikes. These spikes were caused by retail promotions. Some of them are symmetrical in the sense that a spike's rise and fall occur at about the same rates; others are asymmetrical, with a sharp rise followed by a slow fall. Some of the peaks are higher than others; why are they not all the same? And some do not have a sharp peak, but bounce around after the sharp rise. The spikes seem to rise out of a relatively constant base, i.e., volume is at about the same level between the peaks ... about 4,000 units. But the volume level after the last peak is higher than that after the other peaks. What could cause this deviation from the normal pattern?

We can think of the data in the graph as occurring during two types of time periods: those periods when a promotion was present and those when a promotion was not present. We use the term baseline to refer to sales during the non-promotion periods. We can also think about the baseline extending into the promotion periods, where it represents the level of sales that would have occurred if there had not been a promotion. Since we did not observe such sales levels (because they never actually occurred in the market), we must estimate them with a model or some algorithm.

One simple way is to identify the weeks before and after the promotion, and then connect these points with a horizontal line. This graph shows such connections. It should be obvious from this graph that this method of estimating the baseline is, at best, an approximation. Baseline estimation is a topic of considerable debate, and the debate is not likely to be resolved soon.
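The simple connection method can be sketched as follows. This is a minimal illustration and not a recommended estimation procedure. It assumes that we already know which weeks were promoted (the promoted flags are a hypothetical input, since scanner data alone do not tell us this), and it joins the week before and the week after each promotion with a straight line. That line is horizontal only when the two weeks have the same volume. The function name estimate_baseline is also hypothetical.

```python
def estimate_baseline(volume, promoted):
    """Estimate baseline volume by bridging each run of promoted weeks.

    volume:   weekly unit volumes
    promoted: booleans of the same length, True for promotion weeks
    """
    if len(volume) != len(promoted):
        raise ValueError("volume and promoted must have the same length")

    n = len(volume)
    # In non-promotion weeks the baseline is simply the observed volume.
    baseline = [float(v) for v in volume]

    week = 0
    while week < n:
        if not promoted[week]:
            week += 1
            continue

        # Find the run of consecutive promoted weeks: start .. end - 1.
        start = week
        while week < n and promoted[week]:
            week += 1
        end = week

        before = volume[start - 1] if start > 0 else None
        after = volume[end] if end < n else None
        if before is None and after is None:
            raise ValueError("no non-promotion week to anchor the baseline")
        # At either edge of the data, fall back to the one available anchor.
        if before is None:
            before = after
        if after is None:
            after = before

        # Connect the anchor weeks with a straight line across the run.
        span = end - (start - 1)
        for w in range(start, end):
            fraction = (w - (start - 1)) / span
            baseline[w] = before + fraction * (after - before)

    return baseline
```

For a series that sits near 4,000 units between spikes, the estimated baseline stays near 4,000 through each spike. The weakness of the method is easy to see in the code: the estimate depends entirely on the two anchor weeks, so any unusual volume in the week just before or just after a promotion is carried straight into the baseline.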

We will spend considerable time understanding the nature of these types of data. We will see that some of the questions can be answered by examining related data, while others can only be answered by understanding the level of aggregation in the data. We will come to appreciate the importance of aggregation. In fact, I consider data aggregation to be the single most important barrier to understanding marketing data.