Linearization property

Positivity requirement and choice of base

First difference of natural log ≈ percentage change

The poor man's deflator

Trend measured in natural-log units ≈ percentage growth

Errors measured in natural-log units ≈ percentage errors

Coefficients in log-log regressions ≈ proportional percentage changes

**Linearization property:** The logarithm function has the
defining property that LOG(X*Y) = LOG(X) + LOG(Y)--i.e., the logarithm of a
product equals the sum of the logarithms. Therefore, logging tends to convert *multiplicative*
relationships to *additive* relationships, and it tends to convert *exponential*
(compound growth) trends to *linear* trends. By taking logarithms of
variables which are multiplicatively related and/or growing exponentially over
time, we can often explain their behavior with linear models. For example, here
is a graph of LOG(AUTOSALE). Notice that the log
transformation converts the exponential growth pattern to a linear growth
pattern, and it simultaneously converts the multiplicative
(proportional-variance) seasonal pattern to an additive (constant-variance)
seasonal pattern. (Compare this with the original graph of AUTOSALE.)
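To see the linearization property in numbers, here is a minimal sketch that uses a made-up series (not the AUTOSALE data): 5% compound growth per period multiplied by a hypothetical four-period seasonal factor. The year-over-year changes in the original units keep widening as the level grows, while the same changes in log units are constant.

```python
import math

# Hypothetical series: 5% compound growth per period times a
# multiplicative seasonal factor that repeats every 4 periods.
seasonal = [0.9, 1.0, 1.2, 0.95]
y = [100 * 1.05 ** t * seasonal[t % 4] for t in range(24)]
log_y = [math.log(v) for v in y]

# Lag-4 (year-over-year) changes compare like seasons with like seasons.
abs_changes = [y[t] - y[t - 4] for t in range(4, 24)]
log_changes = [log_y[t] - log_y[t - 4] for t in range(4, 24)]

# In original units the changes grow with the level of the series.
print(min(abs_changes), max(abs_changes))

# In log units every change equals 4 * ln(1.05), up to rounding error.
print(min(log_changes), max(log_changes), 4 * math.log(1.05))
```

The seasonal factor cancels in the lag-4 log difference because the log of a product is the sum of the logs, which is exactly the property described above.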

**Positivity requirement and choice of base:** The logarithm transformation can be applied only to data
which are *strictly positive*--you can't take the log of zero or a
negative number! Also, there are two kinds of logarithms in standard use:
"natural" logarithms and base-10 logarithms. The only difference
between the two is a scaling constant (the base-10 log of X is the natural log of X
divided by the natural log of 10, i.e., multiplied by about 0.4343), which is not really important for
modeling purposes. In Statgraphics, the LOG function is the *natural* log,
and its inverse is the EXP function. (EXP(Y) is the natural logarithm base,
2.718..., raised to the Yth power.) The base-10
logarithm and its inverse are LOG10 and EXP10 in Statgraphics. However, in
Excel and on most calculators,
the natural logarithm function is written as LN instead, and LOG stands instead
for the base-10 logarithm. (Conventions vary: in many programming languages, such as R
and Python, log denotes the natural logarithm, as in Statgraphics.) For example, in Excel the expression =LN(X) computes
the natural log of X.
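As a quick check of the points about bases and positivity, here is a small sketch using Python's standard `math` module, where `math.log` is the natural log and `math.log10` is the base-10 log. The value of `x` is arbitrary.

```python
import math

x = 250.0                  # arbitrary positive value
ln_x = math.log(x)         # natural log (LOG in Statgraphics, LN in Excel)
log10_x = math.log10(x)    # base-10 log (LOG10 in Statgraphics)

# The two differ only by the constant factor ln(10), about 2.3026.
print(ln_x / log10_x, math.log(10))

# exp() undoes the natural log; 10 ** y undoes the base-10 log.
print(math.exp(ln_x), 10 ** log10_x)

# The log of zero (or of a negative number) is undefined.
try:
    math.log(0.0)
except ValueError as err:
    print("log(0) failed:", err)
```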

The **natural logarithm** and its base number *e* have some
magical properties, which you may remember from calculus. For example, the function *e*^X is its own derivative, and the
derivative of LN(X) is 1/X. But for purposes of
business analysis, its great advantage is that *small changes in the natural
log of a variable are directly interpretable as percentage changes in the
original variable*, to a very close approximation. The reason for this is that the graph of
Y = LN(X) passes through the point (1, 0) and has a slope of 1 there,
so it is tangent to the straight line whose equation is Y = X-1 (the dashed line
in the plot below):

This property of the natural log function implies that

**LN(1+r) ≈ r**

when r is much smaller than 1 in magnitude. Why is this important?
Suppose X increases by a small percentage, such as 5%. This means
that it changes from X to X(1+r), where r = 0.05. Now observe:

**LN(X(1+r)) = LN(X) + LN(1+r) ≈ LN(X) + r**

Thus, when X is increased by 5%, i.e., multiplied by a factor of 1.05, the natural
log of X changes from LN(X) to LN(X) + 0.05, to a very close
approximation (the exact increment is LN(1.05) ≈ 0.0488). In other words, increasing X
by 5% is almost exactly equivalent to adding 0.05 to LN(X).
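One consequence worth checking numerically is that the change in the log depends only on r, not on the level of X. The sketch below uses three arbitrary values of X and the 5% increase from the text.

```python
import math

r = 0.05                           # a 5% increase
levels = (10.0, 200.0, 5000.0)     # arbitrary starting values of X

# Change in the natural log when X is multiplied by (1 + r).
changes = [math.log(x * (1 + r)) - math.log(x) for x in levels]

for x, change in zip(levels, changes):
    # Each change is ln(1.05), roughly 0.0488, whatever the level of X.
    print(x, round(change, 4))
```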

From
now on I will refer to changes in natural logarithms as
“diff-logs.” (In
Statgraphics, the diff-log transformation of X is literally DIFF(LOG(X)).) The following table shows the exact
correspondence for percentages in the range from -50% to +100%:

As you can see, percentage changes and
diff-logs are almost exactly the same within the range +/- 5%, and they remain
very close up to +/- 20%. For large
percentage changes they begin to diverge in an asymmetric way. Note that the diff-log that corresponds
to a 50% decrease is ‑0.693 while the diff-log of a 100% increase is
+0.693, exactly the opposite number.
This reflects the fact that a 50% decrease followed by a 100% increase
(or vice versa) takes you back to the same spot.
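The correspondence between percentage changes and diff-logs can be recomputed from the definition, diff-log = LN(1 + r). Here is a rough sketch; the function names `pct_to_difflog` and `difflog_to_pct` are my own labels, and the particular percentages listed are just a convenient selection between -50% and +100%.

```python
import math

def pct_to_difflog(pct):
    """Diff-log corresponding to a percentage change (pct given in percent)."""
    return math.log(1 + pct / 100)

def difflog_to_pct(diff_log):
    """Percentage change (in percent) corresponding to a diff-log."""
    return 100 * (math.exp(diff_log) - 1)

pct_changes = [-50, -20, -10, -5, -1, 0, 1, 5, 10, 20, 50, 100]
print(f"{'% change':>9} {'diff-log':>9}")
for pct in pct_changes:
    print(f"{pct:>+8d}% {pct_to_difflog(pct):>+9.3f}")
```

The two functions are exact inverses of each other, so a forecast made in log units can always be converted back to an exact percentage rather than relying on the approximation.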

**First difference of natural log ≈ percentage change:** When
used in conjunction with differencing, logging converts absolute differences
into relative (i.e., percentage) differences, when these differences are small.
Thus, in Statgraphics, the series DIFF(LOG(Y)) represents the *percentage
change* in Y from period to period. Strictly speaking, the percentage change
in Y at period t is defined as (Y(t)-Y(t-1))/Y(t-1), which is only *approximately*
equal to LOG(Y(t)) - LOG(Y(t-1)), but the approximation is almost exact if
the percentage change is small. In Statgraphics terms, this means that
DIFF(Y)/LAG(Y,1) is virtually identical to DIFF(LOG(Y)). If you don't believe
me, here's a plot of the percent change in auto sales versus the first
difference of its logarithm, zooming in on the last 5 years. The blue and
red lines are virtually indistinguishable except at the highest and lowest
points.
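The same comparison can be made on any short series. Here is a minimal sketch with made-up numbers (not the auto sales data), in which the last observation is a deliberately large jump so that the gap becomes visible.

```python
import math

y = [100.0, 102.0, 101.0, 104.5, 103.0, 125.0]   # hypothetical series

# Counterpart of DIFF(Y)/LAG(Y,1): the exact fractional change.
pct_change = [(y[t] - y[t - 1]) / y[t - 1] for t in range(1, len(y))]

# Counterpart of DIFF(LOG(Y)): the first difference of the natural log.
diff_log = [math.log(y[t]) - math.log(y[t - 1]) for t in range(1, len(y))]

for p, d in zip(pct_change, diff_log):
    print(f"pct={p:+.4f}  diff-log={d:+.4f}  gap={p - d:+.4f}")
```

The gap is tiny for the changes of a few percent and only becomes noticeable for the final jump of about 21%. The percentage change is never smaller than the diff-log, because LN(1+r) never exceeds r.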

If the
situation is one in which the percentage changes are potentially large enough
for this approximation to be inaccurate, it is better to use log units rather
than percentage units, because this takes compounding into account in a
systematic way.


**The poor man's deflator:**
Logging a series often has an effect very similar to deflating: it dampens
exponential growth patterns and reduces heteroscedasticity (i.e., stabilizes
variance). Logging is therefore a "poor man's deflator" which does
not require any external data (or any head-scratching about which price index
to use). Logging is not *exactly* the same as deflating--it does not *eliminate*
an upward trend in the data--but it can straighten the trend out so that it can
be better fitted by a linear model. (Compare the logged auto
sales graph with the deflated auto sales
graph.)

If you're
going to log the data and then fit a model that implicitly or explicitly uses *differencing*
(e.g., a random walk, exponential smoothing, or ARIMA model), then it is
usually redundant to deflate by a price index, as long as the rate of inflation
changes only slowly: the percentage change measured in nominal dollars will be
nearly the same as the percentage change in constant dollars, apart from a roughly
constant offset. Mathematically
speaking, DIFF(LOG(Y/CPI)) = DIFF(LOG(Y)) - DIFF(LOG(CPI)), so the two series
differ only by the period-to-period inflation rate. If inflation is steady, that
amounts to a nearly constant shift plus a very faint amount of noise due to fluctuations
in the inflation rate. To demonstrate this point, here's a graph of the first
difference of logged auto sales, with and without deflation:
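The identity behind this can be verified on a toy example. The sketch below uses invented values for both the nominal series and the price index (with inflation near 3% per period); `diff_log` is a helper name of my own.

```python
import math

# Hypothetical nominal series and price index with roughly 3% inflation.
y = [100.0, 108.0, 115.0, 126.0, 134.0, 147.0]
cpi = [1.000, 1.030, 1.062, 1.093, 1.127, 1.160]

def diff_log(series):
    """First difference of the natural log of a series."""
    return [math.log(b) - math.log(a) for a, b in zip(series, series[1:])]

nominal = diff_log(y)                                  # DIFF(LOG(Y))
real = diff_log([v / p for v, p in zip(y, cpi)])       # DIFF(LOG(Y/CPI))
inflation = diff_log(cpi)                              # DIFF(LOG(CPI))

for n, r, i in zip(nominal, real, inflation):
    # nominal - real equals the inflation rate in every period.
    print(f"nominal={n:+.4f}  real={r:+.4f}  inflation={i:+.4f}")
```

Because the inflation column barely moves from period to period, the nominal and real diff-log series have almost the same shape and differ mainly by a constant shift.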

By logging
*rather* than deflating, you avoid the need to incorporate an *explicit*
forecast of future inflation into the model: you merely lump inflation together
with any other sources of steady compound growth in the original data. Logging
the data before fitting a random walk model yields a so-called **geometric
random walk**--i.e., a random walk with geometric rather than linear growth.
A geometric random walk is the default forecasting model that is commonly used
for stock price data.
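Here is a minimal simulation sketch of a geometric random walk, assuming normally distributed steps in log units. The drift, volatility, starting price, and forecast horizon are all hypothetical values chosen for illustration.

```python
import math
import random

random.seed(0)                  # fixed seed so the run is repeatable
drift, sigma = 0.005, 0.02      # hypothetical per-period drift and volatility (log units)

# Random walk with drift in the logged series...
log_price = [math.log(50.0)]
for _ in range(250):
    log_price.append(log_price[-1] + drift + random.gauss(0.0, sigma))

# ...which is a geometric random walk in the original units.
price = [math.exp(v) for v in log_price]

# k-step-ahead point forecast: extend the drift linearly in log units,
# then exponentiate, which gives compound growth in the original units.
k = 12
forecast = price[-1] * math.exp(drift * k)
print(price[-1], forecast)
```

Note that exponentiating keeps every simulated price and forecast strictly positive, which an ordinary random walk fitted to unlogged prices does not guarantee.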

**Trend measured in natural-log units ≈ percentage growth:** Because changes in the natural logarithm are (almost) equal
to *percentage* changes in the original series, it follows that the slope
of a trend line fitted to logged data is approximately equal to the average *percentage*
growth per period in the original series. For example, in the graph of LOG(AUTOSALE) shown above, if you "eyeball" a
trend line you will see that the magnitude of logged auto sales increases by
about 2.5 (from 1.5 to 4.0) over 25 years, which is an average increase of
about 0.1 per year, i.e., roughly 10% per year. It is much easier to
estimate this trend from the logged graph than from the original unlogged
one! The 10% figure obtained here is *nominal* growth, including
inflation. If we had instead eyeballed a trend line on a plot of logged *deflated*
sales, i.e., LOG(AUTOSALE/CPI), its slope would be the average *real* percentage
growth.

Usually the trend is estimated more precisely by fitting a statistical model
that explicitly includes a local or global trend parameter, such as a linear
trend or random-walk-with-drift or linear exponential smoothing model.
When a model of this kind is fitted in conjunction with a log transformation,
its trend parameter can be interpreted as a percentage growth rate.
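To illustrate how a trend parameter in log units relates to a growth rate, here is a sketch that fits a least-squares line to the log of a synthetic series growing at exactly 10% per year. The series and the helper `ols_slope` are illustrative assumptions, not the AUTOSALE data or any Statgraphics routine.

```python
import math

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

years = list(range(26))
sales = [4.5 * 1.10 ** t for t in years]    # hypothetical 10% compound growth

slope = ols_slope(years, [math.log(s) for s in sales])
growth = math.exp(slope) - 1                # exact compound rate implied by the slope

print(round(slope, 4))     # ln(1.10), about 0.0953 in log units
print(round(growth, 4))    # recovers the 10% compound rate
```

The slope in log units (about 0.095) and the compound rate (0.10) are close but not identical, which is the same approximation discussed earlier: the two agree closely for small rates and drift apart as the rate gets larger.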

**Errors measured in natural-log units ≈ percentage errors:** Another interesting property of the logarithm is that
errors in predicting the logged series can be interpreted as percentage errors
in predicting the original series, albeit the percentages are relative to the
forecast values, not the actual values. (Normally one interprets the
"percentage error" to be the error expressed as a percentage of the
actual value, not the forecast value, although the statistical properties of
percentage errors are usually very similar regardless of whether the
percentages are calculated relative to actual values or forecasts.)

Thus, if
you use least-squares estimation to fit a linear forecasting model to *logged*
data, you are implicitly minimizing mean squared *percentage* error,
rather than mean squared error in the original units--which is probably a good
thing if the log transformation was appropriate in the first place. And if you
look at the error statistics in logged units, you can interpret them as
percentages. For example, the standard deviation of the errors in predicting a
logged series is essentially the standard deviation of the percentage errors in
predicting the original series, and the mean absolute error (MAE) in predicting
a logged series is essentially the mean absolute percentage error (MAPE) in
predicting the original series.
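A small numerical sketch of this correspondence follows, using made-up actual and forecast values with errors of a few percent. The variable names are my own.

```python
import math

actual = [104.0, 98.0, 251.0, 47.5]      # hypothetical actual values
forecast = [100.0, 100.0, 240.0, 50.0]   # hypothetical forecasts

# Errors in log units: LN(actual) - LN(forecast).
log_err = [math.log(a) - math.log(f) for a, f in zip(actual, forecast)]

# Fractional errors relative to the forecast and relative to the actual.
pct_err_vs_forecast = [(a - f) / f for a, f in zip(actual, forecast)]
pct_err_vs_actual = [(a - f) / a for a, f in zip(actual, forecast)]

# MAE in log units versus the usual MAPE (relative to actual values).
mae_log = sum(abs(e) for e in log_err) / len(log_err)
mape = sum(abs(e) for e in pct_err_vs_actual) / len(pct_err_vs_actual)

for le, pf in zip(log_err, pct_err_vs_forecast):
    print(f"log error={le:+.4f}  pct error vs forecast={pf:+.4f}")
print(round(mae_log, 4), round(mape, 4))
```

With errors of this size, the two summary statistics agree to about three decimal places; the agreement would loosen if the errors were tens of percent.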

**Coefficients in log-log regressions ≈ proportional percentage changes:** In many economic situations (particularly
price-demand relationships), the relationship between variables is linear in
terms of *percentage* changes rather
than *absolute* changes. In such cases, applying a natural log or
diff-log transformation to both dependent and independent variables may be
appropriate. The slope coefficient in such a model is then interpreted as the
percentage change in the dependent variable per 1% change in the independent
variable (an elasticity, in economists' terms). This issue will be
discussed in more detail in the regression chapter of these notes.
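As a preview, here is a sketch of a log-log fit on synthetic price-demand data generated with a constant elasticity of -1.5. The data, the constant 1000, and the helper `ols_slope` are all assumptions made for illustration.

```python
import math

def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

price = [1.0, 1.5, 2.0, 2.5, 3.0, 4.0]
demand = [1000 * p ** -1.5 for p in price]   # hypothetical elasticity of -1.5

# Regress LN(demand) on LN(price): the slope is the elasticity.
elasticity = ols_slope([math.log(p) for p in price],
                       [math.log(d) for d in demand])
print(round(elasticity, 4))

# Effect of a 1% price increase on demand, as a fraction.
effect = 1.01 ** elasticity - 1
print(round(effect, 4))    # close to -0.015, i.e., about -1.5%
```

In untransformed units the relationship between price and demand is curved, so a straight-line fit to the raw data would not recover a single coefficient with this interpretation.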

**Statgraphics tip**: In the Forecasting procedure in Statgraphics, the error statistics
shown on the Model Comparison report are all in *untransformed* (i.e.,
original) units to facilitate a comparison among models, regardless of whether
they have used different transformations. (This is a very useful feature
of the Forecasting procedure--in most stat software it is hard to get a
head-to-head comparison of models with and without a log transformation.)
However, whenever a regression model or an ARIMA model is fitted in conjunction
with a log transformation, the standard-error-of-the-estimate or
white-noise-standard-deviation statistics on the Analysis Summary report refer
to the transformed (logged) errors, in which case they are essentially the RMS
*percentage* errors.