National Bureau of Economic Research, Cambridge, MA 02123

There is considerable dissatisfaction with traditional measures of performance evaluation. There is plenty of evidence that the Capital Asset Pricing Model has serious shortcomings. It is also somewhat unsatisfactory to compare investment managers to their peers: a manager could look good just because the group as a whole looks bad.

John Graham and Campbell Harvey have developed some simple new performance metrics based on their 1994 NBER working paper, "Market timing ability and volatility implied in investment newsletters' asset allocation recommendations", their 1996 *Journal of Financial Economics* paper (Volume 42, pp. 397-422), and their 1997 *Financial Analysts Journal* article.

These measures have attracted considerable attention from investment managers and the business press. See Forbes, "The Graham-Harvey Test", June 19, 1995.

ORDER EXCEL VERSION OF PERFORMANCE MEASURES

Note: students and Duke alumni need to send me a separate email verifying their academic status to receive the software.

One approach to performance evaluation is to plot the mean and the standard deviation of each fund using traditional mean-variance analysis. Using this approach, we can assess where each fund lies compared to a benchmark. From the graph, we can tell if a fund is "dominated" by the benchmark (i.e., the fund has a lower return and higher volatility). However, it is more difficult to discern performance if the fund has a lower return and lower volatility. We start our analysis with the following intuition. To make the fund and the benchmark comparable, we can (1) *lever up/down the benchmark to match the fund's volatility (so we can compare apples with apples)* or (2) *lever up/down the fund to match the volatility of the benchmark (here all funds are compared on the same basis).*

The idea of our Graham-Harvey "Measure 1" (GH1) is to lever or unlever the S&P 500 futures to have the exact same volatility as the fund over the evaluation period. GH1 is the difference between the fund return and the return on the volatility-matched futures portfolio. Figure 1 details the geometry of the measure. In the figure, a strategy that unlevers the S&P 500 (by combining the S&P 500 with the Treasury bill to match the volatility of Fund A) has a much higher return than Fund A. Hence, GH1 for Fund A is negative, indicating underperformance. Fund B achieves a higher return than a levered S&P futures position with the same volatility and receives a positive GH1. The intuition is simple. If the investor had a target level of volatility equal to that of Fund A, then the investor would have been much better off holding a fixed-weight combination of S&P 500 futures and Treasury bills than holding the fund (or implementing the fund's portfolio recommendations every month).
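The volatility-matching step behind GH1 can be sketched in a few lines. This is a minimal illustration, not the authors' software: it assumes the benchmark is a fixed-weight mix of a market return series and a Treasury bill return series, uses population moments over the evaluation period, and all function names (`covariance`, `matching_weight`, `gh1`) and the sample returns are hypothetical.

```python
import math
from statistics import mean, pvariance


def covariance(x, y):
    """Population covariance of two equal-length return series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def matching_weight(risky, cash, target_vol):
    """Weight w on `risky` (and 1 - w on `cash`) so the mix has volatility target_vol.

    Solves Var(w*risky + (1 - w)*cash) = target_vol**2, a quadratic in w.
    """
    var_r, var_c = pvariance(risky), pvariance(cash)
    cov_rc = covariance(risky, cash)
    a = var_r + var_c - 2 * cov_rc
    b = 2 * (cov_rc - var_c)
    c0 = var_c - target_vol ** 2
    disc = b * b - 4 * a * c0
    if a <= 0 or disc < 0:
        raise ValueError("target volatility cannot be matched")
    return (-b + math.sqrt(disc)) / (2 * a)  # larger root: long the risky asset


def gh1(fund, market, cash):
    """Fund mean return minus the mean return of the volatility-matched benchmark."""
    w = matching_weight(market, cash, math.sqrt(pvariance(fund)))
    benchmark = [w * m + (1 - w) * c for m, c in zip(market, cash)]
    return mean(fund) - mean(benchmark)


# Hypothetical per-period returns, for illustration only.
market = [0.02, -0.01, 0.03, -0.02, 0.04, 0.00]
cash = [0.004] * 6
# A fund with half the market's exposure plus 0.1% per period of extra return.
fund = [0.5 * m + 0.5 * c + 0.001 for m, c in zip(market, cash)]
print(gh1(fund, market, cash))
```

The sample fund is constructed to beat its volatility-matched benchmark by 0.1% per period, so the printed GH1 is positive. A fund lying below the line through cash and the market would produce a negative value.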

Our Graham-Harvey "Measure 2" (GH2) is related but different. In this measure, we lever up or down the fund's recommended investment strategy (using a Treasury bill), so that the strategy has exactly the same volatility as the S&P 500. Figure 2 shows the geometry of this measure. If Fund A is levered up to achieve the same volatility as the S&P 500 over the evaluation period, it has a lower average return than a simple buy-and-hold in the S&P 500. Hence, the GH2 measure is negative. In contrast, if we lever Fund B downwards (by combining the fund strategy with a cash investment) to achieve the same volatility as the S&P 500, the unlevered fund return is greater than the buy-and-hold S&P 500 and the performance measure is positive. For Fund B, investors would have been better off acting on the fund recommendations compared to a buy-and-hold strategy.
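GH2 reverses the roles: the fund is levered and the market is left alone. The sketch below is an illustration under stated assumptions rather than the authors' implementation. It uses population moments, solves directly for the fund weight that matches the market's realized volatility, and the names `matching_weight` and `gh2` and the sample returns are hypothetical.

```python
import math
from statistics import mean, pvariance


def covariance(x, y):
    """Population covariance of two equal-length return series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def matching_weight(risky, cash, target_vol):
    """Weight w on `risky` (and 1 - w on `cash`) so the mix has volatility target_vol."""
    var_r, var_c = pvariance(risky), pvariance(cash)
    cov_rc = covariance(risky, cash)
    a = var_r + var_c - 2 * cov_rc
    b = 2 * (cov_rc - var_c)
    c0 = var_c - target_vol ** 2
    disc = b * b - 4 * a * c0
    if a <= 0 or disc < 0:
        raise ValueError("target volatility cannot be matched")
    return (-b + math.sqrt(disc)) / (2 * a)  # larger root: long the fund


def gh2(fund, market, cash):
    """Mean return of the fund levered to market volatility, minus the market mean."""
    w = matching_weight(fund, cash, math.sqrt(pvariance(market)))
    levered = [w * f + (1 - w) * c for f, c in zip(fund, cash)]
    return mean(levered) - mean(market)


# Hypothetical per-period returns, for illustration only.
market = [0.02, -0.01, 0.03, -0.02, 0.04, 0.00]
cash = [0.004] * 6
fund = [0.5 * m + 0.5 * c + 0.001 for m, c in zip(market, cash)]
print(gh2(fund, market, cash))
```

Here the fund has half the market's volatility, so it is levered roughly two to one. Its 0.1% per-period edge is doubled by that leverage, which shows up in the GH2 value.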

The measures provide different perspectives. Over the evaluation period, Measure 1 just draws an efficient frontier using the S&P and cash and checks to see if the fund lies above or below this constructed frontier. The volatility matching approach displayed in Figure 1 compares the fund return to that for a volatility-matched benchmark over the exact same sample period. Measure 2 compares all funds to a common level of volatility - the S&P 500 buy-and-hold volatility. All funds are on the same footing with GH2. The only potential disadvantage of GH2 is that it assumes the investor has the ability to lever an investment fund return to have the same volatility as the market.

To see the difference from another perspective, suppose there exists a fund with a purely random strategy that switches between 200% long in the market and 200% short in the market. Also, suppose that the return from this random strategy happens to be one percent above the risk-free rate. If the CAPM beta is zero, then the alpha is 1% and this strategy would be identified as superior. In contrast, GH1 would find a portfolio of S&P 500 futures and cash that has identical realized variance. Because the fund's exposure is always 200% in absolute value, its volatility would likely be about twice that of the market (roughly four times the variance). Hence, the random strategy would be compared to a buy-and-hold portfolio with double the volatility of the market. The random strategy would be a significant underperformer according to the GH1 measure.

Our performance measures focus on long-run performance. We demonstrate that there is a direct link between market timing and long-run performance. In particular, Measure 1 compares the returns on the fund portfolios whose weights change through time with the returns on a constant-weight portfolio with equal volatility. We show that Measure 1 consists of two components, each of which has a direct link to market timing: (i) covariance between equity weights and market returns, and (ii) a factor that penalizes changes in equity weights that do not time the market.

It is best to think of this in terms of market timing. The idea of market timing is to reduce equity exposure before market declines and to increase exposure before market rallies. The successful timer's average return should be greater than the return on the constant-weight portfolio. Indeed, ignoring the cash returns, the following expression should be positive for a successful market timer:

$$E[w_i\, r_m] - E[w_i]\,E[r_m] \qquad (1)$$

where $w_i$ represents fund i's equity weights and $r_m$ is the market equity return. The first term is the average fund performance where $w_i$ is changing through time as recommendations change. The second term represents a return on a constant-weight strategy where the constant is just the average market exposure of the fund. However, (1) is just the definition of the covariance between weights and market returns. A positive covariance defines successful market timing [weights increasing (decreasing) during market rallies (declines)]. By definition, a positive covariance implies that the variable-weight fund strategy has a higher average return than the constant-weight strategy. Thus, the component of Measure 1 that measures the covariance between equity weights and market returns is a direct measure of market timing.
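The timing expression described above (the average of weight times market return, minus the average weight times the average market return) can be checked numerically against the covariance. The sketch below is illustrative only: it uses sample averages in place of expectations, and the function names and the four-period data are hypothetical.

```python
from statistics import mean


def timing_term(weights, market_returns):
    """Average timed return minus the return of the constant-weight strategy."""
    timed = [w * r for w, r in zip(weights, market_returns)]
    return mean(timed) - mean(weights) * mean(market_returns)


def population_covariance(x, y):
    """Population covariance of two equal-length series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


# Hypothetical timer: heavier equity weights ahead of the better market periods.
weights = [0.2, 0.8, 0.5, 1.0]
market_returns = [-0.02, 0.03, 0.00, 0.05]

print(timing_term(weights, market_returns))
print(population_covariance(weights, market_returns))
```

Both lines print the same positive number up to floating-point rounding, since the two quantities are algebraically identical. Reversing the order of the weights makes the covariance, and hence the timing term, negative.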

The second component of Measure 1 penalizes managers for changes in equity weights that do not time the market. To see why this makes sense, notice that the variance of a manager's returns has two sources (when returns and weights are uncorrelated): the variance of equity returns and the variance of the weights. That is, a manager that randomly changes weights induces volatility into its portfolio returns simply by changing the equity weights. The component of Measure 1 that penalizes strategies that are changing weights for the wrong reasons essentially says, "if you are changing weights and given that it is obvious that random weight changes contribute to variance, then you better be changing weights to achieve a higher return, i.e. you better be timing the market."
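The two sources of variance can be written out explicitly. The identity below is a sketch under an assumption slightly stronger than the text's: weights and market returns are taken to be independent (not merely uncorrelated), and cash returns are ignored. The symbols $\mu_w, \sigma_w^2$ (mean and variance of the equity weights) and $\mu_m, \sigma_m^2$ (mean and variance of the market return) are notation introduced here for illustration.

```latex
\operatorname{Var}(w_i r_m)
  = E[w_i^2]\,E[r_m^2] - \bigl(E[w_i]\,E[r_m]\bigr)^2
  = \mu_w^2\,\sigma_m^2 \;+\; \mu_m^2\,\sigma_w^2 \;+\; \sigma_w^2\,\sigma_m^2
```

With constant weights ($\sigma_w^2 = 0$) only the first term survives. The last two terms are the extra variance a manager adds simply by moving the weights around, which is what the second component of Measure 1 penalizes when those moves do not time the market.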

What could cause a fund to change equity weights? Given a level of risk aversion, weights would change if (i) the manager believes there are time-varying expected returns; and/or (ii) the manager believes there are time-varying market volatilities. Let's concentrate on (i). With time-varying expected returns, the manager would increase (decrease) weights when the expected market return is above (below) the average expected return. A random shift in the weight may increase volatility. However, a carefully planned shift in the weights to time the market may not. Further, average returns should increase if the investor has some ability to detect time-varying expected market returns. That is, if you are changing weights in a way that successfully times the market, then it is possible to increase average returns and perhaps even decrease variance.

In terms of the familiar mean-variance analysis, the successful market timer should be above the efficient frontier when he increases market exposure before upturns and decreases market exposure before downturns. Remember that the mean-variance frontier is traditionally drawn with fixed investment weights. So even though changing weights contribute to variance, the positive covariance between the weights and the market returns could actually decrease fund volatility for the successful market timer. Thus, a successful market timer will have a positive value for Measure 1, which indicates that he lies above the efficient frontier. The successful timer will lie above the frontier because (1) there is a positive covariance between his equity weights and the market, and/or (2) he is not unduly penalized for changes in equity weights that do not time the market.

We have applied our ideas to evaluate the strategies of market timing investment newsletters. In principle, our metrics could be applied to any well diversified managed portfolio.

Forbes featured an analysis of our measure applied to 326 mutual funds. Using our Measure 1, Forbes identified a collection of funds that outearn the market and have lower volatility. Also, Forbes' funds dominate those selected by using just the Sharpe ratio. This evidence suggests that performance is persistent. In addition, there appears to be a benefit to using our measures to assess future performance based on past performance.

We propose two measures for evaluating the performance of asset allocation recommendations from investment funds. The idea of our first measure is to compare the fund's return to a portfolio of S&P 500 futures and cash which has the same volatility over the evaluation period. The benchmark has fixed investment weights. Most fund strategies have variable weights. Presumably, if the manager is successfully timing the market (increase weights before market upticks and decrease weights before market downticks), the manager should be able to outperform this passive benchmark.

A second measure volatility-adjusts the fund's strategy. We construct a portfolio of the fund return and a Treasury bill that has exactly the same volatility as the S&P 500. The difference between the returns on the volatility-adjusted strategy and the S&P 500 defines Measure 2.

However, the assumption that M-squared makes about zero correlation between the interest return and the asset being evaluated is only true if the maturity of the cash instrument exactly coincides with the evaluation period. Indeed, it is reasonably well known that there is a negative correlation between interest rate changes and both stock and bond returns. Further, relative to GH2, the assumption could result in misleading inference about the performance of low-volatility funds, where substantial leverage is needed to achieve the S&P 500 volatility. In a sample of well diversified funds, this is not an issue. However, in applying GH2 to a broader class of asset returns, certain issues arise. For example, substantial leverage would have to be employed to lever a money market fund to achieve the volatility of the S&P 500.
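The effect of the zero-correlation assumption can be illustrated numerically. The sketch below is not taken from either paper. It assumes the M-squared leverage is the ratio of market volatility to fund volatility (the usual textbook form, which treats cash as riskless) and compares it with a weight that matches realized volatility when the cash return varies and is negatively correlated with the fund. All names and return series are hypothetical.

```python
import math
from statistics import mean, pstdev, pvariance


def covariance(x, y):
    """Population covariance of two equal-length return series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def matching_weight(risky, cash, target_vol):
    """Weight w on `risky` (and 1 - w on `cash`) so the mix has volatility target_vol."""
    var_r, var_c = pvariance(risky), pvariance(cash)
    cov_rc = covariance(risky, cash)
    a = var_r + var_c - 2 * cov_rc
    b = 2 * (cov_rc - var_c)
    c0 = var_c - target_vol ** 2
    disc = b * b - 4 * a * c0
    if a <= 0 or disc < 0:
        raise ValueError("target volatility cannot be matched")
    return (-b + math.sqrt(disc)) / (2 * a)


def mix_vol(w, risky, cash):
    """Realized volatility of w in `risky` and 1 - w in `cash`."""
    return pstdev([w * r + (1 - w) * c for r, c in zip(risky, cash)])


# Hypothetical returns: cash tends to be high when the fund return is low.
fund = [0.010, -0.004, 0.008, -0.002, 0.006, 0.000]
cash = [0.003, 0.005, 0.003, 0.005, 0.004, 0.004]
market = [0.030, -0.020, 0.025, -0.015, 0.020, -0.010]

target = pstdev(market)
naive_w = target / pstdev(fund)                # leverage if cash were riskless
exact_w = matching_weight(fund, cash, target)  # leverage matching realized volatility

print("target volatility:", target)
print("naive weight:", naive_w, "realized volatility:", mix_vol(naive_w, fund, cash))
print("exact weight:", exact_w, "realized volatility:", mix_vol(exact_w, fund, cash))
```

With borrowing at a rate that is negatively correlated with the fund, the naive leverage overshoots the target volatility, so the levered fund and the market are no longer compared at the same risk level. The gap widens as the required leverage grows, which is the low-volatility case the paragraph above describes.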

A companion video demonstrates the differences between the GH2 and M-squared measures. Depending on the level of correlation, the M-squared measure could be highly misleading.