Super-efficiency: The Nasty, Ugly Little Fact
I just read Steve Stigler’s wonderful article entitled: “The Epic Story of Maximum Likelihood.” I don’t know why I didn’t read this paper earlier. Like all of Steve’s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.
As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher’s “proof” that the maximum likelihood estimator is optimal. The “nasty, ugly little fact” is the problem of super-efficiency.
1. Hodges Example
Let $X_1, \dots, X_n \sim N(\theta, 1)$. The maximum likelihood estimator (mle) is the sample mean $\hat\theta_n = \bar{X}_n = n^{-1}\sum_{i=1}^n X_i$.
We’d like to be able to say that the mle is, in some sense, optimal.
The usual way we teach this is to point out that $\sqrt{n}(\hat\theta_n - \theta) \rightsquigarrow N(0,1)$ and that any other consistent estimator must have a variance which is at least this large (asymptotically).
Hodges’ famous example shows that this is not quite right. Hodges’ estimator is:

$J_n = \bar{X}_n$ if $|\bar{X}_n| \geq n^{-1/4}$, and $J_n = 0$ if $|\bar{X}_n| < n^{-1/4}$.
If $\theta \neq 0$ then eventually $|\bar{X}_n| \geq n^{-1/4}$ and hence $J_n = \bar{X}_n = \hat\theta_n$.

But if $\theta = 0$, then eventually $\bar{X}_n$ is in the window $(-n^{-1/4}, n^{-1/4})$ and hence $J_n = 0$, i.e. it is equal to the true value. Thus, when $\theta \neq 0$, $J_n$ behaves like the mle. But when $\theta = 0$, it is better than the mle.
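This dichotomy is easy to see in a quick simulation. Here is a minimal sketch I have added (the function name `hodges` and the choice of sample size are mine, just for illustration): with $n = 10{,}000$ the window half-width is $n^{-1/4} = 0.1$, while the standard deviation of $\bar{X}_n$ is only $0.01$.

```python
import numpy as np

rng = np.random.default_rng(0)

def hodges(x):
    """Hodges' estimator: the sample mean, snapped to 0 inside a shrinking window."""
    xbar = x.mean()
    return xbar if abs(xbar) >= len(x) ** (-0.25) else 0.0

n = 10_000  # window half-width n^{-1/4} = 0.1

# theta = 1: the sample mean lands far outside the window, so J_n equals the mle.
x = rng.normal(loc=1.0, scale=1.0, size=n)
print(hodges(x) == x.mean())  # True

# theta = 0: the sample mean (sd = 0.01) lands inside the window, so J_n = 0 exactly.
x0 = rng.normal(loc=0.0, scale=1.0, size=n)
print(hodges(x0))  # 0.0
```

At $\theta = 0$ the estimator does not merely get close to the truth; it hits it exactly, with probability tending to one.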
Hence, the mle is not optimal, at least, not in the sense Fisher claimed.
2. Rescuing the mle
Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of $\theta$:
There is a price to pay for the small risk at $\theta = 0$: the risk for values of $\theta$ near 0 is huge. Can we leverage the picture above into a precise statement about optimality?
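The shape of that picture can be checked with a small Monte Carlo sketch (added here; the function name and the sample sizes are mine). For the mle the normalized risk $n \, E_\theta[(\hat\theta_n - \theta)^2]$ equals 1 at every $\theta$; the Hodges estimator dips below 1 at $\theta = 0$ but spikes well above 1 for $\theta$ near 0.

```python
import numpy as np

rng = np.random.default_rng(1)

def hodges_risk(theta, n, reps=20_000):
    """Monte Carlo estimate of the normalized risk n * E[(J_n - theta)^2]."""
    xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
    j = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)  # Hodges' estimator
    return n * np.mean((j - theta) ** 2)

n = 50  # compare against the mle, whose normalized risk is 1 at every theta
for theta in [0.0, 0.1, 0.3, 1.0]:
    print(f"theta = {theta}: risk ~ {hodges_risk(theta, n):.2f}")
```

Running this shows a risk far below 1 at $\theta = 0$, a large hump for $\theta$ around $0.3$, and a risk of about 1 once $\theta$ is far from the window.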
First, if we look at the maximum risk rather than the pointwise risk, then we see that the mle is optimal. Indeed, the mle is the unique estimator that is minimax for all bowl-shaped loss functions. See my earlier post on this.
Second, Le Cam showed that the mle is optimal among all regular estimators. These are estimators whose distribution is not affected by small changes in the parameter. This is known as Le Cam’s convolution theorem because he showed that the limiting distribution of any regular estimator is equal to the distribution of the mle plus (convolved with) another distribution. (There are, of course, regularity assumptions involved.)
Chapter 8 of van der Vaart (1998) is a good reference for these results.
3. Why Do We Care?
The point of all of this was not to rescue the claim that “the mle is optimal” at any cost. Rather, we had a situation where it was intuitively clear that something was true in some sense, but it was difficult to make that sense precise.
Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found here and here.
That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?
Stigler, S. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.
van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.
van der Vaart, Aad. (2002). The statistical work of Lucien Le Cam. Ann. Statist., 30, 631-682.