Super-efficiency: “The Nasty, Ugly Little Fact”


I just read Steve Stigler’s wonderful article entitled “The Epic Story of Maximum Likelihood.” I don’t know why I didn’t read this paper earlier. Like all of Steve’s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.

As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher’s “proof” that the maximum likelihood estimator is optimal. The “nasty, ugly little fact” is the problem of super-efficiency.

1. Hodges Example

Suppose that

\displaystyle  X_1, \ldots, X_n \sim N(\theta,1).

The maximum likelihood estimator (mle) is

\displaystyle  \hat\theta = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.

We’d like to be able to say that the mle is, in some sense, optimal.

The usual way we teach this is to point out that {Var(\hat\theta) = 1/n} and that any other consistent estimator must have a variance at least this large (asymptotically).

Hodges’ famous example shows that this is not quite right. Hodges’ estimator is:

\displaystyle  T_n = \begin{cases} \overline{X}_n & \mbox{if } |\overline{X}_n| \geq \frac{1}{n^{1/4}}\\ 0 & \mbox{if } |\overline{X}_n| < \frac{1}{n^{1/4}}. \end{cases}

If {\theta\neq 0} then eventually {T_n = \overline{X}_n} and hence

\displaystyle  \sqrt{n}(T_n - \theta) \rightsquigarrow N(0,1).

But if {\theta = 0}, then eventually {\overline{X}_n} is in the window {[-n^{-1/4},n^{-1/4}]} and hence {T_n = 0}, i.e., it is equal to the true value (in fact, {\sqrt{n}(T_n - \theta) \rightsquigarrow 0}). Thus, when {\theta \neq 0}, {T_n} behaves like the mle. But when {\theta=0}, it is better than the mle.
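Here is a rough Monte Carlo sketch of this behavior (a minimal illustration; it assumes numpy is available, and the function names are just for this sketch):

```python
# Minimal sketch (numpy assumed): the Hodges estimator T_n matches the mle
# when theta != 0 but beats it at theta = 0.
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    """Hodges' estimator: keep the sample mean unless it lands in the
    shrinking window [-n^(-1/4), n^(-1/4)], in which case return 0."""
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

def scaled_mse(theta, n, reps=100_000):
    """Monte Carlo estimate of n * E[(estimator - theta)^2] for the mle and T_n."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)  # X-bar ~ N(theta, 1/n)
    return n * np.mean((xbar - theta) ** 2), n * np.mean((hodges(xbar, n) - theta) ** 2)

for theta in [1.0, 0.0]:
    for n in [100, 10_000]:
        mse_mle, mse_hodges = scaled_mse(theta, n)
        print(f"theta={theta}, n={n}: n*MSE(mle)={mse_mle:.3f}, n*MSE(Hodges)={mse_hodges:.3f}")

# At theta = 1 both columns sit near 1; at theta = 0 the Hodges column
# collapses toward 0 as n grows -- the super-efficiency.
```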

Hence, the mle is not optimal, at least, not in the sense Fisher claimed.

2. Rescuing the mle

Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of {n}:

[Figure: risk functions of the Hodges estimator for several values of {n} (from Wikipedia).]

There is a price to pay for the small risk at {\theta=0}: the risk for values near 0 is huge. Can we leverage the picture above into a precise statement about optimality?
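Here is a small Monte Carlo sketch (assuming numpy and matplotlib) that reproduces the flavor of that picture by plotting the scaled risk {n\,E_\theta(T_n-\theta)^2} against {\theta} for a few sample sizes:

```python
# Sketch (numpy and matplotlib assumed): scaled risk of the Hodges estimator
# as a function of theta, for several sample sizes, versus the mle's constant
# scaled risk of 1.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def hodges_scaled_risk(theta, n, reps=50_000):
    """Monte Carlo estimate of n * E[(T_n - theta)^2] for Hodges' estimator."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    return n * np.mean((tn - theta) ** 2)

thetas = np.linspace(-1.5, 1.5, 121)
for n in [5, 50, 500]:
    plt.plot(thetas, [hodges_scaled_risk(t, n) for t in thetas], label=f"Hodges, n = {n}")
plt.axhline(1.0, linestyle="--", label="mle (scaled risk = 1)")
plt.xlabel("theta")
plt.ylabel("n * risk")
plt.legend()
plt.show()

# The dip at theta = 0 is paid for by peaks near 0 that grow with n.
```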

First, if we look at the maximum risk rather than the pointwise risk, then we see that the mle is optimal. Indeed, {\overline{X}_n} is the unique estimator that is minimax for all bowl-shaped loss functions. See my earlier post on this.
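To see the maximum-risk comparison numerically, here is a sketch (again Monte Carlo over a grid of {\theta} values, assuming numpy): the scaled maximum risk of the mle stays near 1, while that of the Hodges estimator grows without bound, roughly like {\sqrt{n}}.

```python
# Sketch (numpy assumed): maximum of n * risk over a grid of theta values.
# For the mle this stays near 1; for the Hodges estimator it diverges.
import numpy as np

rng = np.random.default_rng(2)

def scaled_risks(theta, n, reps=20_000):
    """Return (n * MSE of mle, n * MSE of Hodges) at this theta, by simulation."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    return n * np.mean((xbar - theta) ** 2), n * np.mean((tn - theta) ** 2)

thetas = np.linspace(-2.0, 2.0, 201)
for n in [10, 100, 1000, 10_000]:
    risks = np.array([scaled_risks(t, n) for t in thetas])
    print(f"n={n}: n*max-risk(mle)={risks[:, 0].max():.2f}, "
          f"n*max-risk(Hodges)={risks[:, 1].max():.2f}")
```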

Second, Le Cam’s theory shows that the mle is optimal among all regular estimators. These are estimators whose limiting distribution is unaffected by small (order {1/\sqrt{n}}) changes in the parameter. This is the Hájek-Le Cam convolution theorem: the limiting distribution of any regular estimator is equal to the limiting distribution of the mle convolved with some other distribution. (There are, of course, regularity assumptions involved.)
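To see what non-regularity means here, the following sketch (assuming numpy) simulates {\sqrt{n}(T_n - \theta_n)} under local parameters {\theta_n = h/\sqrt{n}}: the mle’s distribution is {N(0,1)} for every {h}, while the distribution of the Hodges estimator piles up near {-h}, so it is affected by these small changes in the parameter.

```python
# Sketch (numpy assumed): the Hodges estimator is not regular at theta = 0.
# Under theta_n = h / sqrt(n), sqrt(n) * (T_n - theta_n) concentrates near -h,
# while the mle's sqrt(n) * (X-bar - theta_n) is N(0, 1) for every h.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100_000, 50_000

for h in [0.0, 1.0, 2.0]:
    theta_n = h / np.sqrt(n)
    xbar = rng.normal(theta_n, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    z_mle = np.sqrt(n) * (xbar - theta_n)
    z_hodges = np.sqrt(n) * (tn - theta_n)
    print(f"h={h}: mle mean={z_mle.mean():+.2f} sd={z_mle.std():.2f}; "
          f"Hodges mean={z_hodges.mean():+.2f} sd={z_hodges.std():.2f}")
```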

Chapter 8 of van der Vaart (1998) is a good reference for these results.

3. Why Do We Care?

The point of all of this was not to rescue the claim that “the mle is optimal” at any cost. Rather, we had a situation where something seemed intuitively true in some sense, but it was hard to make that sense precise.

Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found here and here.

That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?

References

Stigler, S. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.

van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A.W. (2002). The statistical work of Lucien Le Cam. Annals of Statistics, 30, 631-682.

17 Comments

  1. K. Knight
    Posted April 5, 2013 at 1:58 pm

    Not to diminish LeCam’s contributions to this area because they were absolutely fundamental, but I believe the Convolution Theorem is due to Jaroslav Hajek.

  2. Posted April 5, 2013 at 2:54 pm

    I have been working on estimation for the Generalized Laplace distribution (or Variance gamma https://en.wikipedia.org/wiki/Variance-gamma_distribution).
    If \lambda < 0.5 (using the notation from the Wikipedia link), the likelihood of a set of observations X_1, X_2, \ldots, X_n is unbounded in \mu (the location parameter) at the points X_1, X_2, \ldots, X_n. So the mle, unmodified, won’t work for this distribution.

    • Posted April 5, 2013 at 3:19 pm

      Yes, there are many examples where maximum likelihood fails.
      The problem is to say precisely what “optimal” means
      and under what conditions the mle is optimal.

  3. Entsophy
    Posted April 6, 2013 at 7:48 am

    Finding the maximum likelihood estimator is equivalent (for distributions in the exponential family) to maximizing the entropy subject to some constraints, using the method of Lagrange multipliers to handle the constraints. One might object that this only applies to exponential-family distributions, but since that includes all the common distributions where MLE is usually applied without problems, and since MLE seems to run into trouble outside that range, it’s enough to make you wonder whether entropy isn’t the key. Maybe “maximizing entropy” is the relevant optimality requirement for MLEs.

    • Posted April 6, 2013 at 8:30 am

      No. I think LeCam theory
      is the relevant notion of optimality.
      He nailed it.

      • Entsophy
        Posted April 6, 2013 at 10:43 am

        Well, since they’re both theorems, I guess it’s a matter of taste who nailed it. The maximum entropy approach makes sense because what’s really going on is that by maximizing the entropy, you’re maximizing the size of the high-probability manifold of the distribution, thereby creating the greatest possible opportunity for the true value to lie in the high-probability manifold. Since that’s what’s really required to make reliable inferences, it’s all kinds of relevant. That’s all very Bayesian though.

        There’s no need to be either/or, however. Have you considered the possibility that they’re related? How closely connected is the class of distributions with regular estimators to the class of maximum entropy distributions using their sufficient statistics as the estimator? Although they seem completely different, I wouldn’t be surprised if they were connected. The class of maximum entropy distributions and the class of distributions with sufficient statistics seemed completely unrelated until they were proven to be identical.

      • Posted April 6, 2013 at 10:53 am

        The exponential family is a very special case.
        The connections of LeCam theory to exponential families
        have been well understood for a very long time.
        Why not just study LeCam theory? You might find it interesting.

      • Entsophy
        Posted April 6, 2013 at 10:57 am

        I definitely will, after seeing this.

  4. george
    Posted April 6, 2013 at 2:40 pm

    Do you mean “bowl-shaped loss functions”? It’s a bit odd to refer to bowl-shaped estimators.

    • Posted April 6, 2013 at 2:49 pm

      Yes! Thanks

      • george
        Posted April 7, 2013 at 2:23 pm

        Also a bit surprised not to see Stein estimators mentioned… doing better than the MLE, at least in some sense.

      • Posted April 7, 2013 at 2:39 pm

        I was saving that for another day

      • Zach
        Posted April 9, 2013 at 8:28 am

        Looking forward to the Stein estimators post

  5. Keith O'Rourke
    Posted April 9, 2013 at 3:04 pm

    I recall the Neyman-Scott stuff being central in Stephen’s paper – just too well known to be of interest?

    • Posted April 9, 2013 at 5:50 pm

      yes, an interesting example, indeed

      • Posted April 12, 2013 at 7:43 am

        Larry: I had read a previous draft, and it turns out that the couple of pages on Neyman-Scott being the main defeat of Fisher’s likelihood method have been reduced to a few comments in the published version. Apparently Stephen changed his mind, though he still regards Neyman-Scott as the most important practical problem.

One Trackback

  1. By The Steep Price of Sparsity « Normal Deviate on July 27, 2013 at 2:58 pm

    […] plotted the risk function of the Hodges estimator here. The risk of the mle is flat. The large peaks in the risk function of the Hodges estimator are very […]