Super-efficiency: “The Nasty, Ugly Little Fact”


I just read Steve Stigler’s wonderful article entitled “The Epic Story of Maximum Likelihood.” I don’t know why I didn’t read this paper earlier. Like all of Steve’s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.

As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher’s “proof” that the maximum likelihood estimator is optimal. The “nasty, ugly little fact” is the problem of super-efficiency.

1. Hodges Example

Suppose that

\displaystyle  X_1, \ldots, X_n \sim N(\theta,1).

The maximum likelihood estimator (mle) is

\displaystyle  \hat\theta = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.

We’d like to be able to say that the mle is, in some sense, optimal.

The usual way we teach this is to point out that {Var(\hat\theta) = 1/n} and that any other consistent estimator must have a variance at least this large (asymptotically).

Hodges’ famous example shows that this is not quite right. Hodges’ estimator is:

\displaystyle  T_n = \begin{cases} \overline{X}_n & \mbox{if } |\overline{X}_n| \geq \frac{1}{n^{1/4}}\\ 0 & \mbox{if } |\overline{X}_n| < \frac{1}{n^{1/4}}. \end{cases}

If {\theta\neq 0} then eventually {T_n = \overline{X}_n} and hence

\displaystyle  \sqrt{n}(T_n - \theta) \rightsquigarrow N(0,1).

But if {\theta = 0}, then eventually {\overline{X}_n} is in the window {[-n^{-1/4},n^{-1/4}]} and hence {T_n = 0}, i.e., it is equal to the true value (in fact, {\sqrt{n}(T_n - \theta) \rightsquigarrow 0}). Thus, when {\theta \neq 0}, {T_n} behaves like the mle. But when {\theta=0}, it is better than the mle.
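Here is a rough Monte Carlo sketch of this behavior (a minimal illustration; it assumes numpy is available, and the function names are just for this sketch):

```python
# Minimal sketch (numpy assumed): the Hodges estimator T_n matches the mle
# when theta != 0 but beats it at theta = 0.
import numpy as np

rng = np.random.default_rng(0)

def hodges(xbar, n):
    """Hodges' estimator: keep the sample mean unless it lands in the
    shrinking window [-n^(-1/4), n^(-1/4)], in which case return 0."""
    return np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)

def scaled_mse(theta, n, reps=100_000):
    """Monte Carlo estimate of n * E[(estimator - theta)^2] for the mle and T_n."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)  # X-bar ~ N(theta, 1/n)
    return n * np.mean((xbar - theta) ** 2), n * np.mean((hodges(xbar, n) - theta) ** 2)

for theta in [1.0, 0.0]:
    for n in [100, 10_000]:
        mse_mle, mse_hodges = scaled_mse(theta, n)
        print(f"theta={theta}, n={n}: n*MSE(mle)={mse_mle:.3f}, n*MSE(Hodges)={mse_hodges:.3f}")

# At theta = 1 both columns sit near 1; at theta = 0 the Hodges column
# collapses toward 0 as n grows -- the super-efficiency.
```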

Hence, the mle is not optimal, at least, not in the sense Fisher claimed.

2. Rescuing the mle

Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of {n}:

[Figure: risk functions of the Hodges estimator for several values of {n} (from Wikipedia).]

There is a price to pay for the small risk at {\theta=0}: the risk for values near 0 is huge. Can we leverage the picture above into a precise statement about optimality?
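Here is a small Monte Carlo sketch (assuming numpy and matplotlib) that reproduces the flavor of that picture by plotting the scaled risk {n\,E_\theta(T_n-\theta)^2} against {\theta} for a few sample sizes:

```python
# Sketch (numpy and matplotlib assumed): scaled risk of the Hodges estimator
# as a function of theta, for several sample sizes, versus the mle's constant
# scaled risk of 1.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

def hodges_scaled_risk(theta, n, reps=50_000):
    """Monte Carlo estimate of n * E[(T_n - theta)^2] for Hodges' estimator."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    return n * np.mean((tn - theta) ** 2)

thetas = np.linspace(-1.5, 1.5, 121)
for n in [5, 50, 500]:
    plt.plot(thetas, [hodges_scaled_risk(t, n) for t in thetas], label=f"Hodges, n = {n}")
plt.axhline(1.0, linestyle="--", label="mle (scaled risk = 1)")
plt.xlabel("theta")
plt.ylabel("n * risk")
plt.legend()
plt.show()

# The dip at theta = 0 is paid for by peaks near 0 that grow with n.
```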

First, if we look at the maximum risk rather than the pointwise risk, then we see that the mle is optimal. Indeed, {\overline{X}_n} is the unique estimator that is minimax for all bowl-shaped loss functions. See my earlier post on this.
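To see the maximum-risk comparison numerically, here is a sketch (again Monte Carlo over a grid of {\theta} values, assuming numpy): the scaled maximum risk of the mle stays near 1, while that of the Hodges estimator grows without bound, roughly like {\sqrt{n}}.

```python
# Sketch (numpy assumed): maximum of n * risk over a grid of theta values.
# For the mle this stays near 1; for the Hodges estimator it diverges.
import numpy as np

rng = np.random.default_rng(2)

def scaled_risks(theta, n, reps=20_000):
    """Return (n * MSE of mle, n * MSE of Hodges) at this theta, by simulation."""
    xbar = rng.normal(theta, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    return n * np.mean((xbar - theta) ** 2), n * np.mean((tn - theta) ** 2)

thetas = np.linspace(-2.0, 2.0, 201)
for n in [10, 100, 1000, 10_000]:
    risks = np.array([scaled_risks(t, n) for t in thetas])
    print(f"n={n}: n*max-risk(mle)={risks[:, 0].max():.2f}, "
          f"n*max-risk(Hodges)={risks[:, 1].max():.2f}")
```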

Second, Le Cam’s theory shows that the mle is optimal among all regular estimators. These are estimators whose limiting distribution is unaffected by small (order {1/\sqrt{n}}) changes in the parameter. This is the Hájek-Le Cam convolution theorem: the limiting distribution of any regular estimator is equal to the limiting distribution of the mle convolved with some other distribution. (There are, of course, regularity assumptions involved.)
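To see what non-regularity means here, the following sketch (assuming numpy) simulates {\sqrt{n}(T_n - \theta_n)} under local parameters {\theta_n = h/\sqrt{n}}: the mle’s distribution is {N(0,1)} for every {h}, while the distribution of the Hodges estimator piles up near {-h}, so it is affected by these small changes in the parameter.

```python
# Sketch (numpy assumed): the Hodges estimator is not regular at theta = 0.
# Under theta_n = h / sqrt(n), sqrt(n) * (T_n - theta_n) concentrates near -h,
# while the mle's sqrt(n) * (X-bar - theta_n) is N(0, 1) for every h.
import numpy as np

rng = np.random.default_rng(3)
n, reps = 100_000, 50_000

for h in [0.0, 1.0, 2.0]:
    theta_n = h / np.sqrt(n)
    xbar = rng.normal(theta_n, 1.0 / np.sqrt(n), size=reps)
    tn = np.where(np.abs(xbar) >= n ** (-0.25), xbar, 0.0)
    z_mle = np.sqrt(n) * (xbar - theta_n)
    z_hodges = np.sqrt(n) * (tn - theta_n)
    print(f"h={h}: mle mean={z_mle.mean():+.2f} sd={z_mle.std():.2f}; "
          f"Hodges mean={z_hodges.mean():+.2f} sd={z_hodges.std():.2f}")
```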

Chapter 8 of van der Vaart (1998) is a good reference for these results.

3. Why Do We Care?

The point of all of this was not to rescue the claim that “the mle is optimal” at any cost. Rather, we had a situation where something seemed intuitively true in some sense, but it was hard to make that sense precise.

Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found here and here.

That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?

References

Stigler, S. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.

van der Vaart, A.W. (1998). Asymptotic Statistics. Cambridge University Press.

van der Vaart, A.W. (2002). The statistical work of Lucien Le Cam. Annals of Statistics, 30, 631-682.

17 Comments

  1. K. Knight
    Posted April 5, 2013 at 1:58 pm

    Not to diminish LeCam’s contributions to this area because they were absolutely fundamental, but I believe the Convolution Theorem is due to Jaroslav Hajek.

  2. Posted April 5, 2013 at 2:54 pm

    I have been working on estimation for the Generalized Laplace distribution (or Variance gamma https://en.wikipedia.org/wiki/Variance-gamma_distribution).
    If \lambda < 0.5 (using the notation from the Wikipedia link), the likelihood of a set of observations X_1, X_2, \ldots, X_n is unbounded in \mu (the location parameter) at the points X_1, X_2, \ldots, X_n. So the mle, unmodified, won’t work for this distribution.

    • Posted April 5, 2013 at 3:19 pm

      Yes, there are many examples where maximum likelihood fails.
      The problem is to say precisely what “optimal” means
      and under what conditions the mle is optimal.

  3. Entsophy
    Posted April 6, 2013 at 7:48 am

    Finding the maximum likelihood estimator is equivalent (for distributions in the exponential family) to maximizing the entropy subject to some constraints, using the method of Lagrange multipliers to handle the constraints. One might object that this only applies to exponential-family distributions, but since that includes all the common distributions where MLE is usually applied without problems, and since MLE seems to run into trouble outside that range, it’s enough to make you wonder whether entropy isn’t the key. Maybe “maximizing entropy” is the relevant optimality requirement for MLEs.

    • Posted April 6, 2013 at 8:30 am

      No. I think LeCam theory
      is the relevant notion of optimality.
      He nailed it.

      • Entsophy
        Posted April 6, 2013 at 10:43 am

        Well, since they’re both theorems, I guess it’s a matter of taste who nailed it. The maximum entropy approach makes sense because what’s really going on is that by maximizing the entropy, you’re maximizing the size of the high-probability manifold of the distribution, thereby creating the greatest possible opportunity for the true value to lie in the high-probability manifold. Since that’s what’s really required to make reliable inferences, it’s all kinds of relevant. That’s all very Bayesian though.

        There’s no need to be either/or, however. Have you considered the possibility that they’re related? How closely connected is the class of distributions with regular estimators to the class of maximum entropy distributions using their sufficient statistics as the estimator? Although they seem completely different, I wouldn’t be surprised if they were connected. The class of maximum entropy distributions and the class of distributions with sufficient statistics seemed completely unrelated until they were proven to be identical.

      • Posted April 6, 2013 at 10:53 am

        The exponential family is a very special case.
        The connections of LeCam theory to exponential families
        have been well understood for a very long time.
        Why not just study LeCam theory? You might find it interesting.

      • Entsophy
        Posted April 6, 2013 at 10:57 am

        I definitely will, after seeing this.

  4. george
    Posted April 6, 2013 at 2:40 pm

    Do you mean “bowl-shaped loss functions”? It’s a bit odd to refer to bowl-shaped estimators.

    • Posted April 6, 2013 at 2:49 pm

      Yes! Thanks

      • george
        Posted April 7, 2013 at 2:23 pm

        Also a bit surprised not to see Stein estimators mentioned… doing better than the MLE, at least in some sense.

      • Posted April 7, 2013 at 2:39 pm

        I was saving that for another day

      • Zach
        Posted April 9, 2013 at 8:28 am

        Looking forward to the Stein estimators post

  5. Keith O'Rourke
    Posted April 9, 2013 at 3:04 pm

    I recall the Neyman-Scott stuff being central in Stephen’s paper – just too well known to be of interest?

    • Posted April 9, 2013 at 5:50 pm

      yes, an interesting example, indeed

      • Posted April 12, 2013 at 7:43 am

        Larry: I had read a previous draft, and it turns out that the couple of pages on Neyman-Scott being the main defeat of Fisher’s likelihood method have been reduced to a few comments in the published version. Apparently Stephen changed his mind, though he still regards Neyman-Scott as the most important practical problem.

One Trackback

  1. By The Steep Price of Sparsity « Normal Deviate on July 27, 2013 at 2:58 pm

    […] plotted the risk function of the Hodges estimator here. The risk of the mle is flat. The large peaks in the risk function of the Hodges estimator are very […]