Monthly Archives: April 2013

The Perils of Hypothesis Testing … Again

A few months ago I posted about John Ioannidis’ article called “Why Most Published Research Findings Are False.”

Ioannidis is once again making news by publishing a similar article aimed at neuroscientists. This paper is called “Power failure: why small sample size undermines the reliability of neuroscience.” The paper is written by Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafo.

When I discussed the first article, I said that his points were correct but hardly surprising. I thought it was fairly obvious that {P(A|H_0) \neq P(H_0|A)} where {A} is the event that a result is declared significant and {H_0} is the event that the null hypothesis is true. But the fact that the paper had such a big impact made me realize that perhaps I was too optimistic. Apparently, this fact does need to be pointed out.

The new paper has basically the same message although the emphasis is on the dangers of low power. Let us assume that for a fraction of studies {\pi}, the null is actually false. That is {P(H_0) = 1-\pi}. Let {\gamma} be the power. Then the probability of a false discovery, assuming we reject {H_0} when the p-value is less than {\alpha}, is

\displaystyle  P(H_0|A) = \frac{ P(A|H_0) P(H_0)}{ P(A|H_0) P(H_0)+ P(A|H_1) P(H_1)} = \frac{\alpha (1-\pi)}{\alpha (1-\pi)+ \gamma \pi}.

Let us suppose, for the sake of illustration that {\pi = 0.1} (most nulls are true). Then the probability of a false discovery (using {\alpha} = 0.05) looks like this as a function of power:


So indeed, if the power is low, the chance of a false discovery is high. (And things are worse if we include the effects of bias.)

The authors go on to estimate the typical neuroscience studies. They conclude that the typical power is between .08 and .31. I applaud them for trying to come up with some estimate of the typical power but I doubt that the estimate is very reliable.

The paper concludes with a number of sensible recommendations such as: performing power calculations before doing a study, disclosing methods transparently and so on. I wish they had included one more recommendation: focus less on testing and more on estimation.

So, like the first paper, I am left with the feeling that this message, too, is correct, but not surprising. But I guess that these points are not so obvious to many users of statistics. In that case, papers like these serve an important function.

Data Science: The End of Statistics?

Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

Well put.

Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.

Two questions come to mind:

1. Why do statisticians find themselves left out?

2. What can we do about it?

I’d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.

  1. Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like.

    The scientist goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.

    Who would you go to?

    I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.

  2. Generally, speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of size 10 billion. And each data point had dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.

On to question 2. What do we do about it?

Whining won’t help. We can complain that that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumption and so on. No one is listening.

First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.

Second, we need to hire CS people to be on the faculty in statistics department. This won’t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?

Third, statistics needs a separate division at NSF. Simply renaming DMS (Division of Mathematical Sciences) as has been debated, isn’t enough. We need our own pot of money. (I realize this isn’t going to happen.)

To summarize, I don’t really have any ideas. Does anyone?

Super-efficiency: “The Nasty, Ugly Little Fact”

Super-efficiency: The Nasty, Ugly Little Fact

I just read Steve Stigler’s wonderful article entitled: “The Epic Story of Maximum Likelihood.” I don’t know why I didn’t read this paper earlier. Like all of Steve’s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.

As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher’s “proof” that the maximum likelihood estimator is optimal. The “nasty, ugly little fact” is the problem of super-efficiency.

1. Hodges Example

Suppose that

\displaystyle  X_1, \ldots, X_n \sim N(\theta,1).

The maximum likelihood estimator (mle) is

\displaystyle  \hat\theta = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.

We’d like to be able to say that the mle is, in some sense, optimal.

The usual way we teach this, is to point out that {Var(\hat\theta) = 1/n} and that any other consistent estimator must have a variance which is at least this large (asymptotically).

Hodges’ famous example shows that this is not quite right. Hodges’ estimator is:

\displaystyle  T_n = \begin{cases} \overline{X}_n & \mbox{if } |\overline{X}_n| \geq \frac{1}{n^{1/4}}\\ 0 & \mbox{if } |\overline{X}_n| < \frac{1}{n^{1/4}}. \end{cases}

If {\theta\neq 0} then eventually {T_n = \overline{X}_n} and hence

\displaystyle  \sqrt{n}(T_n - \theta) \rightsquigarrow N(0,1).

But if {\theta =0}, then eventually {\overline{X}_n} is in the window {[-n^{-1/4},n^{-1/4}]} and hence {T_n = 0}. i.e. it is equal to the true value. Thus, when {\theta \neq 0}, {T_n} behaves like the mle. But when {\theta=0}, it is better than the mle.

Hence, the mle is not optimal, at least, not in the sense Fisher claimed.

2. Rescuing the mle

Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of {n}:


There is a price to pay for the small risk at {\theta=0}: the risk for values near 0 is huge. Can we leverage the picture above into a precise statement about optimality?

First, if we look at the maximum risk rather than the pointwise risk then we see that the mle is optimal. Indeed, {\overline{X}_n} is the unique estimator that is minimax for all bowl-shaped estimators. See my earlier post on this.

Second, Le Cam showed that the mle is optimal among all regular estimators. These are estimators whose distribution is not affected by small changes in the parameter. This is known as Le Cam’s convolution theorem because he showed that the limiting distribution of any regular estimator is equal to the distribution of the mle plus (convolved with) another distribution. (There are, of course, regularity assumptions involved.)

Chapter 8 of van der Vaart (1998) is a good reference for these results.

3. Why Do We Care?

The idea of all of this, was not to rescue the claim that “the mle is optimal” at any cost. Rather, we had a situation where it was intuitively clear that something was true in some sense but it was difficult to make it precise.

Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found here and here.

That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?


Stigler, S. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.

van der Vaart. (1998). Asymptotic Statistics. Cambridge.

van der Vaart, Aad. (2002). The statistical work of Lucien Le Cam. Ann. Statist., 30, 631-682.