Confidence Intervals for Misbehaved Functionals

Larry Wasserman

Suppose you want to estimate some quantity $\theta$ which is a function of some unknown distribution $P$. In other words, $\theta = T(P)$ where $T$ maps distributions to real numbers. For example, $T(P)$ could denote the median of $P$.

Ideally, a $1-\alpha$ confidence interval $C_n$ based on an iid sample $X_1,\ldots,X_n \sim P$ should satisfy

$$P^n(T(P) \in C_n) \geq 1-\alpha$$

for every distribution $P$. The hard part is finding a non-trivial confidence interval that actually satisfies this condition for every $P$. You can always set $C_n = (-\infty,\infty)$ but that’s an example of a trivial confidence interval.

In some cases (such as the median) it is possible to find a non-trivial confidence interval. In some cases (such as the mean, when the sample space is the whole real line) it is impossible. Today I want to discuss a paper by David Donoho (Donoho 1988) that treats some in-between cases. In these problems, there exist non-trivial, one-sided confidence intervals.
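To make the median case concrete, here is a minimal sketch of the classical order-statistic interval. It is not taken from Donoho's paper; the function name `median_ci` and the Cauchy demo are just illustrative choices. The idea is that the number of observations below the median behaves like a Binomial(n, 1/2) count no matter what the distribution is, so the interval between the k-th smallest and k-th largest observations has coverage at least 1 minus twice the binomial tail probability.

```python
import math
import random


def median_ci(sample, alpha=0.05):
    """Order-statistic interval [X_(k), X_(n-k+1)] for the median."""
    x = sorted(sample)
    n = len(x)
    total = 2 ** n  # cum / total below equals P(Bin(n, 1/2) <= j)
    cum = 0
    k = 0
    for j in range(n + 1):
        cum += math.comb(n, j)
        if cum / total > alpha / 2:
            break
        # Largest k found so far with P(Bin(n, 1/2) <= k - 1) <= alpha / 2.
        k = j + 1
    if k == 0:
        # Sample too small for a non-trivial interval at this level.
        return (-math.inf, math.inf)
    return (x[k - 1], x[n - k])


random.seed(1)
# Standard Cauchy draws: the mean does not exist, but the median is 0.
cauchy = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(25)]
print(median_ci(cauchy))
```

Nothing in the construction uses the shape of the distribution, which is why it works even for the Cauchy, where the mean is hopeless.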

**1. Difficult Functionals**

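Let $\mathcal{P}$ denote all distributions on the real line. A functional is a map $T: \mathcal{P} \rightarrow \mathbb{R}$. Consider the following functional: $T(P)$ is the number of modes of the density $p$ of $P$. (If $P$ does not have a density we can define $T(P) = \lim_{h\rightarrow 0} T(P \star \Phi_h)$ where $P \star \Phi_h$ denotes $P$ convolved with a Normal with mean 0 and standard deviation $h$.)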

Now look at these two plots:

The density $p$ on the left has 2 modes so $T(P) = 2$. The density $q$ on the right has 1,000 modes so $T(Q) = 1000$. Wait! You don’t see 1,000 modes on the second density? The reason is that they are very, very tiny. It’s possible to increase the number of modes drastically without changing the distribution very much. That is, we can find $Q$ arbitrarily close to $P$ but such that $T(Q) > T(P)$. Functionals with this property are very difficult to estimate. In fact, as Donoho proves, no non-trivial two-sided confidence interval exists for such functionals.
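Here is a small numerical sketch of the same phenomenon. The mixture, the wiggle size `eps` and the frequency `freq` are made-up values for illustration, and modes are counted as strict local maxima on a fine grid, which is only a crude stand-in for the true mode count. The wiggly density is the smooth one multiplied by a tiny sine perturbation.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))


def count_modes(density):
    """Count strict local maxima of a density evaluated on a grid."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))


x = np.linspace(-8.0, 8.0, 16001)
h = x[1] - x[0]

# Smooth two-mode density.
p = 0.5 * normal_pdf(x, -2.0, 1.0) + 0.5 * normal_pdf(x, 2.0, 1.0)
p /= np.sum(p) * h

# Same density with a tiny, fast multiplicative wiggle (hypothetical values).
eps, freq = 0.01, 200.0
q = p * (1.0 + eps * np.sin(freq * x))
q /= np.sum(q) * h

# Largest gap between the two distribution functions on the grid.
ks_distance = float(np.max(np.abs(np.cumsum(q - p) * h)))

print("modes of p:", count_modes(p))
print("modes of q:", count_modes(q))
print("sup distance between the CDFs:", ks_distance)
```

The two distribution functions are nearly indistinguishable, yet the mode counts differ by orders of magnitude. No realistic sample size could tell these two apart.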

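More formally, suppose that $T$ takes values in $[0,\infty)$. Then we’ll say that $T$ is difficult if

$$\mathrm{graph}(T) \ \text{is dense in} \ \mathrm{epigraph}(T)$$

where

$$\mathrm{graph}(T) = \bigl\{ (P, T(P)) :\ P \in \mathcal{P} \bigr\}$$

and

$$\mathrm{epigraph}(T) = \bigl\{ (P, t) :\ P \in \mathcal{P},\ t \geq T(P) \bigr\}.$$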

Donoho calls this the *dense graph condition*. Figure 2 of Donoho’s paper explains everything in a nice picture. Basically, if $T$ is difficult, you can increase $T(P)$ by changing $P$ slightly but you can’t decrease it. (Think of the mode example.)

He then proves the following theorem:

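**Theorem:** [Donoho, 1988] Let $T$ be a difficult functional. Let $C_n = C_n(X_1,\ldots,X_n)$. If

$$P^n(\sup C_n < \infty) = 1 \quad \text{for some } P \in \mathcal{P}$$

then

$$\inf_{Q \in \mathcal{P}} Q^n(T(Q) \in C_n) = 0.$$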

In words, if $C_n$ has a non-trivial upper bound for some $P$, then it has coverage probability 0.

**Digression:** Rob Tibshirani and I independently proved a similar result (Tibshirani and Wasserman 1988). We called these bad functionals *sensitive parameters*. At the time I was a graduate student at the University of Toronto and Rob was a brand new faculty member there. Just before our paper went to press, David’s paper came out. We managed to add some references to him when we got the galley proofs. For some reason, the typesetter decided to take every mention of “Donoho” and change it to “Donohue” without consulting us. Thus, our paper has several references to a mysterious person named Donohue. **End Digression.**

Other examples of difficult functionals are norms $\|p^{(k)}\|_q$ of the density’s derivatives, the entropy and the Fisher information $\int (p')^2/p$.
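The Fisher information behaves the same way as the mode count. As a rough numerical sketch, take the usual location form $\int (p')^2/p$, start from a standard Normal (Fisher information 1), and add a tiny fast wiggle. The values of `eps` and `freq` are arbitrary illustrative choices, and the derivative is approximated by finite differences on a grid.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 16001)
h = x[1] - x[0]


def fisher_information(density):
    """Grid approximation of the integral of (p')^2 / p."""
    deriv = np.gradient(density, h)
    return float(np.sum(deriv ** 2 / density) * h)


# Standard Normal density.
p = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# Tiny multiplicative wiggle (hypothetical values), renormalized on the grid.
eps, freq = 0.01, 200.0
q = p * (1.0 + eps * np.sin(freq * x))
q /= np.sum(q) * h

fisher_p = fisher_information(p)
fisher_q = fisher_information(q)
total_variation = float(0.5 * np.sum(np.abs(q - p)) * h)

print("Fisher information of p:", fisher_p)
print("Fisher information of q:", fisher_q)
print("total variation distance:", total_variation)
```

The wiggle moves the distribution by well under one percent in total variation, but its steep slopes add roughly `eps**2 * freq**2 / 2` to the Fisher information. Raising `freq` pushes it as high as you like.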

**2. One Sided Intervals**

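The good news is that we can still say something about the functional $T(P)$. Construct a $1-\alpha$ confidence set $\mathcal{C}_n$ for the distribution function. Let $\ell_n$ be the smallest value of $T(Q)$ as $Q$ varies in $\mathcal{C}_n$. We then have that

$$P^n(T(P) \geq \ell_n) \geq 1-\alpha.$$

(If $P \in \mathcal{C}_n$, which happens with probability at least $1-\alpha$, then $T(P)$ can be no smaller than the minimum of $T$ over $\mathcal{C}_n$.) That is, we get a non-trivial one-sided confidence interval. So we can’t upper bound the number of modes but we can say things like: the 95 percent confidence interval rules out 4 or fewer modes.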
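For the first step, one standard way to get a confidence set for the distribution function is the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality: with probability at least 1 - alpha, the true CDF stays within `sqrt(log(2 / alpha) / (2 n))` of the empirical CDF everywhere. Using DKW here is my assumption for illustration, as the text above does not say which set to use, and the name `dkw_band` is made up.

```python
import bisect
import math
import random


def dkw_band(sample, alpha=0.05):
    """Return the DKW half-width and a list of (x, lower, upper) CDF bounds."""
    x = sorted(sample)
    n = len(x)
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    band = []
    for xi in x:
        # Empirical CDF at xi; bisect_right handles tied observations.
        f_hat = bisect.bisect_right(x, xi) / n
        band.append((xi, max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)))
    return eps, band


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(100)]
eps, band = dkw_band(sample)
print("half-width:", eps)
print("first three band points:", band[:3])
```

The one-sided bound is then the smallest number of modes among all distributions whose CDF stays inside this band. That minimization is the hard, problem-specific part and is not attempted in this sketch.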

What makes this work is that you can’t decrease $T(P)$ without changing $P$ so much that it becomes statistically distinguishable from the original distribution.

Pretty, pretty cool.

**3. Higher Dimensions**

Things get rougher in high dimensions. Even in two dimensions there are problems. Think of a two-dimensional distribution $P$ with two well separated modes. So $T(P) = 2$. Now let $Q$ be identical to $P$ except that we add a very thin ridge connecting the two modes. This turns 2 modes into 1 mode. Then $Q \approx P$ but we have decreased $T$: now $T(Q) = 1$. So in this case, even one-sided inference is not possible.
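The ridge construction can be checked numerically. The sketch below is one hypothetical way to build it: two Gaussian bumps of unequal height, plus a ridge of width `width` along the line joining them, filled in so that the density climbs steadily from the lower peak to the higher one. All the numbers are arbitrary, and modes are counted as strict local maxima on a grid.

```python
import numpy as np


def bump(x, y, cx, s):
    """Isotropic Gaussian density centred at (cx, 0)."""
    return np.exp(-((x - cx) ** 2 + y ** 2) / (2 * s ** 2)) / (2 * np.pi * s ** 2)


def p_density(x, y):
    return 0.4 * bump(x, y, -2.0, 0.5) + 0.6 * bump(x, y, 2.0, 0.5)


def ridge(x, y, width):
    """Thin ridge along y = 0 that fills the valley up to a rising line."""
    x1, x2 = -2.0, 2.0
    h1, h2 = p_density(x1, 0.0), p_density(x2, 0.0)
    line = h1 + (h2 - h1) * (x - x1) / (x2 - x1)
    fill = np.maximum(line - p_density(x, 0.0), 0.0)
    fill = np.where((x >= x1) & (x <= x2), fill, 0.0)
    return fill * np.exp(-y ** 2 / (2 * width ** 2))


def count_modes_2d(z):
    """Count interior grid points strictly above all 8 neighbours."""
    rows, cols = z.shape
    centre = z[1:-1, 1:-1]
    is_max = np.ones(centre.shape, dtype=bool)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if di == 0 and dj == 0:
                continue
            is_max &= centre > z[1 + di:rows - 1 + di, 1 + dj:cols - 1 + dj]
    return int(is_max.sum())


xs = np.linspace(-4.0, 4.0, 161)
ys = np.linspace(-2.0, 2.0, 81)
X, Y = np.meshgrid(xs, ys, indexing="ij")
width = 0.01  # hypothetical ridge width

P = p_density(X, Y)
Q = P + ridge(X, Y, width)  # unnormalized; rescaling does not move the modes

# Mass of the ridge: Gaussian integral in y, Riemann sum in x.
# The total variation distance between P and the renormalized Q is at most this.
xf = np.linspace(-2.0, 2.0, 4001)
added_mass = float(np.sqrt(2 * np.pi) * width * np.sum(ridge(xf, 0.0, width)) * (xf[1] - xf[0]))

print("modes of P:", count_modes_2d(P))
print("modes of Q:", count_modes_2d(Q))
print("mass added by the ridge:", added_mass)
```

Shrinking `width` makes the added mass as small as you like while the mode count stays at 1, so no confidence set around the empirical distribution can rule out the one-mode alternative.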

It might be possible to modify $T$ so that we can get non-trivial confidence intervals. For example, perhaps we can define $T$ in such a way that, in this last example, $Q$ is still considered to have 2 modes. This would be a nice project for a graduate student.

**4. References**

Donoho, D.L. (1988). One-sided inference about functionals of a density. *The Annals of Statistics*, 16, 1390-1420.

Tibshirani, R. and Wasserman, L.A. (1988). Sensitive parameters. *Canadian Journal of Statistics*, 16, 185-192.

## 4 Comments

Thanks for this; what I didn’t have before was the nice intuition for why in 2-d one can’t even get a one-sided confidence interval for the number of modes. Pretty unsettling for cluster analysts who think that modes correspond to clusters…

By the way, off topic but regarding an earlier entry and the discussion of your paper in Mayo’s blog: I am still trying to get my head around what the result on individual sequences actually means. I still somehow think it’s either useless or black magic…

I don’t think it is black magic or useless.

It’s like an oracle inequality.

Here is an analogy: with high probability, least squares gives you a predictor which is close to the best linear predictor. But you are only comparing yourself to the set of linear predictors. They might all be bad.

—LW

OK, I think I got it now. No black magic for sure. Useless only as long as either all participating predictors are rubbish, or where some reasonable assumptions grant that one can do much better than the worst case bound. (In many situations “low assumptions” mean “low quality”…)

Regarding your last point about a different definition of T so that it can be lower-bounded with high confidence:

It would seem that the obvious definition for T would be to consider the minimum number of modes in the set of all 2D densities that are similar to the original distribution P.

This would essentially incorporate Donoho’s result on one-sided CI’s into the definition of the functional.