Confidence Intervals for Misbehaved Functionals
Larry Wasserman
Suppose you want to estimate some quantity $\theta$ which is a function of some unknown distribution $P$. In other words, $\theta = T(P)$ where $T$ maps distributions to real numbers. For example, $T(P)$ could denote the median of $P$.
Ideally, a $1-\alpha$ confidence interval $C_n$ based on an iid sample $X_1,\ldots,X_n \sim P$ should satisfy

$$P^n(T(P) \in C_n) \geq 1-\alpha$$

for every distribution $P$. The hard part is finding a non-trivial confidence interval that actually satisfies this condition for every $P$. You can always set $C_n = (-\infty,\infty)$ but that’s an example of a trivial confidence interval.
In some cases (such as the median) it is possible to find a non-trivial confidence interval. In some cases (such as the mean, when the sample space is the whole real line) it is impossible. Today I want to discuss a paper by David Donoho (Donoho 1988) that treats some in-between cases. In these problems, there exist non-trivial, one-sided confidence intervals.
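To make the median case concrete, here is a minimal sketch of the classical order-statistic interval. It is a standard textbook construction, not something taken from Donoho's paper, and the function name `median_ci` is made up for illustration. The idea is that the number of observations below the median is Binomial(n, 1/2) whatever the distribution is, so the interval between the k-th smallest and k-th largest observations has a coverage guarantee that does not depend on the distribution.

```python
from math import comb


def median_ci(sample, alpha=0.05):
    """Distribution-free confidence interval for the median.

    Returns (lower, upper, guaranteed_coverage), or None when the sample is
    too small to reach level 1 - alpha. The name and return shape are
    illustrative assumptions, not taken from the post.
    """
    xs = sorted(sample)
    n = len(xs)
    # Binomial(n, 1/2) probabilities for the count of observations below the median.
    pmf = [comb(n, k) / 2**n for k in range(n + 1)]
    best = None
    for k in range(1, n // 2 + 1):
        # Coverage bound for the closed interval [X_(k), X_(n-k+1)].
        coverage = sum(pmf[k:n - k + 1])
        if coverage >= 1 - alpha:
            best = (xs[k - 1], xs[n - k], coverage)  # keep the narrowest valid interval
        else:
            break
    return best


if __name__ == "__main__":
    print(median_ci(range(1, 11)))
```

With very small samples the function returns `None`: no pair of order statistics reaches the requested level, so only the trivial interval is available.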
1. Difficult Functionals
Let $\mathcal{P}$ denote all distributions on the real line. A functional is a map $T: \mathcal{P} \rightarrow \mathbb{R}$. Consider the following functional: $T(P)$ is the number of modes of the density $p$ of $P$. (If $P$ does not have a density we can define $T(P) = \lim_{h\to 0} T(P_h)$ where $P_h$ denotes $P$ convolved with a Normal with mean 0 and standard deviation $h$.)
Now look at these two plots:
The density on the left has 2 modes so $T(P) = 2$. The density $q$ on the right has 1,000 modes so $T(Q) = 1000$. Wait! You don’t see 1,000 modes on the second density? The reason is that they are very, very tiny. It’s possible to increase the number of modes drastically without changing the distribution very much. That is, we can find $Q$ arbitrarily close to $P$ but such that $T(Q) > T(P)$. Functionals with this property are very difficult to estimate. In fact, as Donoho proves, no non-trivial two-sided confidence interval exists for such functionals.
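Here is a small numerical sketch of the tiny-modes phenomenon. The perturbation is my own choice for illustration, not the density from the plots: start with a standard Normal density and multiply it by `1 + eps * sin(k * x)`. The result is still a density, its total variation distance from the Normal is at most `eps / 2`, and yet it has hundreds of modes once `eps * k` is large. The function names and default parameters are assumptions.

```python
import math


def normal_pdf(x):
    """Standard Normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def count_modes(values):
    """Count strict interior local maxima of density values on a grid."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


def wiggly_example(eps=0.05, k=200.0, lo=-5.0, hi=5.0, step=0.001):
    """Return (modes of p, modes of q, approximate total variation distance)."""
    n = int(round((hi - lo) / step))
    xs = [lo + i * step for i in range(n + 1)]
    p = [normal_pdf(x) for x in xs]
    # q = p * (1 + eps * sin(k x)) is nonnegative and integrates to 1,
    # because p is even and sin is odd.
    q = [px * (1.0 + eps * math.sin(k * x)) for px, x in zip(p, xs)]
    # Riemann-sum approximation of (1/2) * integral of |p - q|.
    tv = 0.5 * step * sum(abs(a - b) for a, b in zip(p, q))
    return count_modes(p), count_modes(q), tv


if __name__ == "__main__":
    modes_p, modes_q, tv = wiggly_example()
    print(modes_p, modes_q, round(tv, 4))
```

Shrinking `eps` while holding `eps * k` fixed pushes the two distributions closer together and adds more modes at the same time, which is exactly why no data set can rule out a large number of modes.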
More formally, suppose that $T$ takes values in
. Then we’ll say that
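$T$ is difficult if $\mathrm{graph}(T)$ is dense in $\mathrm{epigraph}(T)$, where

$$\mathrm{graph}(T) = \{ (P, T(P)) :\ P \in \mathcal{P} \}$$

and

$$\mathrm{epigraph}(T) = \{ (P, t) :\ P \in \mathcal{P},\ t \geq T(P) \}.$$

Donoho calls this the dense graph condition. Figure 2 of Donoho’s paper explains everything in a nice picture. Basically, if $T$ is difficult, you can increase $T(P)$ by changing $P$ slightly but you can’t decrease it. (Think of the mode example.)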
He then proves the following theorem:
Theorem: [Donoho, 1988] Let $T$ be a difficult functional. Let
. If
then
In words, if the confidence interval $C_n$ has a non-trivial upper bound for some $P$, then it has coverage probability 0.
Digression: Rob Tibshirani and I independently proved a similar result (Tibshirani and Wasserman 1988). We called these bad functionals sensitive parameters. At the time I was a graduate student at the University of Toronto and Rob was a brand new faculty member there. Just before our paper went to press, David’s paper came out. We managed to add some references to him when we got the galley proofs. For some reason, the typesetter decided to take every mention of “Donoho” and change it to “Donohue” without consulting us. Thus, our paper has several references to a mysterious person named Donohue. End Digression.
Other examples of difficult functionals are norms
, the entropy
and the Fisher information $\int (p'(x))^2 / p(x)\, dx$.
2. One Sided Intervals
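The good news is that we can still say something about the functional $T$. Construct a $1-\alpha$ confidence set $\mathcal{B}_n$ for the distribution function. Let $L_n$ be the smallest value of $T(Q)$ as $Q$ varies in $\mathcal{B}_n$. We then have that

$$P^n(T(P) \geq L_n) \geq 1-\alpha \quad \text{for every } P \in \mathcal{P}.$$
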
That is, we get a non-trivial one-sided confidence interval. So we can’t upper bound the number of modes but we can say things like: the 95 percent confidence interval rules out 4 or fewer modes.
What makes this work is that you can’t decrease $T(P)$ without changing $P$ so much that it becomes statistically distinguishable from the original distribution.
Pretty, pretty cool.
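The first step of this recipe, the confidence set for the distribution function, is easy to sketch. The version below uses the Dvoretzky-Kiefer-Wolfowitz inequality to build a uniform band around the empirical distribution function. That choice of band is my assumption for illustration and need not match the construction in Donoho's paper, and the function name `dkw_band` is hypothetical. The second step, minimizing the functional over everything inside the band, is the hard part and is not shown.

```python
import math


def dkw_band(sample, alpha=0.05):
    """Uniform 1 - alpha confidence band for the distribution function.

    Returns (eps, band), where band is a list of (x, lower, upper) at the
    distinct sample points. By the DKW inequality, with probability at least
    1 - alpha the true distribution function lies within eps of the
    empirical one everywhere.
    """
    xs = sorted(sample)
    n = len(xs)
    eps = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
    band = []
    for i, x in enumerate(xs, start=1):
        if i < n and xs[i] == x:
            continue  # tied values: evaluate only at the last occurrence
        f_hat = i / n  # empirical distribution function at x
        band.append((x, max(f_hat - eps, 0.0), min(f_hat + eps, 1.0)))
    return eps, band


if __name__ == "__main__":
    eps, band = dkw_band([0.4, 0.1, 0.9, 0.4])
    print(round(eps, 3), band)
```

Any distribution whose distribution function stays inside the band is a candidate, and the lower confidence limit is the smallest value of the functional over those candidates.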
3. Higher Dimensions
Things get rougher in high dimensions. Even in two dimensions there are problems. Think of a two-dimensional distribution $P$ with two well separated modes. So $T(P) = 2$. Now let $Q$ be identical to $P$ except that we add a very thin ridge connecting the two modes. This turns 2 modes into 1 mode. Then $Q \approx P$ but we have decreased $T$. So in this case, even one-sided inference is not possible.
It might be possible to modify $T$ so that we can get non-trivial confidence intervals. For example, perhaps we can define $T$ in such a way that, in this last example, $Q$ is still considered to have 2 modes. This would be a nice project for a graduate student.
4. References
Donoho, D.L. (1988). One-sided inference about functionals of a density. The Annals of Statistics, 16, 1390-1420.
Tibshirani, R. and Wasserman, L.A. (1988). Sensitive parameters. Canadian Journal of Statistics, 16, 185-192.

4 Comments
Thanks for this; what I didn’t have before was the nice intuition why in 2-d one can’t even get a one-sided confidence interval for the number of modes. Pretty unsettling for cluster analysts who think that modes correspond to clusters…
By the way, off topic but regarding an earlier entry and the discussion of your paper in Mayo’s blog: I still try to get my head around what the result on individual sequences actually means. I still somehow think it’s either useless or black magic…
I don’t think it is black magic or useless.
It’s like an oracle inequality.
Here is an analogy: with high probability,
least squares gives you a predictor which is
close to the best linear predictor.
But you are only comparing yourself
to the set of linear predictors.
They might all be bad.
—LW
OK, I think I got it now. No black magic for sure. Useless only as long as either all participating predictors are rubbish, or where some reasonable assumptions grant that one can do much better than the worst case bound. (In many situations “low assumptions” mean “low quality”…)
Regarding your last point about a different definition of T so that it can be lower-bounded with high confidence:
It would seem that the obvious definition for T would be to consider the minimum number of modes in the set of all 2D densities that are similar to the original distribution P.
This would essentially incorporate Donoho’s result on one-sided CI’s into the definition of the functional.