Confidence Intervals for Misbehaved Functionals

Larry Wasserman

Suppose you want to estimate some quantity {\theta} which is a function of some unknown distribution {P}. In other words, {\theta = T(P)} where {T} maps distributions to real numbers. For example, {T(P)} could denote the median of {P}.

Ideally, a {1-\alpha} confidence interval {C_n} based on an iid sample {Y_1,\ldots, Y_n} should satisfy

\displaystyle  P^n(\theta\in C_n) \geq 1-\alpha

for every distribution {P}. The hard part is finding a non-trivial confidence interval that actually satisfies this condition for every {P}. You can always set {C_n=(-\infty,\infty)} but that's an example of a trivial confidence interval.

In some cases (such as the median) it is possible to find a non-trivial confidence interval. In some cases (such as the mean, when the sample space is the whole real line) it is impossible. Today I want to discuss a paper by David Donoho (Donoho 1988) that discusses some in-between cases. In these problems, there exist non-trivial, one-sided confidence intervals.

1. Difficult Functionals

Let {{\cal P}} denote all distributions on the real line. A functional is a map {T:{\cal P}\rightarrow \mathbb{R}}. Consider the following functional: {T(P)} is the number of modes of the density {p} of {P}. (If {P} does not have a density we can define {T(P) = \lim_{h\rightarrow 0} T(P\star \Phi_h)} where {P\star \Phi_h} denotes {P} convolved with a Normal with mean 0 and standard deviation {h}.)

Now look at these two plots:

The density {p} on the left has 2 modes so {T(P)=2}. The density {q} on the right has 1,000 modes so {T(Q)=1,000}. Wait! You don't see 1,000 modes on the second density? The reason is that they are very, very tiny. It's possible to increase the number of modes drastically without changing the distribution very much. That is, we can find {Q} arbitrarily close to {P} but such that {T(Q) > T(P)}. Functionals with this property are very difficult to estimate. In fact, as Donoho proves, no non-trivial two-sided confidence interval exists for such functionals.
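The picture is easy to reproduce numerically. The sketch below (all constants are illustrative) modulates a bimodal density with a thousand tiny cosine wiggles; several hundred of them survive as genuine local maxima, yet the Kolmogorov-Smirnov distance between the two CDFs stays negligible.

```python
import numpy as np

grid = np.linspace(-8, 8, 20001)
dx = grid[1] - grid[0]

def count_modes(v):
    """Count strict local maxima on a grid."""
    return int(np.sum((v[1:-1] > v[:-2]) & (v[1:-1] > v[2:])))

# p: equal mixture of N(-2,1) and N(2,1) -- two modes
p = 0.5 * (np.exp(-0.5 * (grid + 2)**2) +
           np.exp(-0.5 * (grid - 2)**2)) / np.sqrt(2 * np.pi)

# q: p modulated by 1000 tiny cosine wiggles, then renormalized
eps = 0.01
q = p * (1 + eps * np.cos(2 * np.pi * 1000 * (grid + 8) / 16))
q /= q.sum() * dx

# sup-distance between the two (Riemann-sum) CDFs
ks = np.max(np.abs(np.cumsum(p) - np.cumsum(q))) * dx
print(count_modes(p), count_modes(q), ks)  # 2 vs several hundred; ks is tiny
```

The wiggles create a new local maximum wherever the oscillation is steeper than the log-density, so the mode count explodes while the distribution barely moves.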

More formally, suppose that {T} takes values in {{\cal T}\subset \mathbb{R}}. Then we’ll say that {T} is difficult if

\displaystyle  {\rm graph}(T)\ {\rm is\ dense\ in\ }{\rm epigraph}(T)

where

\displaystyle  {\rm graph}(T) = \{ (P,T(P)):\ P\in {\cal P}\}

and

\displaystyle  {\rm epigraph}(T) = \{ (P,t):\ P\in {\cal P}\ {\rm and}\ t \geq T(P)\}.

Donoho calls this the dense graph condition. Figure 2 of Donoho’s paper explains everything in a nice picture. Basically, if {T} is difficult, you can increase {T(P)} by changing {P} slightly but you can’t decrease it. (Think of the mode example.)

He then proves the following theorem:

Theorem: [Donoho, 1988] Let {T} be a difficult functional and let {B= \sup \{t: t\in {\cal T}\}}. If

\displaystyle  \sup_P P^n(B\notin C_n)=1

then

\displaystyle  \inf_P P^n(T(P)\in C_n) =0.

In words, if {C_n} has a non-trivial upper bound for some {P}, then it has coverage probability 0 in the worst case over {P}.

Digression: Rob Tibshirani and I independently proved a similar result (Tibshirani and Wasserman 1988). We called these bad functionals "sensitive parameters." At the time I was a graduate student at the University of Toronto and Rob was a brand new faculty member there. Just before our paper went to press, David's paper came out. We managed to add a reference to him when we got the galley proofs. For some reason, the typesetter decided to take every mention of "Donoho" and change it to "Donohue" without consulting us. Thus, our paper has several references to a mysterious person named Donohue. End Digression.

Other examples of difficult functionals are the {L_q} norms of derivatives {T(P)=\left(\int |p^{(k)}|^q \right)^{1/q}}, the negative entropy {T(P) = \int p \log p} and the Fisher information {T(P) = \int (p')^2/p}.

2. One Sided Intervals

The good news is that we can still say something about the functional {T(P)}. Construct a {1-\alpha} confidence set {A_n} for the distribution function. Let {c_n} be the smallest value of {T(P)} as {P} varies over {A_n}. We then have that

\displaystyle  \inf_P P^n( T(P) \in [c_n,\infty)) \geq 1-\alpha.

That is, we get a non-trivial one-sided confidence interval. So we can’t upper bound the number of modes but we can say things like: the 95 percent confidence interval rules out 4 or fewer modes.
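The first step can be sketched concretely. One standard choice of {A_n} is a band around the empirical CDF given by the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality; the code below builds that band (minimizing {T} over the band to get {c_n} is a separate, functional-specific optimization, omitted here).

```python
import numpy as np

def dkw_band(sample, alpha=0.05):
    """1 - alpha confidence band for the CDF via the DKW inequality:
    A_n is the set of distributions whose CDF stays within eps of the
    empirical CDF. (A fully rigorous band also tracks the left limits
    (i-1)/n at each jump; this sketch uses the values i/n at the order
    statistics.)"""
    x = np.sort(np.asarray(sample))
    n = len(x)
    eps = np.sqrt(np.log(2.0 / alpha) / (2.0 * n))
    Fhat = np.arange(1, n + 1) / n          # empirical CDF at the order statistics
    lower = np.clip(Fhat - eps, 0.0, 1.0)
    upper = np.clip(Fhat + eps, 0.0, 1.0)
    return x, lower, upper

rng = np.random.default_rng(1)
x, lo, hi = dkw_band(rng.normal(size=200))
# c_n would then be inf{T(P): the CDF of P fits between lo and hi}
```

Since the band contains the true distribution function with probability at least {1-\alpha}, and {T(P)\geq c_n} whenever {P\in A_n}, the one-sided interval {[c_n,\infty)} inherits the coverage.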

What makes this work is that you can’t decrease {T(P)} without changing {P} so much that it becomes statistically distinguishable from the original distribution.

Pretty, pretty cool.

3. Higher Dimensions

Things get rougher in high dimensions. Even in two dimensions there are problems. Think of a two-dimensional distribution {P} with two well separated modes, so {T(P)=2}. Now let {Q} be identical to {P} except that we add a very thin ridge connecting the two modes. This turns 2 modes into 1 mode. Then {Q\approx P} but {T(Q) < T(P)}: we have decreased the functional. So in this case, even one-sided inference is not possible.
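This two-dimensional obstruction can also be checked numerically. The sketch below (grids, widths, and heights are all illustrative) builds a two-peaked density on a grid and grafts on a thin ridge (via a pointwise max, for simplicity, rather than literal addition) whose crest rises monotonically toward the higher peak; the ridge carries only a few percent of the mass but destroys the lower mode.

```python
import numpy as np

# Grid over [-4, 4] x [-2, 2]
xs = np.linspace(-4, 4, 401)
ys = np.linspace(-2, 2, 801)
X, Y = np.meshgrid(xs, ys, indexing="ij")

def bump(x0, y0, s=0.5):
    return np.exp(-((X - x0)**2 + (Y - y0)**2) / (2 * s**2))

# p: two well-separated peaks of unequal height -> 2 modes
p = 0.9 * bump(-2, 0) + 1.0 * bump(2, 0)

# q: thin ridge along y = 0 whose crest rises monotonically toward the
# higher peak; one can then climb from the lower peak to the higher one,
# so the lower peak stops being a local maximum
j0 = ys.size // 2                          # index of y = 0
crest = np.maximum.accumulate(p[:, j0])    # running max of p(x, 0) in x
crest[np.argmax(p[:, j0]) + 1:] = 0.0      # ridge stops at the higher peak
q = np.maximum(p, crest[:, None] * np.exp(-Y**2 / (2 * 0.01**2)))

def count_modes_2d(z):
    """Strict local maxima over the 4-neighborhood of interior grid points."""
    c = z[1:-1, 1:-1]
    return int(np.sum((c > z[:-2, 1:-1]) & (c > z[2:, 1:-1]) &
                      (c > z[1:-1, :-2]) & (c > z[1:-1, 2:])))

dx, dy = xs[1] - xs[0], ys[1] - ys[0]
extra_mass = np.sum(q - p) * dx * dy       # mass the ridge adds
print(count_modes_2d(p), count_modes_2d(q), extra_mass)  # 2 -> 1, small extra mass
```

The key difference from one dimension: in 2D you can connect two modes with a set of arbitrarily small measure, so small perturbations can decrease the mode count as well as increase it.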

It might be possible to modify {T} so that we can get non-trivial confidence intervals. For example, perhaps we can define {T} in such a way so that, in this last example, {Q} is still considered to have 2 modes. This would be a nice project for a graduate student.

4. References

Donoho, D.L. (1988). One-sided inference about functionals of a density. The Annals of Statistics, 16, 1390-1420.

Tibshirani, R. and Wasserman, L.A. (1988). Sensitive parameters. Canadian Journal of Statistics, 16, 185-192.


  1. Christian Hennig
    Posted August 13, 2012 at 2:06 pm | Permalink

Thanks for this; what I didn't have before was the nice intuition why in 2-d one can't even get a one-sided confidence interval for the number of modes. Pretty unsettling for cluster analysts who think that modes correspond to clusters…

    By the way, off topic but regarding an earlier entry and the discussion of your paper in Mayo’s blog: I still try to get my head around what the result on individual sequences actually means. I still somehow think it’s either useless or black magic…

    • Posted August 13, 2012 at 3:25 pm | Permalink

      I don’t think it is black magic or useless.
      It’s like an oracle inequality.
      Here is an analogy: with high probability,
      least squares gives you a predictor which is
      close to the best linear predictor.
      But you are only comparing yourself
      to the set of linear predictors.
      They might all be bad.

      • Christian Hennig
        Posted August 15, 2012 at 1:23 pm | Permalink

        OK, I think I got it now. No black magic for sure. Useless only as long as either all participating predictors are rubbish, or where some reasonable assumptions grant that one can do much better than the worst case bound. (In many situations “low assumptions” mean “low quality”…)

  2. Martin Azizyan
    Posted August 20, 2012 at 1:36 am | Permalink

    Regarding your last point about a different definition of T so that it can be lower-bounded with high confidence:

    It would seem that the obvious definition for T would be to consider the minimum number of modes in the set of all 2D densities that are similar to the original distribution P.

    This would essentially incorporate Donoho’s result on one-sided CI’s into the definition of the functional.
