Statistics and Dr. Strangelove

One of the biggest embarrassments in statistics is that we don’t really have confidence bands for nonparametric function estimation. This is a fact that we tend to sweep under the rug.

Consider, for example, estimating a density {p(x)} from a sample {X_1,\ldots, X_n}. The kernel estimator with kernel {K} and bandwidth {h} is

\displaystyle  \hat p(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-X_i}{h}\right).
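
To make this concrete, here is a minimal sketch in Python. The Gaussian kernel, the simulated data, and the bandwidth h=0.3 are illustrative choices, not recommendations:

import numpy as np

def kernel_density(x, data, h):
    # hat p(x) = (1/n) sum_i K((x - X_i)/h) / h, with a Gaussian kernel K
    u = (x - data) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean() / h

rng = np.random.default_rng(0)
X = rng.normal(size=500)   # simulated sample X_1, ..., X_n
print(kernel_density(0.0, X, h=0.3))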

Let’s start with getting a confidence interval at a single point {x}.

Let {\overline{p}(x) = \mathbb{E}(\hat p(x))} be the mean of the estimator and let {s_n(x)} be its standard deviation. (A consistent estimator of {s_n(x)} is {a/\sqrt{n}} where {a} is the sample standard deviation of {U_1,\ldots, U_n} and {U_i = h^{-1} K((x-X_i)/h)}.)

Now,

\displaystyle  \frac{\hat{p}(x) - p(x)}{s_n(x)} = \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} + \frac{\overline{p}(x)-p(x)}{s_n(x)} = \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} + \frac{b(x)}{s_n(x)}

where {b(x) = \overline{p}(x)-p(x)} is the bias. By the central limit theorem, the first term is approximately standard Normal,

\displaystyle  \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} \approx N(0,1).

This suggests the confidence interval, {\hat p(x) \pm z_{\alpha/2} s_n(x)}.
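
In code, the interval might look like the following sketch (again with a Gaussian kernel and simulated data; as the discussion below makes clear, its coverage is really for the smoothed density rather than for {p(x)}):

import numpy as np
from scipy.stats import norm

def pointwise_ci(x, data, h, alpha=0.05):
    # U_i = K((x - X_i)/h) / h; hat p(x) is their sample mean and
    # s_n(x) is estimated by their sample sd divided by sqrt(n)
    n = len(data)
    U = norm.pdf((x - data) / h) / h
    p_hat = U.mean()
    s_n = U.std(ddof=1) / np.sqrt(n)
    z = norm.ppf(1 - alpha / 2)
    return p_hat - z * s_n, p_hat + z * s_n

rng = np.random.default_rng(0)
X = rng.normal(size=500)
print(pointwise_ci(0.0, X, h=0.3))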

The problem is the second term {\frac{b(x)}{s_n(x)}}. Recall that the optimal bandwidth {h} balances the bias and the variance. This means that if we use the best bandwidth, then {\frac{b(x)}{s_n(x)}} behaves like a constant, that is, {\frac{b(x)}{s_n(x)}\rightarrow c} for some (unknown) constant {c}. The result is that, even as {n\rightarrow\infty}, the confidence interval is centered at {p(x) + c} instead of {p(x)}. As a result, the coverage will be less than the intended coverage {1-\alpha}.

We could in principle solve this problem by undersmoothing. If we choose a bandwidth smaller than the optimal bandwidth, then {\frac{b(x)}{s_n(x)}\rightarrow 0} and the problem disappears. But this creates two new problems. First, the confidence interval is now centered around a non-optimal estimator. Second, and more importantly, we don’t have any practical, data-driven methods for choosing an undersmoothed bandwidth. There are some proposals in the theoretical literature but they are, frankly, not very practical.

Of course, all of these problems only get worse when we consider simultaneous confidence bands for the whole function.

The same problem arises if we use Bayesian inference. In short, the 95 percent Bayesian interval will not have 95 percent frequentist coverage. The reasons are the same as for the frequentist interval. (One can of course “declare” that coverage is not important and then there is no problem. But I am assuming that we do want correct frequentist coverage.)

The only good solution I know is to simply admit that we can’t create a confidence interval for {p(x)}. Instead, we simply interpret {\hat p(x) \pm z_{\alpha/2} s_n(x)} as a confidence interval for {\overline{p}(x)}, which is a smoothed out (biased) version of {p(x)}:

\displaystyle  \overline{p}(x) = \int K(u) p(x+hu) \,du \approx p(x) + C h^2

where {C} is an unknown constant. In other words, we just live with the bias. (Of course, the bias gets smaller as the sample size gets larger and {h} gets smaller.)
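
(To see where the {Ch^2} term comes from, assume {p} is twice differentiable and {K} is a symmetric kernel with {\int u K(u)\,du = 0}, and Taylor expand {p(x+hu)} inside the integral:

\displaystyle  \int K(u)\, p(x+hu)\, du = p(x) + \frac{h^2}{2} p''(x) \int u^2 K(u)\, du + o(h^2),

so {C = \frac{1}{2} p''(x) \int u^2 K(u)\, du}, which, strictly speaking, depends on {x}.)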

Once we take this point of view, we can easily get a confidence band using the bootstrap. We draw {X_1^*,\ldots, X_n^*} from the empirical distribution {P_n} and compute the density estimator {\hat p^*}. By repeating this many times, we can find the quantile {z_\alpha} defined by

\displaystyle  P\Biggl( \sqrt{nh}\,||\hat p^*-\hat p||_\infty > z_\alpha \Biggm| X_1,\ldots, X_n\Biggr) = \alpha.

The confidence band is then {(\ell(x),u(x))} where

\displaystyle  \ell(x) = \hat p(x) - \frac{z_\alpha}{\sqrt{nh}},\ \ \ u(x) = \hat p(x) + \frac{z_\alpha}{\sqrt{nh}}.

We then have that

\displaystyle  P\Biggl( \ell(x) \leq \overline{p}(x) \leq u(x)\ \ \ {\rm for\ all\ }x\Biggr)\rightarrow 1-\alpha

as {n\rightarrow \infty}. Again, this is a confidence band for {\overline{p}(x)} rather than for {p(x)}.
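
Here is a sketch of this bootstrap band in Python, approximating the supremum over {x} by a maximum over a finite grid. The kernel, bandwidth, grid, and number of bootstrap replications are all illustrative choices:

import numpy as np
from scipy.stats import norm

def kde(grid, data, h):
    # hat p on a grid: (1/n) sum_i K((x - X_i)/h) / h, Gaussian kernel
    return norm.pdf((grid[:, None] - data[None, :]) / h).mean(axis=1) / h

def bootstrap_band(data, h, grid, B=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    p_hat = kde(grid, data, h)
    # sqrt(nh) * sup_x |hat p* - hat p| for each bootstrap sample
    stats = np.empty(B)
    for b in range(B):
        boot = rng.choice(data, size=n, replace=True)   # draw from P_n
        stats[b] = np.sqrt(n * h) * np.max(np.abs(kde(grid, boot, h) - p_hat))
    z = np.quantile(stats, 1 - alpha)   # upper-alpha quantile z_alpha
    return p_hat - z / np.sqrt(n * h), p_hat + z / np.sqrt(n * h)

rng = np.random.default_rng(1)
X = rng.normal(size=500)
grid = np.linspace(-3, 3, 200)
lower, upper = bootstrap_band(X, h=0.3, grid=grid)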

Summary: There really is no good, practical method for getting true confidence bands in nonparametric function estimation. The reason is that the bias does not disappear fast enough as the sample size increases. The solution, I believe, is to just acknowledge that the confidence bands have a slight bias. Then we can use the bootstrap (or other methods) to compute the band. Like Dr. Strangelove, we just learn to stop worrying and love the bias.

The JSM, Minimaxity and the Language Police

I am back from the JSM. For those who don’t know, the JSM is the largest statistical meeting in the world. This year there were nearly 6,000 people.

Some people hate the JSM because it is too large. I love JSM. There is so much going on: lots of talks, receptions, tutorials and, of course, socializing. It’s impossible to walk 10 feet in any direction without bumping into another old friend.

1. Highlights

On Sunday I went to a session dedicated to the 70th birthday of my colleague Steve Fienberg. This was a great celebration of a good friend and remarkable statistician. Steve has more energy at 70 than I did when I was 20.

On a sadder note, I went to a memorial session for my late friend George Casella. Ed George, Bill Strawderman, Roger Berger, Jim Berger and Marty Wells talked about George’s many contributions and his huge impact on statistics. To borrow Ed’s catchphrase: what a good guy.

On Tuesday, I went to Judea Pearl’s medallion lecture, with discussions by Jamie Robins and Eric Tchetgen Tchetgen. Judea gave an unusual talk, mixing philosophy, metaphors (“eagles and snakes can’t build microscopes”) and math. Judea likes to argue that graphical models/structural equation models are the best way to view causation. Jamie and Eric argued that graphs can hide certain assumptions and that counterfactuals need to be used in addition to graphs.

Digression. There is another aspect of causation that deserves mention. It’s one thing to write down a general model to describe causation (using graphs or counterfactuals). It’s another matter altogether to construct estimators for the causal effects. Jamie Robins is well-known for his theory of “G-estimation”. At the risk of gross oversimplification, this is an approach to sequential causal modeling that eventually leads to semiparametric efficient estimators. The construction of these estimators is highly non-trivial. You can’t get them by just writing down a density for the graph and estimating the distribution. Nor can you just throw down some parametric models for the densities in the factorization implied by the graph. If you do, you will end up with a model in which there is no parameter that equals 0 when there is no causal effect. I call this the “null paradox.” And Jamie’s estimators are useful. Indeed, on Monday, there was a session called “Causal Inference in Observational Studies with Time-Varying Treatments.” All the talks were concerned with applying Jamie’s methods to develop estimators in real problems such as organ transplantation studies. The point is this: when it comes to causation, just writing down a model using graphs or counterfactuals is only a small part of the story. Finding estimators is often much more challenging.
End Digression.

On Wednesday, I had the honor of giving the Rietz Lecture (named after Henry Rietz, the first president of the Institute of Mathematical Statistics). I talked about Topological Inference. I will post the slides on my web page soon. I was fortunate to be introduced by my good friend Ed George.

2. Minimaxity

One thing that came up at my talk, and also in several discussions I had with people at the conference, is the following problem. Recall that the minimax risk is

\displaystyle  R=\inf_{\hat \theta}\sup_{P\in {\cal P}}E_P[ L(\theta(P),\hat\theta)]

where {L} is a loss function, {{\cal P}} is a set of distributions and the infimum is over all estimators. Often we can define an estimator that achieves the minimax rate for a problem. But the estimator that achieves the risk {R} may not be computable in any practical sense. This happens in some of the topological problems I discussed. It also happens in sparse PCA problems. I suspect we are going to come across this issue more and more.
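
For a concrete (and easy) example, take squared error loss and {{\cal P} = \{N(\theta,1):\ \theta\in\mathbb{R}\}}: then {R = 1/n}, achieved by the sample mean, which is trivially computable. The trouble is that in problems like sparse PCA, roughly speaking, the estimators known to achieve the minimax rate involve searching over exponentially many subsets of variables, which is not computationally feasible.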

I suggested to a few people that we should really define the minimax risk by restricting {\hat\theta} to be an estimator that can be computed in polynomial time. But how would we then compute the minimax risk? Some people, like Mike Jordan and Martin Wainwright, have made headway in studying the tradeoff between optimality and computation. But a general theory will be hard. I did mention this to Aad van der Vaart at dinner one night. If anyone can make headway with this problem, it is Aad.

3. Montreal

The conference was in Montreal. If you have never been there, Montreal is a very pleasant city; vibrant and full of good restaurants. We had a very nice time including good dinners and cigars at a classy cigar bar.

Rant: The only bad thing about Montreal is the tyrannical language police. (As a dual citizen, I think I am allowed to criticize Canada!) Put a sign on your business only in English and go to jail. Very tolerant. This raises lots of interesting problems: Is “pasta” English? What if you are using it as a proper name, as in naming your business: Pasta! And unless you live in North Korea, why should anyone be able to tell you what language to put on your private business? We heard one horror story that is almost funny. A software company opened in Montreal. The language police checked and approved their signage. (Yes. They really pay people to do this.) Then the company was told it had to write its code in French. I’m not even sure what this means. Is C code English or French? Anyway, the company left Montreal.

To be fair, it’s not just Montreal that lives under the brutal rule of bureaucrats and politicians. Every country does. But Canada does lean heavily toward arbitrary (i.e., favoring special groups) rules. Shane Jensen pointed me to Can’tada, which lists stuff you can’t use in Canada.
End Rant.

Despite my rant, I think Montreal is a great city. I highly recommend it for travel.

4. Summary

The JSM is lots of fun. Yes, it is a bit overwhelming but where else can you find 6000 statisticians in one place? So next year, when I say I don’t want to go, someone remind me that I actually had a good time.