One of the biggest embarrassments in statistics is that we don’t really have confidence bands for nonparametric function estimation. This is a fact that we tend to sweep under the rug.

Consider, for example, estimating a density $p$ from a sample $X_1,\ldots,X_n \sim p$. The kernel estimator with kernel $K$ and bandwidth $h$ is

$$\hat{p}_h(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h} K\!\left(\frac{x-X_i}{h}\right).$$

Let’s start with getting a confidence interval at a single point $x$.

Let $\overline{p}_h(x) = \mathbb{E}[\hat{p}_h(x)]$ be the mean of the estimator and let $s_n(x)$ be the standard deviation. (A consistent estimator of $s_n(x)$ is $\hat{s}_n(x) = \hat{\sigma}(x)/\sqrt{n}$, where $\hat{\sigma}(x)$ is the sample standard deviation of $Y_1(x),\ldots,Y_n(x)$, where $Y_i(x) = \frac{1}{h}K\!\left(\frac{x-X_i}{h}\right)$.)

Now,

$$\frac{\hat{p}_h(x) - p(x)}{s_n(x)} = \frac{\hat{p}_h(x) - \overline{p}_h(x)}{s_n(x)} + \frac{\overline{p}_h(x) - p(x)}{s_n(x)} \equiv Z_n(x) + \frac{b(x)}{s_n(x)},$$

where $b(x) = \overline{p}_h(x) - p(x)$ is the bias. By the central limit theorem, the first term is approximately standard Normal: $Z_n(x) \approx N(0,1)$.

This suggests the confidence interval $\hat{p}_h(x) \pm z_{\alpha/2}\,\hat{s}_n(x)$.
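As a concrete sketch of this pointwise interval (my own illustration, not code from the post): with a Gaussian kernel, the estimate at $x$ is just the average of the $Y_i(x)$, and the naive interval uses their sample standard deviation. The kernel choice, sample size, bandwidth, and evaluation point below are arbitrary choices of mine.

```python
import numpy as np
from statistics import NormalDist

def kde_pointwise_ci(data, x, h, alpha=0.05):
    """Kernel density estimate at a point x with the naive plug-in normal
    interval: mean of Y_i = K((x - X_i)/h)/h, plus or minus z_{alpha/2}
    times the sample sd of the Y_i divided by sqrt(n)."""
    z = (x - data) / h
    Y = np.exp(-0.5 * z**2) / (np.sqrt(2 * np.pi) * h)  # Gaussian kernel
    est = Y.mean()
    se = Y.std(ddof=1) / np.sqrt(len(data))
    zcrit = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    return est, est - zcrit * se, est + zcrit * se

rng = np.random.default_rng(0)
data = rng.normal(size=2000)  # sample from N(0, 1)
est, lo, hi = kde_pointwise_ci(data, x=0.0, h=0.3)
```

Note that even here the interval is centered at the estimate, whose mean is the smoothed density $p_h(0)$ rather than the true $p(0)$, which is exactly the bias problem discussed next.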

The problem is the second term $b(x)/s_n(x)$. Recall that the optimal bandwidth balances the bias and the variance. This means that if we use the best bandwidth, then $b(x)/s_n(x)$ behaves like a constant, that is, $b(x)/s_n(x) \to c$ for some (unknown) constant $c \neq 0$. The result is that, even as $n \to \infty$, the confidence interval is centered at $\overline{p}_h(x)$ instead of $p(x)$. As a result, the coverage will be less than the intended coverage $1-\alpha$.

We could in principle solve this problem by undersmoothing. If we choose a bandwidth smaller than the optimal bandwidth, then $b(x)/s_n(x) \to 0$ and the problem disappears. But this creates two new problems. First, the confidence interval is now centered around a non-optimal estimator. Second, and more importantly, we don’t have any practical, data-driven methods for choosing an undersmoothed bandwidth. There are some proposals in the theoretical literature but they are, frankly, not very practical.
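To see the undersmoothing effect numerically (my own illustration, not from the post): for a standard normal density and a Gaussian kernel, the smoothed density at $0$ is $1/\sqrt{2\pi(1+h^2)}$, so the bias there is known in closed form, and the standard deviation of the estimator at $0$ is approximately $\sqrt{p(0)/(2\sqrt{\pi}\,nh)}$. With the MISE-optimal rate $h \propto n^{-1/5}$, the bias-to-sd ratio stays roughly constant as $n$ grows; with the undersmoothed choice $h \propto n^{-1/3}$, it tends to zero.

```python
import math

def bias_sd_ratio(n, h):
    """Bias/sd of the Gaussian-kernel density estimate at x = 0 when the
    true density is standard normal. The bias is exact (Gaussian convolved
    with Gaussian); the sd uses the usual first-order approximation."""
    p0 = 1 / math.sqrt(2 * math.pi)                # true density at 0
    ph0 = 1 / math.sqrt(2 * math.pi * (1 + h**2))  # smoothed density at 0
    bias = ph0 - p0
    sd = math.sqrt(p0 / (2 * math.sqrt(math.pi) * n * h))
    return abs(bias) / sd

for n in [10**3, 10**4, 10**5, 10**6]:
    r_opt = bias_sd_ratio(n, n ** (-1/5))    # optimal-rate bandwidth
    r_under = bias_sd_ratio(n, n ** (-1/3))  # undersmoothed bandwidth
    # r_opt stays roughly constant; r_under shrinks toward zero
    print(n, round(r_opt, 3), round(r_under, 4))
```

The price, as noted above, is that the undersmoothed estimator itself is no longer rate-optimal, and nothing here tells us how to pick the undersmoothed bandwidth from data.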

Of course, all of these problems are only worse when we consider simultaneous confidence bands for the whole function.

The same problem arises if we use Bayesian inference. In short, the 95 percent Bayesian interval will not have 95 percent frequentist coverage. The reasons are the same as for the frequentist interval. (One can of course “declare” that coverage is not important and then there is no problem. But I am assuming that we do want correct frequentist coverage.)

The only good solution I know is to simply admit that we can’t create a confidence interval for $p(x)$. Instead, we simply interpret $\hat{p}_h(x) \pm z_{\alpha/2}\,\hat{s}_n(x)$ as a confidence interval for $p_h(x)$, which is a smoothed out (biased) version of $p(x)$:

$$p_h(x) = \overline{p}_h(x) = \int \frac{1}{h} K\!\left(\frac{x-u}{h}\right) p(u)\, du \approx p(x) + c\, h^2,$$

where $c$ is an unknown constant. In other words, we just live with the bias. (Of course, the bias gets smaller as the sample size $n$ gets larger and $h$ gets smaller.)

Once we take this point of view, we can easily get a confidence band using the bootstrap. We draw $X_1^*,\ldots,X_n^*$ from the empirical distribution and compute the density estimator $\hat{p}_h^*$. By repeating this many times, we can find the quantile $z_\alpha$ defined by

$$P\!\left(\sqrt{nh}\,\sup_x |\hat{p}_h^*(x) - \hat{p}_h(x)| > z_\alpha \,\Big|\, X_1,\ldots,X_n\right) = \alpha.$$

The confidence band is then $(\ell(x),\, u(x))$ where

$$\ell(x) = \hat{p}_h(x) - \frac{z_\alpha}{\sqrt{nh}}, \qquad u(x) = \hat{p}_h(x) + \frac{z_\alpha}{\sqrt{nh}}.$$

We then have that

$$P\Big(\ell(x) \le p_h(x) \le u(x)\ \text{for all } x\Big) \to 1-\alpha$$

as $n \to \infty$. Again, this is a confidence band for $p_h$ rather than for $p$.
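A sketch of this bootstrap band in Python (the Gaussian kernel, evaluation grid, bandwidth, and number of bootstrap replications below are arbitrary choices of mine, not prescribed by the post):

```python
import numpy as np

def kde(data, grid, h):
    """Gaussian-kernel density estimate evaluated on a grid of points."""
    z = (grid[:, None] - data[None, :]) / h
    return np.exp(-0.5 * z**2).mean(axis=1) / (np.sqrt(2 * np.pi) * h)

def bootstrap_band(data, grid, h, alpha=0.05, B=500, seed=0):
    """Bootstrap confidence band for the smoothed density p_h.

    z_alpha is the upper-alpha quantile of sqrt(nh) * sup_x |p*_h - p_h|
    over resamples drawn from the empirical distribution."""
    rng = np.random.default_rng(seed)
    n = len(data)
    p_hat = kde(data, grid, h)
    sups = np.empty(B)
    for b in range(B):
        resample = rng.choice(data, size=n, replace=True)
        sups[b] = np.sqrt(n * h) * np.abs(kde(resample, grid, h) - p_hat).max()
    z_alpha = np.quantile(sups, 1 - alpha)
    half = z_alpha / np.sqrt(n * h)
    return p_hat - half, p_hat + half

rng = np.random.default_rng(1)
data = rng.normal(size=500)
grid = np.linspace(-3, 3, 61)
lower, upper = bootstrap_band(data, grid, h=0.4)
```

Note that the band has constant width $2z_\alpha/\sqrt{nh}$, centered at $\hat{p}_h$; it is a band for the smoothed $p_h$, not for $p$ itself.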

**Summary:** There really is no good, practical method for getting true confidence bands in nonparametric function estimation. The reason is that the bias does not disappear fast enough as the sample size increases. The solution, I believe, is to just acknowledge that the confidence bands have a slight bias. Then we can use the bootstrap (or other methods) to compute the band. Like Dr. Strangelove, we just learn to stop worrying and love the bias.

## 16 Comments

Neat, Cochran noted the same problem with confidence intervals from non-randomised studies.

Unless you can bring in background/external information about the bias – you’re stuck with unquantified uncertainty.

Hi Keith, could you provide the Cochran reference? Much thanks!

This looks like it http://www.amazon.com/Planning-Analysis-Observational-Probability-Statistics/dp/0471887196

Though a more direct connection would be Peirce’s theory of realism: since you can’t ever get at what’s “real” or know it, you can instead define the truth as what an ongoing community of (infinite) investigators would eventually settle on.

One could wonder whether the smoothed density is a better target for inference than the true density with all kinds of potential irregularities anyway. A huge problem with densities is their striking non-robustness – in any reasonable neighbourhood of a distribution with a certain density there are distributions without density at all, or with locally arbitrarily different densities. Does anybody know whether smoothed densities (based on non-vanishing h, if necessary) are more robust in this respect?

You’re right. The mapping from a distribution to its density is highly discontinuous, but the map to the smoothed density is continuous in the $L_\infty$ norm.

Some actually insist on this (the mapping from the distribution to the density being continuous) to ensure that it approximates finite objects, given that in statistical applications all objects are finite.

I guess the core issue is that the space of densities is just too large. I don’t know much about confidence bands for densities, but for nonparametric regression, if one assumes that the regression function belongs to a known smoothness class and the noise has sub-Gaussian tails with a known “sub-Gaussianity” constant, there seems to be no problem with coming up with confidence bands. An example of how this can be done is to assume that the regression function belongs to a known RKHS. Then one may consider the ridge regression estimator that uses the squared RKHS norm as the regularizer. With appropriate tuning of the regularizer, this estimator will achieve the optimal rate for this given class of regression problems. Then, with some work, one can show that a confidence ellipsoid that generalizes what would work in the finite-dimensional case will still be valid, and with some algebra one can also get simultaneous confidence bands for the unknown regression function. Details of how this can be done can be found in the PhD thesis of my former student, Yasin Abbasi-Yadkori (the thesis can be downloaded from here: https://era.library.ualberta.ca/public/datastream/get/uuid:c21661ea-0ecc-4ea8-9a9d-2b1b312a36e1/DS3; it extends our previous NIPS paper, where we considered linear bandit problems and needed the confidence bands for these). Although these are “honest” confidence bands, it should not be hard to “soften” them up to get confidence bands with “asymptotically correct” coverage.

Exactly the same problem happens in nonparametric regression; it’s the same analysis. You can’t get correct bands in a Sobolev space without undersmoothing. Furthermore, you can’t even adapt to the size of the Sobolev ball using the data. See: Mark Low (1997), Annals of Statistics, p. 2547.

From a quick look, it appears to me that your student constructed a confidence ball. This does not give a band.

Theorem 3.11 in his thesis gives simultaneous confidence bands (based on the ball/ellipsoid in the RKHS). I will check out the paper, thanks for the reference.

I could be mistaken but I thought Kendall, in volume I of his series, developed a standard deviation for the median.

Yes, it is easy to get an exact confidence interval for the median. But this is unrelated to function estimation.

The bootstrap examples again started with resampling from the empirical distribution of the “micro data”.

But if one is interested in a property of the density (e.g. excess mass at/around a prespecified point), is it wrong to calculate frequencies by bins, fit some flexible parametric form, and bootstrap with resampling the error, where the error is the discrepancy between the frequency in the bin vs. the predicted density? This seems to be what Chetty et al. describe on page 23 (p. 25 of the PDF) of an influential paper [1], with their posted code [2] used in other papers since.

Does this “macro level”, aggregate (block?) bootstrap do something different from resampling the empirical CDF? In any case, is the error in the parametric fit of the density the only noise we care about in the problem if we think that individual values (not frequencies) themselves have noise in them?

I also forked this comment into a question on Cross Validated [3]. Also note the related question [4] about whether the estimand is regular enough to bootstrap or subsample in the first place, and with what rate of convergence (motivated by the Normal Deviate posts from January).

[1]: http://obs.rc.fas.harvard.edu/chetty/denmark_adjcost_nber.pdf

[2]: http://obs.rc.fas.harvard.edu/chetty/bunch_count.zip

[3]: http://stats.stackexchange.com/q/69307/6534

[4]: http://stats.stackexchange.com/questions/67613/is-excess-mass-estimation-smooth-enough-to-bootstrap-at-what-rate-might-a-bunch

That seems unnecessary and kind of complicated, and it would add additional bias. The regular bootstrap as I described should work fine.

Thanks! Any hints or thoughts on the source or the magnitude of the bias?

On the complication: I think the procedure was partly motivated by the preference to work on collapsed/aggregate data (much, much faster), namely frequencies by bins. You would say those are not sufficient statistics of the density, or at least that the bootstrap procedure wrongly introduces within-bin correlation? (Though if all that matters is the frequency in the bin, how bad is that?)

Larry – I haven’t yet read this recent paper but it seems relevant to this discussion “A simple bootstrap method for constructing nonparametric confidence bands for functions” by Peter Hall and Joel Horowitz (http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aos/1378386242). I’d be interested to hear your thoughts on it if you have read it.

Thanks for the pointer. I’ll check it out.
