## Bootstrapping and Subsampling: Part II


In Part I we discussed the bootstrap and saw that it sometimes does not work. By “doesn’t work” I mean that the coverage of the nominal ${1-\alpha}$ level confidence intervals does not approach ${1-\alpha}$ as the sample size increases. In this post I’ll discuss subsampling, which works under weaker conditions than the bootstrap.

1. Subsampling

Suppose that ${X_1,\ldots, X_n \sim P}$. Let ${\hat\theta_n = g(X_1,\ldots,X_n)}$ be some estimator of ${\theta = T(P)}$.

We need to make one main assumption:

Assume that ${n^\beta (\hat\theta_n - \theta)}$ converges in distribution to some continuous, non-degenerate distribution ${J}$.

We do not need to know ${J}$; we only need to know that such a ${J}$ exists. We don’t even need to know ${\beta}$ but for simplicity I’ll assume that ${\beta=1/2}$. So our assumption is that ${\sqrt{n} (\hat\theta_n - \theta)}$ converges to some ${J}$. Here is the subsampling algorithm.

1. Choose a number ${b = b_n}$ such that $\displaystyle b_n \rightarrow \infty,\ \ \ \ \frac{b_n}{n}\rightarrow 0$

as ${n\rightarrow\infty}$.

2. Draw ${b}$ observations (without replacement) from the data ${\{X_1,\ldots, X_n\}}$. This is one subsample. Repeat this ${N}$ times, so that we have ${N}$ subsamples, each of size ${b}$. Denote these subsamples by ${S_1,\ldots, S_N}$.
3. From each subsample ${S_j}$ we compute the estimator. This yields ${N}$ values $\displaystyle \hat\theta_b^1,\ldots, \hat\theta_b^N.$

Note that each estimator is based on one subsample of size ${b}$.

4. Now define $\displaystyle L_n(t) = \frac{1}{N}\sum_{j=1}^N I\Bigl( \sqrt{b} (\hat\theta_b^j - \hat\theta_n) \leq t\Bigr)$

where ${I}$ denotes the indicator function.

5. Let ${\hat t_{\alpha/2}}$ and ${\hat t_{1-\alpha/2}}$ be the ${\alpha/2}$ and ${1-\alpha/2}$ quantiles of ${L_n}$: $\displaystyle \hat t_{\alpha/2} = L_n^{-1}(\alpha/2),\ \ \ \hat t_{1-\alpha/2} = L_n^{-1}(1-\alpha/2).$

6. Define the confidence interval $\displaystyle C_n = \left[\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right].$

Theorem: $\displaystyle \mathop{\mathbb P}(\theta\in C_n)\rightarrow 1-\alpha$

as ${n\rightarrow\infty}$.

Note: There are ${N = \binom{n}{b}}$ possible subsamples of size ${b}$. In practice it is not necessary to use all of these subsamples. Usually, one selects, say, ${N=1,000}$ subsamples at random.
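The six steps above fit in a few lines of code. This is a minimal sketch of my own, not from the post: the function name and the median example are illustrative, and it hard-codes the ${\beta=1/2}$ rate assumed in the text.

```python
import numpy as np

def subsample_ci(x, estimator, b, N=1000, alpha=0.05, rng=None):
    """Subsampling confidence interval for theta = T(P), assuming the
    root sqrt(n)(theta_hat - theta) converges in distribution (beta = 1/2)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    n = len(x)
    theta_hat = estimator(x)  # \hat\theta_n on the full sample

    # Steps 2-4: N subsamples of size b, drawn without replacement,
    # each giving one centered, scaled value sqrt(b)(\hat\theta_b^j - \hat\theta_n).
    roots = np.empty(N)
    for j in range(N):
        s = rng.choice(x, size=b, replace=False)
        roots[j] = np.sqrt(b) * (estimator(s) - theta_hat)

    # Step 5: quantiles of the subsampling distribution L_n.
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])

    # Step 6: C_n = [theta_hat - t_hi/sqrt(n), theta_hat - t_lo/sqrt(n)].
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

# Example: 95% interval for the median of Exponential(1) data.
rng = np.random.default_rng(0)
x = rng.exponential(size=500)
lo, hi = subsample_ci(x, np.median, b=50, rng=rng)
```

Note that the interval reverses the quantiles (the upper quantile ${\hat t_{1-\alpha/2}}$ sets the lower endpoint), exactly as in step 6.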

2. Why It Works

Here is an outline of the proof of the Theorem. (See Chapter 2 of Politis, Romano and Wolf 1999 for details.) Define $\displaystyle J_n(x) = \mathbb{P}( \sqrt{n}(\hat\theta_n - \theta) \leq x).$

By assumption ${J_n(x) = J(x) + o(1)}$. Also, ${J_b(x) = J(x) + o(1)}$ since ${b\rightarrow\infty}$ as ${n\rightarrow\infty}$. The key fact is that ${L_n(x)-J_b(x)\stackrel{P}{\rightarrow}0}$ as ${n\rightarrow \infty}$. To see this, note that $\displaystyle \sqrt{b} (\hat\theta_b - \hat\theta_n) = \sqrt{b} (\hat\theta_b - \theta) + \sqrt{\frac{b}{n}}\ \sqrt{n}(\theta-\hat\theta_n)= \sqrt{b} (\hat\theta_b - \theta) + R_n$

where ${R_n= o_P(1)}$. This follows since ${b/n \rightarrow 0}$ and since ${\sqrt{n}(\theta-\hat\theta_n)}$ is converging in distribution. So, $\displaystyle L_n(x) \approx U(x)$

where $\displaystyle U(x) = \frac{1}{N}\sum_{j=1}^N I( \sqrt{b} (\hat\theta_b^j - \theta) \leq x).$

By Hoeffding’s inequality for U-statistics, we have that, for every ${\epsilon>0}$, $\displaystyle \mathbb{P} ( |U(x) - J_b(x)| > \epsilon) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{b}\right)$

which goes to 0 since ${n/b\rightarrow \infty}$ as ${n\rightarrow\infty}$. So $\displaystyle L_n(x) \approx U(x) \approx J_b(x) \approx J_n(x) \approx J(x).$

The coverage of ${C_n}$ is $\displaystyle P\left(\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right) = P(\hat{t}_{\alpha/2} \leq \sqrt{n}(\hat\theta_n-\theta) \leq \hat{t}_{1-\alpha/2})= J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}).$

But $\displaystyle J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}) \approx L_n(\hat t_{1-\alpha/2}) - L_n(\hat t_{\alpha/2}) = \left(1-\frac{\alpha}{2}\right) - \frac{\alpha}{2} = 1-\alpha.$
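The chain of approximations ${L_n \approx J}$ can be seen numerically. Here is a small sketch (my own illustration, not part of the proof): for the sample mean of Exponential(1) data, ${\sigma^2=1}$, so ${J = N(0,1)}$ and the quantiles of ${L_n}$ should land near ${\pm 1.96}$.

```python
import numpy as np

# For the sample mean of Exponential(1) data, sqrt(n)(theta_hat - theta)
# converges to N(0, 1), so the subsampling distribution L_n should be
# approximately standard normal.
rng = np.random.default_rng(1)
n, b, N = 2000, 100, 2000
x = rng.exponential(size=n)
theta_hat = x.mean()

# Centered, scaled subsample statistics sqrt(b)(theta_b^j - theta_n).
roots = np.array([
    np.sqrt(b) * (rng.choice(x, size=b, replace=False).mean() - theta_hat)
    for _ in range(N)
])

# The 2.5% and 97.5% quantiles of L_n, versus the N(0,1) values -1.96, 1.96.
print(np.quantile(roots, [0.025, 0.975]))
```

Because Exponential(1) is skewed, the agreement is only approximate at ${b=100}$; it improves as ${b}$ grows while ${b/n}$ stays small, which is exactly what the conditions on ${b_n}$ require.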

The amazing thing about subsampling is how few regularity conditions it requires. This is in contrast to the bootstrap, which does require fairly strong conditions to yield valid confidence intervals.

3. What’s the Catch?

There are a few catches. First, you need to choose ${b}$. The only requirement is that ${b/n\rightarrow 0}$ and ${b\rightarrow\infty}$ as ${n\rightarrow\infty}$. It’s great that it works for such a large range of values but we need a way to pick ${b}$.

Fortunately, there are methods for choosing ${b}$. I won’t go into detail; rather, I’ll point you to Chapter 9 of Politis, Romano, and Wolf (1999) and also Bickel and Sakov (2008).
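To give the flavor of one such rule, here is a rough sketch in the spirit of Bickel and Sakov (2008): compute the subsampling distribution along a geometric grid ${b_k = \lceil q^k n\rceil}$ and pick the ${b_k}$ whose distribution is closest, in sup distance, to the next one. This only loosely follows the paper (the metric, the sampling convention, and all names here are my own choices), so treat it as an illustration rather than their procedure.

```python
import numpy as np

def subsample_dist(x, estimator, b, N, rng):
    """Sorted replicates sqrt(b)(theta_b - theta_n) from N subsamples of size b."""
    theta_hat = estimator(x)
    return np.sort([
        np.sqrt(b) * (estimator(rng.choice(x, size=b, replace=False)) - theta_hat)
        for _ in range(N)
    ])

def ks_distance(a, b):
    """sup_t |F_a(t) - F_b(t)| between two empirical CDFs (sorted inputs)."""
    grid = np.concatenate([a, b])
    Fa = np.searchsorted(a, grid, side="right") / len(a)
    Fb = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(Fa - Fb).max()

def choose_b(x, estimator, q=0.75, N=500, b_min=20, rng=None):
    """Pick b from the grid b_k = ceil(q^k * n) by minimizing the sup
    distance between consecutive subsampling distributions."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    bs, k = [], 0
    while True:
        b = int(np.ceil(q ** k * n))
        if b < b_min:
            break
        if b < n:  # skip b = n, which gives a degenerate comparison
            bs.append(b)
        k += 1
    dists_by_b = [subsample_dist(x, estimator, b, N, rng) for b in bs]
    d = [ks_distance(dists_by_b[i], dists_by_b[i + 1]) for i in range(len(bs) - 1)]
    # bs is decreasing, so argmin's first-match rule breaks ties
    # in favor of the largest b_k.
    return bs[int(np.argmin(d))]

rng = np.random.default_rng(0)
x = rng.exponential(size=500)
b_hat = choose_b(x, np.mean, rng=rng)
```

The heuristic behind the rule: while ${b}$ is in the valid range, consecutive subsampling distributions barely change; once ${b}$ gets too large, they start to move, so the point of minimal change is a sensible choice.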

A more serious problem is that the confidence guarantee is only an asymptotic guarantee. However, this is true of many statistical methods. I prefer to use methods with finite sample guarantees whenever possible but sometimes there is no choice but to resort to large sample approximations. Come to think of it, a good topic for a future blog post would be: for which problems are there accurate methods with finite sample guarantees?

Another concern about subsampling and bootstrapping is that they are computationally intensive. For a very interesting approach to dealing with the computational burden, see this recent paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Mike Jordan.

By the way, see this paper by Joe Romano and Azeem Shaikh for some recent theoretical advances.

4. Conclusion

The bootstrap and subsampling are powerful techniques for constructing confidence intervals. Subsampling works under weaker conditions and relies on simpler theory. However, subsampling requires choosing a tuning parameter (the size ${b}$ of the subsamples) which probably explains why it is less popular. I think that if we had a fast method for automatically choosing ${b}$, then subsampling would replace bootstrapping as the method of choice.

References

Bickel, P.J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18, 967-985.

Politis, D.N., Romano, J.P. and Wolf, M. (1999). Subsampling. Springer-Verlag.

1. peterthenelson
Posted January 28, 2013 at 2:21 am | Permalink

For those of us who might have difficulty tracking down “Politis, Romano (1999)” and “Bickel and Sakov (2008)”, do you have any practical advice on choosing a b?

• Martin Azizyan
Posted January 28, 2013 at 9:01 am | Permalink

Bickel and Sakov (2008) is a paper, available here: http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf

• normaldeviate
Posted January 28, 2013 at 9:19 am | Permalink

Thanks Martin

2. Martin Azizyan
Posted January 28, 2013 at 6:01 am | Permalink

A summary of the method for picking b from the Bickel and Sakov paper:

1) For fixed 0<q<1, let b_k = ceiling(q^k * n), for all k=0,1,2,…
2) For each b_k, find L_{b_k} (the bootstrap CDF of the test statistic with b_k samples; I believe they use a sampling with replacement version). I'm not sure whether they do the same normalization by sqrt(b_k) that Larry described, but it might not matter since it corresponds to a rescaling of t.
3) Let \hat{b} = \argmin_{b_k} \sup_t | L_{b_k}(t) – L_{b_{k+1}}(t) |. Break ties by picking the largest b_k.

Instead of \sup_t | F(t)-G(t) | they allow other "metrics consistent with convergence in law", but only consider the sup difference in the theory. I'm not sure if there is a way to compute this distance without considering all values of t where there is a discontinuity, i.e. all 2*N samples that were used to estimate L_{b_k} and L_{b_{k+1}}. Plus one would have to do that for each k…

I hope I got everything right.

I guess the idea is that when b starts becoming "too large", we start observing big changes in the distribution of the test statistic due to bias (or something like that).

3. anon
Posted January 28, 2013 at 3:22 pm | Permalink

Don’t you need to know \beta when constructing the subsampling distribution L?

• normaldeviate
Posted January 28, 2013 at 3:23 pm | Permalink

No. You can actually estimate that too.

4. László Sándor
Posted January 29, 2013 at 10:18 am | Permalink

This is really helpful, thanks. Back to an older question of mine, would you say that subsampling is more promising for “bunching” estimates that people try to do from big data distributions these days? So far, bootstrapping was all there was, see e.g. http://obs.rc.fas.harvard.edu/chetty/denmark_adjcost.pdf

I also wonder whether this could be used for regression discontinuity estimates; the literature seems to be heading in a different direction, e.g. http://www-personal.umich.edu/~cattaneo/papers/RD-biascorrection.pdf

And by the way, if we are “sticking to the data” this much, are we getting standard errors (closer to those) conditional on covariates, or not?
I basically mean this paper: http://www.nber.org/papers/w17442.pdf

5. Andrew Beam
Posted January 30, 2013 at 9:03 pm | Permalink

I would love to see this lead into a post about bagging. Bagging is obviously a technique widely used in machine learning (which as you mentioned the bootstrap is not) and I would love to hear this type of discussion on it.

6. Mark Schaffer
Posted January 31, 2013 at 8:28 am | Permalink

First of all, thanks for writing this – really interesting. And a question – what if your data aren’t i.i.d.? There are bootstrap flavours for that, and I wonder if there are subsampling flavours as well.

• normaldeviate
Posted January 31, 2013 at 8:32 am | Permalink

Yes. That was one of the motivations for subsampling, it works on dependent data.
You need to divide the data into blocks and subsample blocks.
Most of the Romano book is actually about the dependent case.
If you google “time series subsampling” you’ll find lots of good stuff.

• Mark Schaffer
Posted January 31, 2013 at 8:49 am | Permalink

Great, thanks!

7. Christian Hennig
Posted February 1, 2013 at 7:46 am | Permalink

I see a problem with this kind of asymptotic result that is not a problem with all asymptotic results in statistics (but with some others, too). This is that in reality we have a fixed n and a fixed b, and the asymptotic relies on *how b changes with n*. Now the interpretation of increasing n is clear: We observe more and get more information. No statistician would *choose* to have a small n if large n is available (OK, let’s say I don’t bother for the moment with situations where she would). However, what happens to b if n goes to infinity is actually *decided* by the statistician. And it is only *hypothetically* decided by the statistician if in reality there is only one fixed n and one fixed b. So if my n is 500 and I choose b to be 250, whether the asymptotic result holds or not depends on what b I’d choose for n=1,000, 10,000, 1 billion and so on, which I don’t need to bother about because that’s not the n that I have.
So how useful is an asymptotic result in a given real situation if its validity depends on how the reader believes the statistician would choose b with much larger n, which is a choice that in fact the statistician never has to make?

• normaldeviate
Posted February 1, 2013 at 8:40 am | Permalink

I agree. I think with some work, something could be said about the finite sample properties although
it would not be easy.

8. Laszlo
Posted August 16, 2013 at 1:00 pm | Permalink

If you’re interested, there is a user mailing list feed on subsampling in Stata referencing this post and discussing pros and cons and caveats:
http://www.stata.com/statalist/archive/2013-08/msg00555.html (and the next messages in the thread, of course)