**BOOTSTRAPPING AND SUBSAMPLING: PART II**

In part I we discussed the bootstrap and we saw that sometimes it does not work. By “doesn’t work” I mean that the coverage of the nominal level confidence intervals does not approach as the sample size increases. In this post I’ll discuss subsampling which works under broader conditions than the bootstrap.

**1. Subsampling **

Suppose that . Let be some estimator of .

We need to make one main assumption:

Assume that converges in distribution to some continuous, non-degenerate distribution .

We do not need to know ; we only need to know that such a exists. We don’t even need to know but for simplicity I’ll assume that . So our assumption is that converges to some . Here is the subsampling algorithm.

- Choose a number such that
as .

- Draw observations (without replacement) from the data This is one subsample. Repeat this times. So we have subsamples, each of size . Denote these subsamples by .
- From each subsample we compute the estimator. This yields values
Note that each estimator is based on one subsample of size .

- Now define
where denotes the indicator function.

- Let and be the and quantiles of :
- Define the confidence interval

**Theorem:**

as .

Note: There are possible subsamples of size . In practice it is not necessary to use all of these subsamples. Usually, one selects, say, subsamples at random.

**2. Why It Works **

Here is an outline of the proof of the Theorem. (See Chapter 2 of Politis, Romano and Wolf 1999 for details.) Define

By assumption . Also, since as . The key fact is that as . To see this, note that

where . This follows since and since is converging in distribution. So,

where

By Hoeffding’s inequality for U-statistics, we have that, for every ,

which goes to 0 since as . So

The coverage of is

But

The amazing thing about subsampling is that there are very few regularity conditions. It works under almost no conditions. This is in contrast to the bootstrap which does require fairly strong conditions to yield valid confidence intervals.

**3. What’s the Catch? **

There are a few catches. First, you need to choose . The only requirement is that and as . It’s great that it works for such a large range of values but we need a way to pick .

Fortunately, there are methods for choosing . I won’t go into detail, rather, I’ll point you to Chapter 9 of Politis, Romano, and (1999) and also Bickel and Sakov (2008).

A more serious problem is that the confidence guarantee is only an asymptotic guarantee. However, this is true of many statistical methods. I prefer to use methods with finite sample guarantees whenever possible but sometimes there is no choice but to resort to large sample approximations. Come to think of it, a good topic for a future blog post would be: for which problems are there accurate methods with finite sample guarantees?

Another concern about subsampling and bootstrapping is that they are computationally intensive. For a recent and very interesting approach to dealing with the computational burden, see this recent paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Mike Jordan.

By the way, see this paper by Joe Romano and Azeem Shaikh for some recent theoretical advances.

**4. Conclusion **

The bootstrap and subsampling are powerful techniques for constructing confidence intervals. Subsampling works under weaker conditions and relies on simpler theory. However, subsampling requires choosing a tuning parameter (the size of the subsamples) which probably explains why it is less popular. I think that if we had a fast method for automatically choosing , then subsampling would replace bootstrapping as the method of choice.

** References **

Bickel, P.J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. *Statistica Sinica*, 18, 967-985.

Politis, D.N. and Romano, J.P. and Wolf, M. (1999). *Subsampling*, Springer Verlag.

## 14 Comments

For those of us who might have difficulty tracking down “Politis, Romano (1999)” and “Bickel and Sakov (2008)”, do you have any practical advice on choosing a b?

Bickel and Sakov (2008) is a paper, available here: http://www.stat.berkeley.edu/~bickel/BS2008SS.pdf

Thanks Martin

A summary of the method for picking b from the Bickel and Sakov paper:

1) For fixed 0<q<1, let b_k = ceiling(q^k * n), for all k=0,1,2,…

2) For each b_k, find L_{b_k} (the bootstrap CDF of the test statistic with b_k samples; I believe they use a sampling with replacement version). I'm not sure whether they do the same normalization by sqrt(b_k) that Larry described, but it might not matter since it corresponds to a rescaling of t.

3) Let \hat{b} = \argmin_{b_k} \sup_t | L_{b_k}(t) – L_{b_{k+1}}(t) |. Beak ties by picking the largest b_k.

Instead of \sup_t | F(t)-G(t) | they allow other "metrics consistent with convergence in law", but only consider the sup difference in the theory. I'm not sure if there is a way to compute this distance without considering all values of t where there is a discontinuity, i.e. all 2*N samples that were used to estimate L_{b_k} and L_{b_{k+1}}. Plus one would have to do that for each k…

I hope I got everything right.

I guess the idea is that when b starts becoming "too large", we start observing big changes in the distribution of the test statistic due to bias (or something like that).

Don’t you need to know \beta when constructing the subsampling distribution L?

No. You can actually estimate that too.

This is really helpful, thanks. Back to an older question of mine, would you say that subsampling is more promising for “bunching” estimates that people try to do from big data, distributions these days? So far, bootstrapping was what all there was, see e.g. http://obs.rc.fas.harvard.edu/chetty/denmark_adjcost.pdf

I also wonder whether this is could be used from regression discontinuity estimates, the literature seems to be heading into a different direction, e.g. http://www-personal.umich.edu/~cattaneo/papers/RD-biascorrection.pdf

And by the way, if we are “sticking to the data” this much, are we getting standard errors (closer to those) conditional on covariates, or not?

I basically mean this paper: http://www.nber.org/papers/w17442.pdf

I would love to see this lead into a post about bagging. Bagging is obviously a technique widely used in machine learning (which as you mentioned the bootstrap is not) and I would love to hear this type of discussion on it.

First of all, thanks for writing this – really interesting. And a question – what if your data aren’t i.i.d.? There are bootstrap flavours for that, and I wonder if there are subsampling flavours as well.

Yes. That was one of the motivations for subsampling, it works on dependent data.

You need to divide the data into blocks and subsample blocks.

Most of the Romano book is actually about the dependent case.

If you google “time series subsampling” you’ll

find lots of good stuff.

Great, thanks!

I see a problem with this kind of asymptotic result that is not a problem with all asymptotic results in statistics (but with some others, too). This is that in reality we have a fixed n and a fixed b, and the asymptotic relies on *how b changes with n*. Now the interpretation of increasing n is clear: We observe more and get more information. No statistician would *choose* to have a small n if large n is available (OK, let’s say I don’t bother for the moment with situations where she would). However, what happens to b if n goes to infinity is actually *decided* by the statistician. And it is only *hypothetically* decided by the statistician if in reality there is only one fixed n and one fixed b. So if my n is 500 and I choose b to be 250, whether the asymptotic result holds or not depends on what b I’d choose for n=1,000, 10,000, 1 billion and so on, which I don’t need to bother about because that’s not the n that I have.

So how useful is an asymptotic result in a given real situation if its validity depends on how the reader believes the statistician would choose b with much larger n, which is a choice that in fact the statistician never has to make?

I agree. I think with some work, something could be said about the finite sample properties although

it would not be easy.

If you’re interested, there is a user mailing list feed on subsampling in Stata referencing this post and discussing pros and cons and caveats:

http://www.stata.com/statalist/archive/2013-08/msg00555.html (and nexts in threads, of course)

## One Trackback

[…] – Sobre bootstraping I e II; […]