Monthly Archives: January 2013

Bootstrapping and Subsampling: Part II

BOOTSTRAPPING AND SUBSAMPLING: PART II

In part I we discussed the bootstrap and we saw that sometimes it does not work. By “doesn’t work” I mean that the coverage of the nominal ${1-\alpha}$ level confidence intervals does not approach ${1-\alpha}$ as the sample size increases. In this post I’ll discuss subsampling which works under broader conditions than the bootstrap.

1. Subsampling

Suppose that ${X_1,\ldots, X_n \sim P}$. Let ${\hat\theta_n = g(X_1,\ldots,X_n)}$ be some estimator of ${\theta = T(P)}$.

We need to make one main assumption:

Assume that ${n^\beta (\hat\theta_n - \theta)}$ converges in distribution to some continuous, non-degenerate distribution ${J}$.

We do not need to know ${J}$; we only need to know that such a ${J}$ exists. We don’t even need to know ${\beta}$ but for simplicity I’ll assume that ${\beta=1/2}$. So our assumption is that ${\sqrt{n} (\hat\theta_n - \theta)}$ converges to some ${J}$. Here is the subsampling algorithm.

1. Choose a number ${b = b_n}$ such that

$\displaystyle b_n \rightarrow \infty,\ \ \ \ \frac{b_n}{n}\rightarrow 0$

as ${n\rightarrow\infty}$.

2. Draw ${b}$ observations (without replacement) from the data ${\{X_1,\ldots, X_n\}}$ This is one subsample. Repeat this ${N}$ times. So we have ${N}$ subsamples, each of size ${b}$. Denote these subsamples by ${S_1,\ldots, S_N}$.
3. From each subsample ${S_j}$ we compute the estimator. This yields ${N}$ values

$\displaystyle \hat\theta_b^1,\ldots, \hat\theta_b^N.$

Note that each estimator is based on one subsample of size ${b}$.

4. Now define

$\displaystyle L_n(t) = \frac{1}{N}\sum_{j=1}^N I\Bigl( \sqrt{b} (\hat\theta_b^j - \hat\theta_n) \leq t\Bigr)$

where ${I}$ denotes the indicator function.

5. Let ${\hat t_{\alpha/2}}$ and ${\hat t_{1-\alpha/2}}$ be the ${\alpha/2}$ and ${1-\alpha/2}$ quantiles of ${L_n}$:

$\displaystyle \hat t_{\alpha/2} = L_n^{-1}(\alpha/2),\ \ \ \hat t_{1-\alpha/2} = L_n^{-1}(1-\alpha/2).$

6. Define the confidence interval

$\displaystyle C_n = \left[\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right].$

Theorem:

$\displaystyle \mathop{\mathbb P}(\theta\in C_n)\rightarrow 1-\alpha$

as ${n\rightarrow\infty}$.

Note: There are ${N = \binom{n}{b}}$ possible subsamples of size ${b}$. In practice it is not necessary to use all of these subsamples. Usually, one selects, say, ${N=1,000}$ subsamples at random.

2. Why It Works

Here is an outline of the proof of the Theorem. (See Chapter 2 of Politis, Romano and Wolf 1999 for details.) Define

$\displaystyle J_n(x) = \mathbb{P}( \sqrt{n}(\hat\theta_n - \theta) \leq x).$

By assumption ${J_n(x) = J(x) + o(1)}$. Also, ${J_b(x) = J(x) + o(1)}$ since ${b\rightarrow\infty}$ as ${n\rightarrow\infty}$. The key fact is that ${L_n(x)-J_b(x)\stackrel{P}{\rightarrow}0}$ as ${n\rightarrow \infty}$. To see this, note that

$\displaystyle \sqrt{b} (\hat\theta_b - \hat\theta_n) = \sqrt{b} (\hat\theta_b - \theta) + \sqrt{\frac{b}{n}}\ \sqrt{n}(\theta-\hat\theta_n)= \sqrt{b} (\hat\theta_b - \theta) + R_n$

where ${R_n= o_P(1)}$. This follows since ${b/n \rightarrow 0}$ and since ${\sqrt{n}(\theta-\hat\theta_n)}$ is converging in distribution. So,

$\displaystyle L_n(x) \approx U(x)$

where

$\displaystyle U(x) = \frac{1}{N}\sum_{j=1}^N I( \sqrt{b} (\hat\theta_b^j - \theta) \leq x).$

By Hoeffding’s inequality for U-statistics, we have that, for every ${\epsilon>0}$,

$\displaystyle \mathbb{P} ( |U(x) - J_b(x)| > \epsilon) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{b}\right)$

which goes to 0 since ${n/b\rightarrow \infty}$ as ${n\rightarrow\infty}$. So

$\displaystyle L_n(x) \approx U(x) \approx J_b(x) \approx J_n(x) \approx J(x).$

The coverage of ${C_n}$ is

$\displaystyle P\left(\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right) = P(\hat{t}_{\alpha/2} \leq \sqrt{n}(\hat\theta_n-\theta) \leq \hat{t}_{1-\alpha/2})= J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}).$

But

$\displaystyle J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}) \approx L_n(\hat t_{1-\alpha/2}) - L_n(\hat t_{\alpha/2}) = \left(1-\frac{\alpha}{2}\right) - \frac{\alpha}{2} = 1-\alpha.$

The amazing thing about subsampling is that there are very few regularity conditions. It works under almost no conditions. This is in contrast to the bootstrap which does require fairly strong conditions to yield valid confidence intervals.

3. What’s the Catch?

There are a few catches. First, you need to choose ${b}$. The only requirement is that ${b/n\rightarrow 0}$ and ${b\rightarrow\infty}$ as ${n\rightarrow\infty}$. It’s great that it works for such a large range of values but we need a way to pick ${b}$.

Fortunately, there are methods for choosing ${b}$. I won’t go into detail, rather, I’ll point you to Chapter 9 of Politis, Romano, and (1999) and also Bickel and Sakov (2008).

A more serious problem is that the confidence guarantee is only an asymptotic guarantee. However, this is true of many statistical methods. I prefer to use methods with finite sample guarantees whenever possible but sometimes there is no choice but to resort to large sample approximations. Come to think of it, a good topic for a future blog post would be: for which problems are there accurate methods with finite sample guarantees?

Another concern about subsampling and bootstrapping is that they are computationally intensive. For a recent and very interesting approach to dealing with the computational burden, see this recent paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Mike Jordan.

By the way, see this paper by Joe Romano and Azeem Shaikh for some recent theoretical advances.

4. Conclusion

The bootstrap and subsampling are powerful techniques for constructing confidence intervals. Subsampling works under weaker conditions and relies on simpler theory. However, subsampling requires choosing a tuning parameter (the size ${b}$ of the subsamples) which probably explains why it is less popular. I think that if we had a fast method for automatically choosing ${b}$, then subsampling would replace bootstrapping as the method of choice.

References

Bickel, P.J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18, 967-985.

Politis, D.N. and Romano, J.P. and Wolf, M. (1999). Subsampling, Springer Verlag.

Bootstrapping and Subsampling: Part I

BOOTSTRAPPING AND SUBSAMPLING: PART I

Bootstrapping and subsampling are in the “amazing” category in statistics. They seem much more popular in statistics than machine learning for some reason.

1. The Bootstrap

The bootstrap (a.k.a. the shotgun) was invented by Brad Efron. Here is how it works. We have data ${X_1,\ldots, X_n \sim P}$ and we want a confidence interval for ${\theta = T(P)}$. For example, ${\theta}$ could be the median of ${P}$ or the mean of ${P}$ or something more complicated like, the largest eigenvalue of the covariance matrix of ${P}$.

The ${1-\alpha}$ bootstrap confidence interval is

$\displaystyle C_n = \left[ \hat\theta - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right]$

where ${\hat\theta}$ is an estimator of ${\theta}$ and ${\hat t_{\alpha/2}}$ and ${\hat t_{1-\alpha/2}}$ are sample bootstrap quantiles that I will describe below. Before I explain this in more detail, notice two things. First, there is a minus sign in both the lower and upper endpoint. Second, the ${\alpha/2}$ and ${1-\alpha/2}$ quantiles are in the upper and lower endpoints, the reverse of what you might expect. The reason for the strange looking interval will be clear when we derive the interval.

Now for some details. Think of the parameter of interest ${\theta}$ as a function of the unknown distribution, which is why we write it as ${\theta = T(P)}$. Let ${P_n}$ denote the empirical distribution:

$\displaystyle P_n(A) = \frac{1}{n}\sum_{i=1}^n I_A(X_i).$

In other words, ${P_n}$ is the distribution that puts mass ${1/n}$ at each ${X_i}$.

The estimator is just the function ${T}$ applied to ${P_n}$, that is, ${\hat\theta = T(P_n)}$. For example, if ${\theta}$ is the median of ${P}$ then ${\hat\theta}$ is the median of ${P_n}$ which is just the sample median.

Now let

$\displaystyle R_n = \sqrt{n}(\hat\theta - \theta).$

We use ${R_n}$ because typically it converges in distribution to some well-defined distribution (such as a Normal). Now let ${H_n}$ denote the (unknown) distribution of ${R_n}$:

$\displaystyle H_n(t) = \mathbb{P}(R_n \leq t).$

Suppose, for a moment, that we did know ${H_n}$. We could then find the ${\alpha/2}$ quantile ${t_{\alpha/2}}$ and the ${1-\alpha/2}$ quantile ${t_{1-\alpha/2}}$, namely,

$\displaystyle t_{\alpha/2} = H_n^{-1}(\alpha/2)\ \ \ \mbox{and}\ \ \ t_{1-\alpha/2} = H_n^{-1}(1-\alpha/2).$

It follows that

$\displaystyle \mathbb{P}( t_{\alpha/2} \leq R_n \leq t_{1-\alpha/2}) = 1- \alpha.$

Continuing with the fantasy that we know ${H_n}$, define

$\displaystyle \Omega_n = \left[ \hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}}\right].$

Now I will show you that ${\Omega_n}$ is an exact ${1-\alpha}$ confidence interval. This follows since

$\displaystyle \mathbb{P}(\theta \in \Omega_n) = \mathbb{P}\left(\hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}}\right)= \mathbb{P}(t_{\alpha/2} \leq R_n \leq t_{1-\alpha/2}) = H_n(t_{1-\alpha/2}) - H_n(t_{\alpha/2})= (1-\alpha/2) - \alpha/2 = 1-\alpha.$

We engineered ${\Omega_n}$ so that the last line would be exactly ${1-\alpha}$. The strange form of ${\Omega_n}$ is explained by the fact that we really have a probability statement for ${R_n}$ which we then manipulate into the form of an interval for ${\theta}$. (You can check that if ${H_n}$ were standard Gaussian, then using the symmetry of the Gaussian, the interval could be re-written in the more familiar looking form ${\hat\theta\pm z_{\alpha/2}/\sqrt{n}}$ where ${z_{\alpha/2}}$ is the upper-tail quantile of a Normal.)

The problem is that we don’t know ${H_n}$ and hence we don’t know ${t_{\alpha/2}}$ or ${t_{1-\alpha/2}}$. The bootstrap is a method for approximating ${H_n}$. Let ${B}$ be a large number (for example, ${B=100,000}$.) Now do this:

1. Draw ${n}$ observations ${X_1^*,\ldots, X_n^*}$ from ${P_n}$ and compute ${\hat\theta^*}$ from these new data.
2. Repeat step 1 ${B}$ times yielding values ${\hat\theta_1^*,\ldots, \hat\theta_B^*}$.
3. Approximate ${H_n}$ with

$\displaystyle \hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}(\hat\theta_j^* - \hat\theta)\leq t\Bigr)$

where ${I}$ denotes the indicator function.

4. Find the quantiles ${\hat t_{\alpha/2}}$ and ${\hat t_{1-\alpha/2}}$ of ${\hat H_n}$ and construct ${C_n}$ as defined earlier.

The interval ${C_n}$ is the same as ${\Omega_n}$ except we use the estimated quantiles for ${C_n}$. What we are doing here is estimating ${H_n}$ by using ${P_n}$ as an estimate of ${P}$. (That’s why we draw ${X_1^*,\ldots, X_n^*}$ from ${P_n}$.) If ${\hat H_n}$ is close to ${H_n}$ then ${\hat t_{\alpha/2} \approx t_{\alpha/2}}$ and ${\hat t_{1-\alpha/2} \approx t_{1-\alpha/2}}$ and then ${C_n \approx \Omega_n}$.

There are two sources of error. First we approximate

$\displaystyle H_n(t)=\mathbb{P}(\sqrt{n} (\hat\theta - \theta))$

with

$\displaystyle H_n^*(t)=\mathbb{P}_n(\sqrt{n} (\hat\theta^* - \hat\theta)).$

Essentially, we are replacing ${P}$ with ${P_n}$. Second, we are approximating ${H_n^*(t)}$ with

$\displaystyle \hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}(\hat\theta_j^* - \hat\theta)\leq t\Bigr).$

This second source of error is negligible because we can make ${B}$ as large as we want.

Remark: A moment’s reflection should convince you that drawing a sample of size ${n}$ from ${P_n}$ is the same as drawing ${n}$ points with replacement from the original data. This is how the bootstrap is often described but I think it is clearer to describe it as drawing ${n}$ observations from ${P_n}$.

2. Why Does It Work?

If ${\hat H_n}$ is close to ${H_n}$ then the bootstrap confidence interval will have coverage close to ${1-\alpha}$. Formally, one has to show that

$\displaystyle \sup_t |\hat H_n(t) - H_n(t)| \stackrel{P}{\rightarrow} 0$

in which case

$\displaystyle \mathbb{P}(\theta\in C_n) \rightarrow 1-\alpha$

as ${n\rightarrow\infty}$.

It is non-trivial to show that ${\sup_t |\hat H_n(t) - H_n(t)| \stackrel{P}{\rightarrow} 0}$ but it has been shown in some generality. See Chapter 23 of van der Vaart (1998) for example.

3. Why Does It Fail?

The bootstrap does not always work. It can fail for a variety of reasons such as when the dimension is high or when ${T}$ is poorly behaved.

An example of a bootstrap failure is in the problem of estimating phylogenetic trees. The problem here is that ${T(P)}$ is an extremely complex object and the regularity conditions needed to make the bootstrap work are unlikely to hold.

In fact, this is a general problem with the bootstrap: it is most useful in complex situations, but these are often the situations where the theory breaks down.

4. What Do We Do?

So what do we do when the bootstrap fails? One answer is: subsampling. This is a variant of the bootstrap that works under much weaker conditions than the bootstrap. Interestingly, the theory behind subsampling is much simpler than the theory behind the bootstrap. The former involves little more than a simple concentration inequality while the latter uses high-powered techniques from empirical process theory.

So what is subsampling?

Stay tuned. I will describe it in my next post. In the meantime, I’d be interested to hear about your experiences with the bootstrap. Also, why do you think the bootstrap is not more popular in machine learning?

References

Efron, Bradley. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1-26.

Efron, Bradley, and Tibshirani, R. (1994). An Introduction to the Bootstrap. Chapman and Hall.

Holmes, Susan. (2003). Bootstrapping phylogenetic trees: Theory and methods. Statistical Science, 241-255.

van der Vaart, A. (1996). Asymptotic Statistics. Cambridge.

here

Review of “Antifragile” by Nassim Taleb (a.k.a. Doc Savage)

REVIEW OF: “ANTIFRAGILE” by NASSIM TALEB (a.k.a. Doc Savage)

I have not yet finished reading Antifragile by Nassim Taleb. I am at page 214 of this 519 page book. Isn’t it sacrilege to write a review before finishing the entire book? Normally, yes. But you’ll see why this is an exception.

Taleb is well-known for his previous books such as Fooled By Randomness and The Black Swan.

I read Fooled By Randomness a long time ago and, as best as I can recall I liked it. I think the message was: “all traders except for me are idiots.” The message of The Black Swan was that outliers matter. It may sound trite but it is an important point; he is right that many people use models and then forget that they are just approximations. The Black Swan made a lot of people mad. He gave the impression that all statisticians were idiots and didn’t know about outliers and model violations. Nevertheless, I think he did have interesting things to say.

Antifragile continues the Black Swan theme but the arrogant tone has been taken up a notch.

As with Taleb’s other books, there are interesting ideas here. His main point is this: there is no word in the English language to mean the opposite of fragile. You might think that “resilient” or “robust” is the opposite of fragile but that’s not right. A system is fragile if it is sensitive to errors. A system is resilient if it is insensitive to errors. A system is antifragile if it improves with errors.

To understand antifragility, think of things that lead to improvement by trial-and-error. Evolution is an example. Entrepreneurship is another.

Generally, top-down, bureaucratic things tend to be fragile. Bottom-up, decentralized things tend to be anti-fragile. He refers to meddlers who want to impose centralized — and hence fragile — decision making on people as “fragilistas.” I love that word.

I like his ideas about antifragility. I share his dislike for centralized decision-making, bureaucrats, (as well as his dislike of Paul Krugman and Thomas Friedman). So I really wanted to like this book.

The problem is the tone. The somewhat arrogant tone of his previous books has evolved into a kind of belligerent yelling. The “I am smart and everyone else is an idiot” shtick gets tiresome. Having dinner with your know-it-all uncle is tolerable. But spend too much time with him and you’ll go mad.

The book is full bragging; there are continuous references to his amazing wonderful travels, all the cafe’s he has been in around the world, zillions of references to historical and philosophical texts and a steady stream of his likes and dislikes. He particularly dislikes academics, business schools, and especially Harvard. He often talks about the Harvard-Soviet empire. He got an MBA for Wharton where he credits an un-named professor for teaching him about options. But, of course, the professor did not really understand what he was teaching.

We find out that Taleb hates TV, air-conditioning, sissies, and most economists. He has taken up weightlifting and, using a training technique he learned from “Lenny Cake” he can deadlift 350 pounds. He is now so strong that people mistake him for a bodyguard. You can’t make this stuff up.

I think that Taleb is Doc Savage (the Man of Bronze). For those who don’t know, Doc Savage was the hero in a series of books in the 1960’s. Doc Savage was amazing. He was brilliant and strong. He was a scientist, detective, surgeon, inventor, martial arts expert and had a photographic memory. It was hilarious reading those books when I was young because they were so over the top.

Antifragile is “The Black Swan meets Doc Savage.” This is a real shame because, as I said, there are interesting ideas here. But I doubt many serious readers will be able to stomach the whole book. I read some Amazon reviews and I noticed that they were quite bimodal. Many people seem to believe he is the gigantic genius he claims to be and they love the book. More thoughtful readers are put off by the tone and give negative reviews.

If an editor had forced him to tone it down (and make the writing a little more organized) then I think this book would have been good. But he puts editors in the fragilista category. I can imagine the editor trying to give Taleb some suggestions only to be slapped around by author.

Which is why I decided to write a review before finishing the book. You see, only sissy fragalistas finish a book before reviewing it.

Disclosure: I am an Amazon associate. This means that if you click on any links to Amazon in my posts, then I get credit towards a gift certificate.

TO CONDITION, OR NOT TO CONDITION, THAT IS THE QUESTION

TO CONDITION, OR NOT TO CONDITION, THAT IS THE QUESTION

Between the completely conditional world of Bayesian inference and the completely unconditional world of frequentist inference lies the hazy world of conditional inference.

The extremes are easy. In Bayesian-land you condition on all of the data. In Frequentist-land, you condition on nothing. If your feet are firmly planted in either of these idyllic places, read no further! Because, conditional inference is:

The undiscovered Country, from whose bourn
No Traveller returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.

1. The Extremes

As I said above, the extremes are easy. Let’s start with a concrete example. Let ${Y_1,\ldots, Y_n}$ be a sample from ${P\in {\cal P}}$. Suppose we want to estimate ${\theta = T(P)}$; for example, ${T(P)}$ could be the mean of ${P}$.

Bayesian Approach: Put a prior ${\pi}$ on ${P}$. After observing the data ${Y_1,\ldots, Y_n}$ compute the posterior for ${P}$. This induces a posterior for ${\theta}$ given ${Y_1,\ldots, Y_n}$. We can then make statements like

$\displaystyle \pi( \theta\in A|Y_1,\ldots, Y_n) = 1-\alpha.$

The statements are conditional on ${Y_1,\ldots, Y_n}$. There is no question about what to condition on; we condition on all the data.

Frequentist Approach: Construct a set ${C_n = C(Y_1,\ldots, Y_n)}$. We require that

$\displaystyle \inf_{P\in {\cal P}} P^n \Bigl( T(P)\in C_n \Bigr) \geq 1-\alpha$

where ${P^n = P\times \cdots \times P}$ is the distribution corresponding to taking ${n}$ samples from ${P}$. We the call ${C_n}$ a ${1-\alpha}$ confidence set. No conditioning takes place. (Of course, we might want more than just the guarantee in the above equation, like some sort of optimality; but let’s not worry about that here.)

(I notice that Andrew often says that frequentists “condition on ${\theta}$”. I think he means, they do calculations for each fixed ${P}$. At the risk of being pedantic, this is not conditioning. To condition on ${P}$ requires that ${P}$ be a random variable which it is in the Bayesian framework but it is not a random variable in the frequentist framework. But I am probably just nit picking here.)

2. So Why Condition?

Suppose you are taking the frequentist route. Why would you be enticed to condition? Consider the following example from Berger and Wolpert (1988).

I write down a real number ${\theta}$. I then generate two random variables ${Y_1, Y_2}$ as follows:

$\displaystyle Y_1 = \theta + \epsilon_1,\ \ \ Y_2 = \theta + \epsilon_2$

where ${\epsilon_1}$ and ${\epsilon_2}$ and iid and

$\displaystyle P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}.$

Let ${P_\theta}$ denote the distribution of ${Y_i}$. The set of distributions is ${{\cal P} = \{ P_\theta:\ \theta\in\mathbb{R}\}}$.

I show Fred the frequentist ${Y_1}$ and ${Y_2}$ and he has to infer ${\theta}$. Fred comes up with the following confidence set:

$\displaystyle C(Y_1,Y_1) = \begin{cases} \left\{ \frac{Y_1+Y_2}{2} \right\} & \mbox{if}\ Y_1 \neq Y_2\\ \left\{ Y_1-1 \right\} & \mbox{if}\ Y_1 = Y_2. \end{cases}$

Now, it is easy to check that, no matter what value ${\theta}$ takes, we have that

$\displaystyle P_\theta\Bigl(\theta\in C(Y_1,Y_2)\Bigr) = \frac{3}{4}\ \ \ \mbox{for every}\ \theta\in \mathbb{R}.$

Fred is happy. ${C(Y_1,Y_2)}$ is a 75 percent confidence interval.

To be clear: if I play this game with Fred every day, and I use a different value of ${\theta}$ every day, we will find that Fred traps the true value 75 percent of the time.

Now suppose the data are ${(Y_1,Y_2) = (17,19)}$. Fred reports that his 75 percent confidence interval is ${\{18\}}$. Fred is correct that his procedure has 75 percent coverage. But in this case, many people are troubled by reporting that ${\{18\}}$ is a 75 percent confidence interval. Because with these data, we know that ${\theta}$ must be 18. Indeed, if we did a Bayesian analysis with a prior that puts positive density on each ${\theta}$, he would find that ${\pi(\theta=18|Y_1=17,Y_2=19) = 1}$.

So, we are 100 percent certain that ${\theta = 18}$ and yet we are reporting ${\{18\}}$ as a 75 percent confidence interval.

There is nothing wrong with the confidence interval. It is a procedure, and the procedure comes with a frequency guarantee: it will trap the truth 75 percent of the time. It does not agree with our degrees of belief but no one said it should.

And yet Fred thinks he can retain his frequentist credentials and still do something which intuitively feels better. This is where conditioning comes in.

Let

$\displaystyle A = \begin{cases} 1 & \mbox{if}\ Y_1 \neq Y_2\\ 0 & \mbox{if}\ Y_1 = Y_2. \end{cases}$

The statistic ${A}$ is an ancillary: it has a distribution that does not depend on ${\theta}$. In particular, ${P_\theta(A=1) =P_\theta(A=0) =1/2}$ for every ${\theta}$. The idea now is to report confidence, conditional on ${A}$. Our new procedure is:

If ${A=1}$ report ${C=\{ (Y_1 + Y_2)/2\}}$ with confidence level 1.
If ${A=0}$ report ${C=\{ (Y_1-1\}}$ with confidence level 1/2.

This is indeed a valid conditional confidence interval. Again, imagine we play the game over a long sequence of trials. On the subsequence for which ${A=1}$, our interval contains the true value 100 percent of the time. On the subsequence for which ${A=0}$, our interval contains the true value 50 percent of the time.

We still have valid coverage and a more intuitive confidence interval. Our result is identical the Bayesian answer if the Bayesian uses a flat prior. It is nearly equal to the Bayesian answer if the Bayesian uses a proper but very flat prior.

(This is an example where the Bayesian has the upper hand. I’ve had other examples on this blog where the frequentist does better than the Bayesian. To readers who attach themselves to either camp: remember, there is plenty of ammunition in terms of counterexamples on BOTH sides.)

Another famous example is from Cox (1958). Here is a modified version of that example. I flip a coin. If the coin is HEADS I give Fred ${Y \sim N(\theta,\sigma_1^2)}$. If the coin is TAILS I give Fred ${Y \sim N(\theta,\sigma^2)}$ where ${\sigma_1^2 > \sigma_2^2}$. What should Fred’s confidence interval for ${\theta}$ be?

We can condition on the coin, and report the usual confidence interval corresponding to the appropriate Normal distribution. But if we look unconditionally, over replications of the whole experiment, and minimize the expected length of the interval, you get an interval that has coverage less than ${1-\alpha}$ for HEADS and greater than ${1-\alpha}$ for TAILS. So optimizing unconditionally pulls us away from what seems to be the correct conditional answer.

3. The Problem With Conditioning

There are lots of simple examples like the ones above where, psychologically, it just feels right to condition on something. But simple intuition is misleading. We would still be using Newtonian physics if we went by our gut feelings.

In complex situations, it is far from obvious if we should condition or what we should condition on. Let me review a simplified version of Larry Brown’s (1990) example that I discussed here. You observe
${(X_1,Y_1), \ldots, (X_n,Y_n)}$ where

$\displaystyle Y_i = \beta^T X_i + \epsilon_i,$

${\epsilon_i \sim N(0,1)}$, ${n=100}$ and each ${X_i = (X_{i1},\ldots, X_{id})}$ is a vector of length ${d=100,000}$. Suppose further that the ${d}$ covariates are independent. We want to estimate ${\beta_1}$.

The “best” estimator (the maximum likelihood estimator) is obtained by conditioning on all the data. This means we should estimate the vector ${\beta}$ by least squares. But, the least squares estimator is useless when ${d> n}$.

From the Bayesian point of view we compute we compute the posterior

$\displaystyle \pi\Bigl(\beta_1 \Bigm| (X_1,Y_1),\ldots, (X_n,Y_n)\Bigr)$

which, for such a large ${d}$, will be useless (completely dominated by the prior).

These estimators have terrible behavior compared to the following “anti-conditioning” estimator. Throw away all the covariates except the first one. Now do linear regression using only ${Y}$ and the first covariate. The resulting estimator ${\hat\beta_1}$ is then tightly concentrated around ${\beta_1}$ with high probability. In this example, throwing away data is much better than conditioning on the data. There are some papers on “forgetful Bayesian inference” where one conditions on only part of the data. This is fine but then we are back the the original question: what do we condition on?

There are many other example such as this one.

It would be nice if there was a clear answer such as “you should always condition” or “you should never condition.” But there isn’t. Do a Google Scholar search on conditional inference and you will find an enormous literature. What started as a simple, compelling idea evolved into a complex research area. Much of these conditional methods are very sophisticated and rely on second order asymptotics. But it is rare to see anyone use conditional inference in complex problems, with the exception of Bayesian inference which some will argue goes for a definite, psychologically satisfying answer at the expense of thinking hard about the properties of the resulting procedures.

Unconditional inference is simple and avoids disasters. The cost is that we can sometimes get psychologically unsatisfying answers. Conditional inference yields more psychologically satisfying answers but can lead to procedures with disastrous behavior.

There is no substitute for thinking. Be skeptical of easy answers.

Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o’er, with the pale cast of Thought,

References

Berger, J.O. and Wolpert, R.L. (1988). The likelihood principle, IMS.

Brown, L. D. (1990). An Ancillarity Paradox Which Appears in Multiple Linear Regression. Ann. Statist. 18, 471-493. link to paper.

Cox, D.R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29, 357-372.