## TO CONDITION, OR NOT TO CONDITION, THAT IS THE QUESTION

Between the completely conditional world of Bayesian inference and the completely unconditional world of frequentist inference lies the hazy world of conditional inference.

The extremes are easy. In Bayesian-land, you condition on all of the data. In Frequentist-land, you condition on nothing. If your feet are firmly planted in either of these idyllic places, read no further! Because conditional inference is:

The undiscovered Country, from whose bourn
No Traveller returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.

1. The Extremes

As I said above, the extremes are easy. Let’s start with a concrete example. Let ${Y_1,\ldots, Y_n}$ be a sample from ${P\in {\cal P}}$. Suppose we want to estimate ${\theta = T(P)}$; for example, ${T(P)}$ could be the mean of ${P}$.

Bayesian Approach: Put a prior ${\pi}$ on ${P}$. After observing the data ${Y_1,\ldots, Y_n}$ compute the posterior for ${P}$. This induces a posterior for ${\theta}$ given ${Y_1,\ldots, Y_n}$. We can then make statements like

$\displaystyle \pi( \theta\in A|Y_1,\ldots, Y_n) = 1-\alpha.$

The statements are conditional on ${Y_1,\ldots, Y_n}$. There is no question about what to condition on; we condition on all the data.

Frequentist Approach: Construct a set ${C_n = C(Y_1,\ldots, Y_n)}$. We require that

$\displaystyle \inf_{P\in {\cal P}} P^n \Bigl( T(P)\in C_n \Bigr) \geq 1-\alpha$

where ${P^n = P\times \cdots \times P}$ is the distribution corresponding to taking ${n}$ samples from ${P}$. We then call ${C_n}$ a ${1-\alpha}$ confidence set. No conditioning takes place. (Of course, we might want more than just the guarantee in the above equation, like some sort of optimality; but let’s not worry about that here.)

(I notice that Andrew often says that frequentists “condition on ${\theta}$”. I think he means that they do calculations for each fixed ${P}$. At the risk of being pedantic, this is not conditioning. To condition on ${P}$ requires that ${P}$ be a random variable, which it is in the Bayesian framework but not in the frequentist framework. But I am probably just nitpicking here.)

2. So Why Condition?

Suppose you are taking the frequentist route. Why would you be enticed to condition? Consider the following example from Berger and Wolpert (1988).

I write down a real number ${\theta}$. I then generate two random variables ${Y_1, Y_2}$ as follows:

$\displaystyle Y_1 = \theta + \epsilon_1,\ \ \ Y_2 = \theta + \epsilon_2$

where ${\epsilon_1}$ and ${\epsilon_2}$ are iid and

$\displaystyle P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}.$

Let ${P_\theta}$ denote the distribution of ${Y_i}$. The set of distributions is ${{\cal P} = \{ P_\theta:\ \theta\in\mathbb{R}\}}$.

I show Fred the frequentist ${Y_1}$ and ${Y_2}$ and he has to infer ${\theta}$. Fred comes up with the following confidence set:

$\displaystyle C(Y_1,Y_2) = \begin{cases} \left\{ \frac{Y_1+Y_2}{2} \right\} & \mbox{if}\ Y_1 \neq Y_2\\ \left\{ Y_1-1 \right\} & \mbox{if}\ Y_1 = Y_2. \end{cases}$

Now, it is easy to check that, no matter what value ${\theta}$ takes, we have that

$\displaystyle P_\theta\Bigl(\theta\in C(Y_1,Y_2)\Bigr) = \frac{3}{4}\ \ \ \mbox{for every}\ \theta\in \mathbb{R}.$

Fred is happy. ${C(Y_1,Y_2)}$ is a 75 percent confidence interval.

To be clear: if I play this game with Fred every day, and I use a different value of ${\theta}$ every day, we will find that Fred traps the true value 75 percent of the time.
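The 75 percent figure can be checked by brute force. The sketch below enumerates the four equally likely sign pairs ${(\epsilon_1,\epsilon_2)}$ and counts how often Fred's set contains ${\theta}$. The function names are mine, and I restrict ${\theta}$ to integers only so that the arithmetic is exact.

```python
from fractions import Fraction
from itertools import product


def fred_set(y1, y2):
    """Fred's confidence set: the midpoint if the observations differ, else y1 - 1."""
    if y1 != y2:
        return {Fraction(y1 + y2, 2)}
    return {y1 - 1}


def coverage(theta):
    """Exact coverage at theta: each of the four (eps1, eps2) pairs has probability 1/4."""
    hits = sum(
        theta in fred_set(theta + e1, theta + e2)
        for e1, e2 in product((1, -1), repeat=2)
    )
    return Fraction(hits, 4)


# Integer theta values chosen arbitrarily for illustration.
for theta in (-3, 0, 18):
    print(theta, coverage(theta))
```

The only losing case is ${\epsilon_1=\epsilon_2=-1}$, where Fred reports ${\{Y_1-1\} = \{\theta-2\}}$.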

Now suppose the data are ${(Y_1,Y_2) = (17,19)}$. Fred reports that his 75 percent confidence interval is ${\{18\}}$. Fred is correct that his procedure has 75 percent coverage. But in this case, many people are troubled by reporting that ${\{18\}}$ is a 75 percent confidence interval, because with these data we know that ${\theta}$ must be 18. Indeed, if we did a Bayesian analysis with a prior that puts positive density on each ${\theta}$, we would find that ${\pi(\theta=18|Y_1=17,Y_2=19) = 1}$.

So, we are 100 percent certain that ${\theta = 18}$ and yet we are reporting ${\{18\}}$ as a 75 percent confidence interval.

There is nothing wrong with the confidence interval. It is a procedure, and the procedure comes with a frequency guarantee: it will trap the truth 75 percent of the time. It does not agree with our degrees of belief but no one said it should.

And yet Fred thinks he can retain his frequentist credentials and still do something which intuitively feels better. This is where conditioning comes in.

Let

$\displaystyle A = \begin{cases} 1 & \mbox{if}\ Y_1 \neq Y_2\\ 0 & \mbox{if}\ Y_1 = Y_2. \end{cases}$

The statistic ${A}$ is an ancillary: it has a distribution that does not depend on ${\theta}$. In particular, ${P_\theta(A=1) =P_\theta(A=0) =1/2}$ for every ${\theta}$. The idea now is to report confidence, conditional on ${A}$. Our new procedure is:

If ${A=1}$ report ${C=\{ (Y_1 + Y_2)/2\}}$ with confidence level 1.
If ${A=0}$ report ${C=\{ Y_1-1\}}$ with confidence level 1/2.

This is indeed a valid conditional confidence interval. Again, imagine we play the game over a long sequence of trials. On the subsequence for which ${A=1}$, our interval contains the true value 100 percent of the time. On the subsequence for which ${A=0}$, our interval contains the true value 50 percent of the time.
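To see how the conditional and unconditional guarantees fit together, here is the arithmetic, using only the facts stated above. When ${A=1}$ the two observations are ${\theta-1}$ and ${\theta+1}$, so their midpoint is exactly ${\theta}$. When ${A=0}$ we have ${Y_1=Y_2=\theta+\epsilon_1}$, so ${Y_1-1=\theta}$ exactly when ${\epsilon_1=1}$, which has probability 1/2. Averaging over ${A}$ recovers Fred's original 75 percent:

$\displaystyle P_\theta\Bigl(\theta\in C\Bigr) = P_\theta(A=1)\cdot 1 + P_\theta(A=0)\cdot \frac{1}{2} = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}.$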

We still have valid coverage and a more intuitive confidence interval. Our result is identical to the Bayesian answer if the Bayesian uses a flat prior. It is nearly equal to the Bayesian answer if the Bayesian uses a proper but very flat prior.
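Here is a minimal sketch of the Bayesian side of that comparison. As an assumption of mine, the flat prior is replaced by a uniform prior on a finite grid of integers (`grid` is a hypothetical stand-in, and the function names are not from any library), which is enough to show where the posterior mass lands.

```python
from fractions import Fraction


def likelihood(theta, y1, y2):
    """P(Y1=y1, Y2=y2 | theta): each observation is theta+1 or theta-1 with probability 1/2."""
    consistent = all(abs(y - theta) == 1 for y in (y1, y2))
    return Fraction(1, 4) if consistent else Fraction(0)


def posterior(y1, y2, grid):
    """Posterior over a finite grid of candidate theta values under a uniform prior."""
    weights = {theta: likelihood(theta, y1, y2) for theta in grid}
    total = sum(weights.values())
    return {theta: w / total for theta, w in weights.items() if w > 0}


grid = range(0, 41)  # hypothetical finite stand-in for a flat prior
print(posterior(17, 19, grid))  # all mass on 18
print(posterior(17, 17, grid))  # mass 1/2 each on 16 and 18
```

With ${(17,17)}$ the posterior puts probability 1/2 on ${Y_1-1=16}$, matching the conditional confidence level reported when ${A=0}$.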

(This is an example where the Bayesian has the upper hand. I’ve had other examples on this blog where the frequentist does better than the Bayesian. To readers who attach themselves to either camp: remember, there is plenty of ammunition in terms of counterexamples on BOTH sides.)

Another famous example is from Cox (1958). Here is a modified version of that example. I flip a coin. If the coin is HEADS I give Fred ${Y \sim N(\theta,\sigma_1^2)}$. If the coin is TAILS I give Fred ${Y \sim N(\theta,\sigma_2^2)}$ where ${\sigma_1^2 > \sigma_2^2}$. What should Fred’s confidence interval for ${\theta}$ be?

We can condition on the coin, and report the usual confidence interval corresponding to the appropriate Normal distribution. But if we look unconditionally, over replications of the whole experiment, and minimize the expected length of the interval, we get an interval that has coverage less than ${1-\alpha}$ for HEADS and greater than ${1-\alpha}$ for TAILS. So optimizing unconditionally pulls us away from what seems to be the correct conditional answer.
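A numerical sketch of the unconditional optimum, under assumptions of my own: the coin is fair, the intervals are symmetric of the form ${Y \pm c_1}$ for HEADS and ${Y \pm c_2}$ for TAILS, and the values ${\sigma_1=2}$, ${\sigma_2=1}$, ${\alpha=0.05}$ are arbitrary. Minimizing ${c_1+c_2}$ subject to average coverage ${1-\alpha}$ gives a Lagrange condition saying the two Normal densities must have the same height at the interval endpoints, so the code bisects on that common height.

```python
import math


def normal_cdf(z):
    """Standard Normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def half_width(height, sigma):
    """Half-width c where the N(0, sigma^2) density falls to `height` (0 if above the peak)."""
    peak = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    if height >= peak:
        return 0.0
    return sigma * math.sqrt(-2.0 * math.log(height / peak))


def coverage(c, sigma):
    """Coverage of the interval Y +/- c when Y ~ N(theta, sigma^2)."""
    return 2.0 * normal_cdf(c / sigma) - 1.0


def unconditional_optimum(sigma1, sigma2, alpha):
    """Bisect on the common endpoint density height until average coverage is 1 - alpha."""
    lo, hi = 0.0, 1.0 / (min(sigma1, sigma2) * math.sqrt(2.0 * math.pi))
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        c1, c2 = half_width(mid, sigma1), half_width(mid, sigma2)
        average = 0.5 * coverage(c1, sigma1) + 0.5 * coverage(c2, sigma2)
        if average > 1.0 - alpha:
            lo = mid  # intervals too wide: raise the density cutoff
        else:
            hi = mid
    return c1, c2, coverage(c1, sigma1), coverage(c2, sigma2)


# Illustrative values only: sigma1 = 2 (HEADS), sigma2 = 1 (TAILS), alpha = 0.05.
c1, c2, cover_heads, cover_tails = unconditional_optimum(2.0, 1.0, 0.05)
print("half-widths:", c1, c2)
print("coverage given HEADS:", cover_heads, "given TAILS:", cover_tails)
```

The HEADS coverage comes out below 0.95 and the TAILS coverage above it, while the total ${c_1+c_2}$ is smaller than for the two conditional 95 percent intervals.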

3. The Problem With Conditioning

There are lots of simple examples like the ones above where, psychologically, it just feels right to condition on something. But simple intuition is misleading. We would still be using Newtonian physics if we went by our gut feelings.

In complex situations, it is far from obvious if we should condition or what we should condition on. Let me review a simplified version of Larry Brown’s (1990) example that I discussed here. You observe ${(X_1,Y_1), \ldots, (X_n,Y_n)}$ where

$\displaystyle Y_i = \beta^T X_i + \epsilon_i,$

${\epsilon_i \sim N(0,1)}$, ${n=100}$ and each ${X_i = (X_{i1},\ldots, X_{id})}$ is a vector of length ${d=100,000}$. Suppose further that the ${d}$ covariates are independent. We want to estimate ${\beta_1}$.

The “best” estimator (the maximum likelihood estimator) is obtained by conditioning on all the data. This means we should estimate the vector ${\beta}$ by least squares. But, the least squares estimator is useless when ${d> n}$.

From the Bayesian point of view we compute the posterior

$\displaystyle \pi\Bigl(\beta_1 \Bigm| (X_1,Y_1),\ldots, (X_n,Y_n)\Bigr)$

which, for such a large ${d}$, will be useless (completely dominated by the prior).

These estimators have terrible behavior compared to the following “anti-conditioning” estimator. Throw away all the covariates except the first one. Now do linear regression using only ${Y}$ and the first covariate. The resulting estimator ${\hat\beta_1}$ is then tightly concentrated around ${\beta_1}$ with high probability. In this example, throwing away data is much better than conditioning on the data. There are some papers on “forgetful Bayesian inference” where one conditions on only part of the data. This is fine but then we are back to the original question: what do we condition on?
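A small simulation sketch of the throw-away-the-covariates estimator. Everything specific here is an assumption of mine, not part of Brown's example: the covariates are standard Normal, ${\beta_1 = 2}$, the remaining coefficients are equal with total squared size 1 (so the discarded covariates add only bounded noise), and ${d=1000}$ rather than 100,000 so that it runs quickly in plain Python. The regression has no intercept, matching the model above.

```python
import math
import random


def simulate_marginal_estimate(n=100, d=1000, beta1=2.0, seed=0):
    """Regress Y on the first covariate alone (no intercept) and return the slope."""
    rng = random.Random(seed)
    # Hypothetical coefficients: beta_1 is the target, the rest have total squared size 1.
    other = 1.0 / math.sqrt(d - 1)
    sxy = 0.0
    sxx = 0.0
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        y = beta1 * x[0] + other * sum(x[1:]) + rng.gauss(0.0, 1.0)
        sxy += x[0] * y
        sxx += x[0] * x[0]
    return sxy / sxx


print(simulate_marginal_estimate())
```

Under these assumptions the ignored covariates act as extra noise with variance 1, so the slope has standard error of roughly ${\sqrt{2/n}}$ even though ${d > n}$.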

There are many other examples such as this one.

It would be nice if there were a clear answer such as “you should always condition” or “you should never condition.” But there isn’t. Do a Google Scholar search on conditional inference and you will find an enormous literature. What started as a simple, compelling idea evolved into a complex research area. Many of these conditional methods are very sophisticated and rely on second-order asymptotics. But it is rare to see anyone use conditional inference in complex problems, with the exception of Bayesian inference, which some will argue goes for a definite, psychologically satisfying answer at the expense of thinking hard about the properties of the resulting procedures.

Unconditional inference is simple and avoids disasters. The cost is that we can sometimes get psychologically unsatisfying answers. Conditional inference yields more psychologically satisfying answers but can lead to procedures with disastrous behavior.

There is no substitute for thinking. Be skeptical of easy answers.

Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o’er, with the pale cast of Thought,

References

Berger, J.O. and Wolpert, R.L. (1988). The likelihood principle, IMS.

Brown, L. D. (1990). An Ancillarity Paradox Which Appears in Multiple Linear Regression. Ann. Statist. 18, 471-493.

Cox, D.R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29, 357-372.

1. oz
Posted January 6, 2013 at 2:19 pm

Thanx for the nice post. Two comments:
1. Cox example seems different from the first example by Berger and Wolpert.
In the first example, you see A, so you can condition on it. In the second example, you don’t observe the event you need to condition on. Or did you mean that Fred tells you the result of the coin flip? (in that case it seems obvious, at least to me, that you should condition on the result).

2. In the second case, the ML estimator is known to be ‘only’ asymptotically optimal, but since d>n is so far from the asymptotic regime, no wonder that another estimator would perform better.
I’m a bit more confused about the Bayesian estimator. It seems to me that the issue is not Bayesian vs. not Bayesian, but whether to look at one variable or all. You could put a prior only on beta_1 and get a Bayesian estimator which will converge rapidly to the true value.
Also, suppose that the set of beta is indeed generated from the prior which you assume. In this case, will the simple estimator still beat the Bayesian estimator (the latter should be optimal in this case, no?)

• Posted January 6, 2013 at 2:54 pm

In the Cox example you do see the coin flip.

In the regression case, you refer to drawing beta from a prior. The beta is not drawn from any prior. Note that if you do a Bayes analysis with a flat prior you get the least squares estimator.

2. Posted January 6, 2013 at 2:24 pm

That’s why some kind of long run coverage alone may not suffice to answer questions of relevance in particular cases. In the kind of example of your #2, the statistic is incomplete. In Cox and Mayo (2010), we try to identify a rationale: http://www.phil.vt.edu/dmayo/personal_website/ch%207%20cox%20&%20mayo.pdf
Aris Spanos has a different treatment. Dashing, so I may be missing something.

3. Jonathan Rosenblatt
Posted January 7, 2013 at 3:13 am

Another example would be the case of CIs after selection (say, using a testing approach).

4. Danny Runold
Posted January 24, 2013 at 3:38 am

This is more of a question for which I hope you could write a post. If you have a population prior, with an infinite number of individuals in the population, how can a few new measurements via Bayes’ rule change such a reliable prior? Or are frequentists wrong in saying there is such population information? I studied frequentist (test) statistics in psychology for 4 years and Bayesian statistics in AI for 5, but this is a practicality I can’t really get my head around. I know that in the end Bayesian and frequentist statistics reconcile about this but it would be nice to get some insights from an expert, if you know what I mean and if you’re interested of course, thanks.

• Posted January 24, 2013 at 8:26 am

I am not sure I understand your question.
Can you give a bit more detail?

5. C T
Posted January 31, 2013 at 6:06 pm

The confidence set should be C(Y_1,Y_2), na?