Double Misunderstandings About p-values

It’s been said a million times and in a million places that a p-value is not the probability of {H_0} given the data.

But there is a different type of confusion about p-values. This issue arose in a discussion on Andrew’s blog.

Andrew criticizes the New York Times for giving a poor description of the meaning of p-values. Of course, I agree with him that being precise about these things is important. But, in reading the comments on Andrew’s blog, it occurred to me that there is often a double misunderstanding.

First, let me say that I am neither defending nor criticizing p-values in this post. I am just going to point out that there are really two misunderstandings floating around.

Two Misunderstandings

(1) The p-value is not the probability of {H_0} given the data.

(2) But neither is the p-value the probability of something conditional on {H_0}.

Deborah Mayo pointed this fact out in the discussion on Andrew’s blog (as did a few other people).

When we use p-values we are in frequentist-land. {H_0} (the null hypothesis) is not a random variable. It makes no sense to talk about the posterior probability of {H_0}. But it also makes no sense to talk about conditioning on {H_0}. You can only condition on things that were random in the first place.

Let me get more specific. Let {Z} be a test statistic and let {z} be the realized value of {Z}. The p-value (in a two-sided test) is

\displaystyle  p = P_0(|Z| > |z|)

where {P_0} is the null distribution. It should not be written as {P\bigl(|Z|> |z| \, \bigm| \,H_0\bigr)}; that expression makes no sense because {H_0} is not a random variable. In case the null consists of a set {{\cal P}_0} of distributions, the p-value is

\displaystyle  p = \sup_{P\in {\cal P}_0}P(|Z| > |z|).
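
When the null is simple, the supremum is over a single distribution and the computation is an ordinary tail probability under {P_0}. Here is a minimal sketch in Python (my own illustration, not from the post, with a made-up value of {z} and assuming the null fixes the distribution of {Z} to be N(0,1)):

from scipy.stats import norm

z = 2.1                     # realized value of the test statistic Z (made up)
p = 2 * norm.sf(abs(z))     # P_0(|Z| > |z|) when the null distribution is N(0,1)
print(p)                    # roughly 0.036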

You could accuse me of being pedantic here or of being obsessed with notation. But given the amount of confusion about p-values, I think it is important to get it right.

More Misunderstandings

The same problem occurs when people write {p(x|\theta)}. When I teach Bayes, I do write the model as {p(x|\theta)}. When I teach frequentist statistics, I write this either as {p(x;\theta)} or {p_\theta(x)}. There is no conditioning going on. To condition on {\theta} would require a joint distribution for {(x,\theta)}. There is no such joint distribution in frequentist-land.

The coverage of a confidence interval {C(X_1,\ldots, X_n)} is not the probability that {C(X_1,\ldots, X_n)} traps {\theta} conditional on {\theta}. The frequentist coverage is

\displaystyle  {\rm Coverage} = \inf_{\theta} P_\theta\Bigl(\theta\in C(X_1,\ldots, X_n)\Bigr).

Again, there is no conditioning going on.
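
To make the “no conditioning” point concrete, here is a small simulation (my own sketch, not from the post): {\theta} is held fixed in each run, only the data are redrawn under {P_\theta}, and coverage is the long-run frequency with which the interval traps {\theta}. The Normal-mean interval with known variance is an illustrative choice:

import numpy as np

rng = np.random.default_rng(0)
n, reps, z = 50, 10_000, 1.96            # 1.96 is the Normal 0.975 quantile
for theta in [-3.0, 0.0, 10.0]:          # theta is fixed; only the data are random
    hits = 0
    for _ in range(reps):
        xbar = rng.normal(theta, 1.0, n).mean()   # data drawn under P_theta
        hits += (xbar - z / np.sqrt(n) <= theta <= xbar + z / np.sqrt(n))
    print(theta, hits / reps)            # close to 0.95 for every fixed theta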

Conclusion

I understand that people often say “conditional on {\theta}” to mean “treating {\theta} as fixed.” But if we want to eradicate misunderstandings about statistics, I think it would help if we were more careful about how we choose our words.

The Gap

No, not the store. I am referring to the gap between the invention of a method and the theoretical justification for that method.

It is hard to quantify the gap because, for any method, there is always dispute about who invented (discovered?) it, and who nailed the theory. For example, maximum likelihood is usually credited to Ronald Fisher but one could argue that it was really invented by Gauss or Laplace. And who provided the theory for maximum likelihood? Well, most people would say Fisher. But you could also argue that Le Cam was the one who really made it rigorous.

So, with these caveats in mind, here are a few examples of the gap. I am interested to hear your examples (as well as your opinions about my examples).

METHOD                                    | INVENTOR                    | THEORETICAL JUSTIFICATION
Maximum Likelihood                        | Fisher 1912                 | Fisher 1922
The Bootstrap                             | Efron 1979                  | Gine and Zinn 1990
The Lasso                                 | Tibshirani 1996             | Greenshtein and Ritov 2004
False Discovery Rate                      | Schweder and Spjøtvoll 1982 | Benjamini and Hochberg 1995
Wavelet Denoising                         | Haar 1905                   | Donoho and Johnstone 1998
Cross-validation                          | Stone 1974                  | Györfi, Kohler, Krzyzak and Walk 2002
Causal Inference (Counterfactual Version) | Rubin 1974                  | Rubin 1974, Robins 1989
Causal Inference (Graphical Version)      | Wright 1934                 | Spirtes, Glymour, Scheines 1987 and Pearl and Verma 1991

Every item in the above list is debatable. For example, I consider Gine and Zinn (1990) the first definitive result that clarifies necessary and sufficient conditions to justify the bootstrap. But one could well argue that the earlier papers of Bickel and Freedman (1981) and Singh (1981) deserve that label.

What items do you disagree with?

What items would you add?

Partial List of References

Aldrich, John. (1997). R.A. Fisher and the making of maximum likelihood 1912-1922. Statistical Science, 12, 162-176.

Donoho, David L. and Johnstone, Iain M. (1998). Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26, 879-921.

Efron, Bradley. (1979). Bootstrap methods: another look at the jackknife. The Annals of Statistics, 7, 1-26.

Gine, Evarist and Zinn, Joel. (1990). Bootstrapping general empirical measures. The Annals of Probability, 851–869.

Greenshtein, Eitan and Ritov, Ya'acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Györfi, László and Kohler, Michael and Krzyzak, Adam and Walk, Harro. (2002). A Distribution-Free Theory of Nonparametric Regression, Springer.

Tibshirani, Robert. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 267–288.

P.S. Thanks to Ryan Tibshirani and David Choi for discussions on this topic.

The Other B-Word

Christian has a fun post about the rise of the B-word (Bayesian). “Bayesian” kills “frequentist.”

Well, how about the other B-word, “Bootstrap.” Look at this Google-trends plot:

[Google Trends plot: “Bayesian” vs. “bootstrap”]

The bootstrap demolishes Bayes!

Actually, Christian’s post was tongue-in-cheek. As he points out, “frequentist” is … not a qualification used by frequentists to describe their methods. In other words (!), “frequentist” does not occur very often in frequentist papers.

But all joking aside, that does raise an interesting question. Why do Bayesians put the word “Bayesian” in the title of their papers? For example, you might see a paper with a title like

“A Bayesian Analysis of Respiratory Diseases in Children”

but you would be unlikely to see a paper with a title like

“A Frequentist Analysis of Respiratory Diseases in Children.”

In fact, I think you are doing a disservice to Bayesian inference if you include “Bayesian” in the title. Allow me to explain.

The great Bayesian statistician Dennis Lindley argued strongly against creating a Bayesian journal. He argued that if Bayesian inference is to be successful and become part of the mainstream of statistics, then it should not be treated as novel. Having a Bayesian journal comes across as defensive. Be bold and publish your papers in our best journals, he argued. In other words, if you really believe in the power of Bayesian statistics, then remove the word Bayesian and just think of it as statistics.

I think the same argument applies to paper titles. If you think Bayesian inference is the right way to analyze respiratory diseases in children, then write a paper entitled:

“A Statistical Analysis of Respiratory Diseases in Children.”

Qualifying the title with the word “Bayesian” suggests that there is something novel or weird about using Bayes. If you believe in Bayes, have the courage to leave it out of the title.

Rise of the Machines

The Committee of Presidents of Statistical Societies (COPSS) is celebrating its 50th Anniversary. They have decided to publish a collection and I was honored to be invited to contribute. The theme of the book is Past, Present and Future of Statistical Science.

My paper, entitled Rise of the Machines, can be found here.

To whet your appetite, here is the beginning of the paper.

RISE OF THE MACHINES
Larry Wasserman

On the 50th anniversary of the Committee of Presidents of Statistical Societies I reflect on the rise of the field of Machine Learning and what it means for Statistics. Machine Learning offers a plethora of new research areas, new application areas and new colleagues to work with. Our students now compete with Machine Learning students for jobs. I am optimistic that visionary Statistics departments will embrace this emerging field; those that ignore or eschew Machine Learning do so at their own risk and may find themselves in the rubble of an outdated, antiquated field.

1. Introduction

Statistics is the science of learning from data. Machine Learning (ML) is the science of learning from data. These fields are identical in intent although they differ in their history, conventions, emphasis and culture.

There is no denying the success and importance of the field of Statistics for science and, more generally, for society. I’m proud to be a part of the field. The focus of this essay is on one challenge (and opportunity) to our field: the rise of Machine Learning.

During my twenty-five year career I have seen Machine Learning evolve from a collection of rather primitive (yet clever) methods for classification into a sophisticated science that is rich in theory and applications.

A quick glance at The Journal of Machine Learning Research (\url{jmlr.csail.mit.edu}) and NIPS (\url{books.nips.cc}) reveals papers on a variety of topics that will be familiar to Statisticians such as: conditional likelihood, sequential design, reproducing kernel Hilbert spaces, clustering, bioinformatics, minimax theory, sparse regression, estimating large covariance matrices, model selection, density estimation, graphical models, wavelets, nonparametric regression. These could just as well be papers in our flagship statistics journals.

This sampling of topics should make it clear that researchers in Machine Learning — who were at one time somewhat unaware of mainstream statistical methods and theory — are now not only aware of, but actively engaged in, cutting edge research on these topics.

On the other hand, there are statistical topics that are active areas of research in Machine Learning but are virtually ignored in Statistics. To avoid becoming irrelevant, we Statisticians need to (i) stay current on research areas in ML, (ii) change our outdated model for disseminating knowledge, and (iii) revamp our graduate programs.

The rest of the paper can be found here.

Statistics Declares War on Machine Learning!

Well, I hope the dramatic title caught your attention. Now I can get to the real topic of the post, which is: finite sample bounds versus asymptotic approximations.

In my last post I discussed Normal limiting approximations. One commenter, Csaba Szepesvari, wrote the following interesting comment:

What still surprises me about statistics or the way statisticians do their business is the following: The Berry-Esseen theorem says that a confidence interval chosen based on the CLT is possibly shorter by a good amount of {c/\sqrt{n}}. Despite this, statisticians keep telling me that they prefer their “shorter” CLT-based confidence intervals to ones derived by using finite-sample tail inequalities that we, “machine learning people prefer” (lies vs. honesty?). I could never understood the logic behind this reasoning and I am wondering if I am missing something. One possible answer is that the Berry-Esseen result could be oftentimes loose. Indeed, if the summands are normally distributed, the difference will be zero. Thus, an interesting question is the study of the behavior of the worst-case deviation under assumptions (or: for what class of distributions is the Berry-Esseen result tight?). It would be interesting to read about this, or in general, why statisticians prefer to possibly make big mistakes to saying “I don’t know” more often.

I completely agree with Csaba (if I may call you by your first name).

A good example is the binomial. There are simple confidence intervals for the parameter of a binomial that have correct, finite sample coverage. These finite sample confidence sets {C_n} have the property that

\displaystyle  P(\theta\in C_n) \geq 1-\alpha

for every sample size {n}. There is no reason to use the Normal approximation.
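
For concreteness, here is a minimal sketch (my own, using the Clopper-Pearson construction, which is one standard way to get such finite sample intervals):

from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    # Exact (conservative) 1 - alpha interval for a binomial proportion:
    # coverage is at least 1 - alpha for every n and every parameter value.
    lo = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lo, hi

print(clopper_pearson(k=7, n=20))   # roughly (0.15, 0.59)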

And yet …

I want to put on my Statistics hat and remove my Machine Learning hat and defend my fellow statisticians for using Normal approximations.

Open up a random issue of Science or Nature. There are papers on physics, biology, chemistry and so on, filled with results that look like, for example:

{114.21 \pm 3.68}

These are usually point estimates with standard errors based on asymptotic Normal approximations.

The fact is, science would come to a screeching halt without Normal approximations. The reason is that finite sample intervals are available in only a tiny fraction of statistical problems. I think that Machine Learners (remember, I am wearing my Statistics hat today) have a skewed view of the universe of statistical problems. The certitude of finite sample intervals is reassuring but it rules out the vast majority of statistics as applied to real scientific problems.

Here is a random list, off the top of my head, of problems where all we have are confidence intervals based on asymptotic Normal approximations:

(Most) maximum likelihood estimators, robust estimators, time series, nonlinear regression, generalized linear models, survival analysis (Cox proportional hazards model), partial correlations, longitudinal models, semiparametric inference, confidence bands for function estimators, population genetics, etc.

Biostatistics and cancer research would not exist without confidence intervals based on Normal approximations. In fact, I would guess that every medication or treatment you have ever had was based on medical research that used these approximations.

When we use approximate confidence intervals, we always proceed knowing that they are approximations. In fact, the entire scientific process is based on numerous, hard to quantify, approximations.

Even finite sample intervals are only approximations. They assume that the data are iid draws from a distribution, which is a convenient fiction. It is rarely exactly true.

Let me repeat: I prefer to use finite sample intervals when they are available. And I agree with Csaba that it would be interesting to state more clearly the conditions under which we can tighten the Berry-Esseen bounds. But approximate confidence intervals are a successful, indispensable part of science.

P.S. We’re comparing two frequentist ideas here: finite sample confidence intervals versus asymptotic confidence intervals. Die-hard Bayesians will say: Bayes is always a finite sample method. Frequentists will reply: no, you’re just sweeping the complications under the rug. Let’s refrain from the Bayes vs Frequentist arguments for this post. The question being discussed is: given that you are doing frequentist inference, what are your thoughts on finite sample versus asymptotic intervals?

How Close Is The Normal Distribution?

One of the first things you learn in probability is that the average {\overline{X}_n} has a distribution that is approximately Normal. More precisely, if {X_1,\ldots, X_n} are iid with mean {\mu} and variance {\sigma^2} then

\displaystyle  Z_n \rightsquigarrow N(0,1)

where

\displaystyle  Z_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}

and {\rightsquigarrow} means “convergence in distribution.”

1. How Close?

But how close is the distribution of {Z_n} to the Normal? The usual answer is given by the Berry-Esseen theorem which says that

\displaystyle  \sup_t |P(Z_n \leq t) - \Phi(t)| \leq \frac{0.4784 \,\beta_3}{\sigma^3 \sqrt{n}}

where {\Phi} is the cdf of a Normal(0,1) and {\beta_3 = \mathbb{E}(|X_i - \mu|^3)}. This is good news; the Normal approximation is accurate and so, for example, confidence intervals based on the Normal approximation can be expected to be accurate too.
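
Out of curiosity, one can compare the actual Kolmogorov distance with the Berry-Esseen bound by simulation. This sketch (mine, using Exponential(1) data as an arbitrary skewed example, with {\beta_3} estimated by Monte Carlo) suggests the bound can be quite loose:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n, reps = 30, 100_000
mu = sigma = 1.0                                 # Exponential(1) moments
z = np.sqrt(n) * (rng.exponential(1.0, (reps, n)).mean(axis=1) - mu) / sigma

t = np.linspace(-4, 4, 801)
emp_cdf = np.searchsorted(np.sort(z), t, side="right") / reps
dist = np.abs(emp_cdf - norm.cdf(t)).max()       # simulated sup_t |P(Z_n <= t) - Phi(t)|

beta3 = (np.abs(rng.exponential(1.0, 1_000_000) - mu) ** 3).mean()
print(dist, 0.4784 * beta3 / (sigma**3 * np.sqrt(n)))   # actual distance vs. the bound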

But these days we are often interested in high dimensional problems. In that case, we might be interested, not in one mean, but in many means. Is there still a good guarantee for closeness to the Normal limit?

Consider random vectors {X_1,\ldots, X_n\in \mathbb{R}^d} with mean vector {\mu} and covariance matrix {\Sigma}. We’d like to say that {\mathbb{P}(Z_n \in A)} is close to {\mathbb{P}(Z \in A)} where {Z_n = \sqrt{n}\,\Sigma^{-1/2}(\overline{X}_n - \mu)} and {Z\sim N(0,I)}. We allow the dimension {d=d_n} to grow with {n}.

One of the best results I know of is due to Bentkus (2003) who proved that

\displaystyle  \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \leq \frac{400\, d^{1/4} \beta}{\sqrt{n}}

where {{\cal A}} is the class of convex sets and {\beta = \mathbb{E} ||X||^3}. We expect that {\beta = C d^{3/2}} so the error is of order {O(d^{7/4}/\sqrt{n})}. This means that we must have {d = o(n^{2/7})} to make the error go to 0 as {n\rightarrow\infty}.

2. Ramping Up The Dimension

So far we need {d^{7/2}/n \rightarrow 0} to justify the Normal approximation, which is a serious restriction. Most of the current results in high dimensional inference, such as the lasso, do not place such a severe restriction on the dimension. Can we do better than this?

Yes. Right now we are witnessing a revolution in Normal approximations thanks to Stein’s method. This is a technique, invented by Charles Stein in 1972, for bounding the distance between a distribution and its Normal approximation.

Although the method is 40 years old, there has recently been an explosion of interest in it. Two excellent references are the book by Chen, Goldstein and Shao (2010) and the review article by Nathan Ross which can be found here.

An example of the power of this method is the very recent paper by Victor Chernozhukov, Denis Chetverikov and Kengo Kato. They showed that, if we restrict {{\cal A}} to rectangles rather than convex sets, then

\displaystyle  \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \rightarrow 0

as long as {(\log d)^7/n \rightarrow 0}. (In fact, they use a lot of tricks besides Stein’s method but Stein’s method plays a key role).

This is an astounding improvement. We only need {d} to be smaller than {e^{n^{1/7}}} instead of {n^{2/7}}.
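
To get a feel for the improvement, take {n = 10^6} (my arbitrary choice) and compare the largest dimensions the two conditions roughly allow:

import math

n = 1_000_000
print(n ** (2 / 7))            # convex sets: d must stay below n^(2/7), about 52
print(math.exp(n ** (1 / 7)))  # rectangles: d can approach e^(n^(1/7)), about 1300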

The restriction to rectangles is not so bad; it leads immediately to a confidence rectangle for the mean, for example. The authors show that their results can be used to derive further results for bootstrapping, for high-dimensional regression and for hypothesis testing.

I think we are seeing the beginning of a new wave of results on high dimensional Berry-Esseen theorems. I will do a post in the future on Stein’s method.

References

Bentkus, Vidmantas. (2003). On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference, 385-402.

Chen, Louis, Goldstein, Larry and Shao, Qi-Man. (2010). Normal approximation by Stein’s method. Springer.

Chernozhukov, Victor, Chetverikov, Denis and Kato, Kengo. (2012). Central limit theorems and multiplier bootstrap when p is much larger than n. http://arxiv.org/abs/1212.6906.

Ross, Nathan. (2011). Fundamentals of Stein’s method. Probability Surveys, 8, 210-293.

Stein, Charles. (1986). Approximate computation of expectations. Lecture Notes-Monograph Series, 7.

Bootstrapping and Subsampling: Part II

In part I we discussed the bootstrap and we saw that sometimes it does not work. By “doesn’t work” I mean that the coverage of the nominal {1-\alpha} level confidence intervals does not approach {1-\alpha} as the sample size increases. In this post I’ll discuss subsampling which works under broader conditions than the bootstrap.

1. Subsampling

Suppose that {X_1,\ldots, X_n \sim P}. Let {\hat\theta_n = g(X_1,\ldots,X_n)} be some estimator of {\theta = T(P)}.

We need to make one main assumption:

Assume that {n^\beta (\hat\theta_n - \theta)} converges in distribution to some continuous, non-degenerate distribution {J}.

We do not need to know {J}; we only need to know that such a {J} exists. We don’t even need to know {\beta} but for simplicity I’ll assume that {\beta=1/2}. So our assumption is that {\sqrt{n} (\hat\theta_n - \theta)} converges to some {J}. Here is the subsampling algorithm.

  1. Choose a number {b = b_n} such that

    \displaystyle  b_n \rightarrow \infty,\ \ \ \ \frac{b_n}{n}\rightarrow 0

    as {n\rightarrow\infty}.

  2. Draw {b} observations (without replacement) from the data {\{X_1,\ldots, X_n\}}. This is one subsample. Repeat this {N} times. So we have {N} subsamples, each of size {b}. Denote these subsamples by {S_1,\ldots, S_N}.
  3. From each subsample {S_j} we compute the estimator. This yields {N} values

    \displaystyle  \hat\theta_b^1,\ldots, \hat\theta_b^N.

    Note that each estimator is based on one subsample of size {b}.

  4. Now define

    \displaystyle  L_n(t) = \frac{1}{N}\sum_{j=1}^N I\Bigl( \sqrt{b} (\hat\theta_b^j - \hat\theta_n) \leq t\Bigr)

    where {I} denotes the indicator function.

  5. Let {\hat t_{\alpha/2}} and {\hat t_{1-\alpha/2}} be the {\alpha/2} and {1-\alpha/2} quantiles of {L_n}:

    \displaystyle  \hat t_{\alpha/2} = L_n^{-1}(\alpha/2),\ \ \ \hat t_{1-\alpha/2} = L_n^{-1}(1-\alpha/2).

  6. Define the confidence interval

    \displaystyle  C_n = \left[\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right].

Theorem:

\displaystyle  \mathop{\mathbb P}(\theta\in C_n)\rightarrow 1-\alpha

as {n\rightarrow\infty}.

Note: There are {N = \binom{n}{b}} possible subsamples of size {b}. In practice it is not necessary to use all of these subsamples. Usually, one selects, say, {N=1,000} subsamples at random.
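
Here is a minimal sketch of the algorithm in Python (my own translation of the steps above, with the median as an illustrative estimator and {b = \lfloor\sqrt{n}\rfloor} as an ad hoc choice of subsample size, not a recommendation):

import numpy as np

def subsampling_ci(x, estimator, alpha=0.05, N=1000, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = len(x)
    b = int(np.sqrt(n))                   # step 1: b -> infinity, b/n -> 0
    theta_hat = estimator(x)
    roots = np.array([
        np.sqrt(b) * (estimator(rng.choice(x, b, replace=False)) - theta_hat)
        for _ in range(N)                 # steps 2-4: N subsamples of size b
    ])
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])          # step 5
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)  # step 6

x = np.random.default_rng(2).standard_cauchy(2000)
print(subsampling_ci(x, np.median))       # should cover the true median, 0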

2. Why It Works

Here is an outline of the proof of the Theorem. (See Chapter 2 of Politis, Romano and Wolf 1999 for details.) Define

\displaystyle  J_n(x) = \mathbb{P}( \sqrt{n}(\hat\theta_n - \theta) \leq x).

By assumption {J_n(x) = J(x) + o(1)}. Also, {J_b(x) = J(x) + o(1)} since {b\rightarrow\infty} as {n\rightarrow\infty}. The key fact is that {L_n(x)-J_b(x)\stackrel{P}{\rightarrow}0} as {n\rightarrow \infty}. To see this, note that

\displaystyle  \sqrt{b} (\hat\theta_b - \hat\theta_n) = \sqrt{b} (\hat\theta_b - \theta) + \sqrt{\frac{b}{n}}\ \sqrt{n}(\theta-\hat\theta_n)= \sqrt{b} (\hat\theta_b - \theta) + R_n

where {R_n= o_P(1)}. This follows since {b/n \rightarrow 0} and since {\sqrt{n}(\theta-\hat\theta_n)} is converging in distribution. So,

\displaystyle  L_n(x) \approx U(x)

where

\displaystyle  U(x) = \frac{1}{N}\sum_{j=1}^N I( \sqrt{b} (\hat\theta_b^j - \theta) \leq x).

By Hoeffding’s inequality for U-statistics, we have that, for every {\epsilon>0},

\displaystyle  \mathbb{P} ( |U(x) - J_b(x)| > \epsilon) \leq 2 \exp\left( - \frac{2 n \epsilon^2}{b}\right)

which goes to 0 since {n/b\rightarrow \infty} as {n\rightarrow\infty}. So

\displaystyle  L_n(x) \approx U(x) \approx J_b(x) \approx J_n(x) \approx J(x).

The coverage of {C_n} is

\displaystyle  P\left(\hat\theta_n - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta_n - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right) = P(\hat{t}_{\alpha/2} \leq \sqrt{n}(\hat\theta_n-\theta) \leq \hat{t}_{1-\alpha/2})= J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}).

But

\displaystyle  J_n(\hat t_{1-\alpha/2}) - J_n(\hat t_{\alpha/2}) \approx L_n(\hat t_{1-\alpha/2}) - L_n(\hat t_{\alpha/2}) = \left(1-\frac{\alpha}{2}\right) - \frac{\alpha}{2} = 1-\alpha.

The amazing thing about subsampling is that there are very few regularity conditions. It works under almost no conditions. This is in contrast to the bootstrap which does require fairly strong conditions to yield valid confidence intervals.

3. What’s the Catch?

There are a few catches. First, you need to choose {b}. The only requirement is that {b/n\rightarrow 0} and {b\rightarrow\infty} as {n\rightarrow\infty}. It’s great that it works for such a large range of values but we need a way to pick {b}.

Fortunately, there are methods for choosing {b}. I won’t go into detail; rather, I’ll point you to Chapter 9 of Politis, Romano and Wolf (1999) and also Bickel and Sakov (2008).

A more serious problem is that the confidence guarantee is only an asymptotic guarantee. However, this is true of many statistical methods. I prefer to use methods with finite sample guarantees whenever possible but sometimes there is no choice but to resort to large sample approximations. Come to think of it, a good topic for a future blog post would be: for which problems are there accurate methods with finite sample guarantees?

Another concern about subsampling and bootstrapping is that they are computationally intensive. For a recent and very interesting approach to dealing with the computational burden, see this recent paper by Ariel Kleiner, Ameet Talwalkar, Purnamrita Sarkar and Mike Jordan.

By the way, see this paper by Joe Romano and Azeem Shaikh for some recent theoretical advances.

4. Conclusion

The bootstrap and subsampling are powerful techniques for constructing confidence intervals. Subsampling works under weaker conditions and relies on simpler theory. However, subsampling requires choosing a tuning parameter (the size {b} of the subsamples) which probably explains why it is less popular. I think that if we had a fast method for automatically choosing {b}, then subsampling would replace bootstrapping as the method of choice.

References

Bickel, P.J. and Sakov, A. (2008). On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica, 18, 967-985.

Politis, D.N. and Romano, J.P. and Wolf, M. (1999). Subsampling, Springer Verlag.

Bootstrapping and Subsampling: Part I

Bootstrapping and subsampling are in the “amazing” category in statistics. They seem much more popular in statistics than machine learning for some reason.

1. The Bootstrap

The bootstrap (a.k.a. the shotgun) was invented by Brad Efron. Here is how it works. We have data {X_1,\ldots, X_n \sim P} and we want a confidence interval for {\theta = T(P)}. For example, {\theta} could be the median of {P}, the mean of {P}, or something more complicated like the largest eigenvalue of the covariance matrix of {P}.

The {1-\alpha} bootstrap confidence interval is

\displaystyle  C_n = \left[ \hat\theta - \frac{\hat t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{\hat t_{\alpha/2}}{\sqrt{n}}\right]

where {\hat\theta} is an estimator of {\theta} and {\hat t_{\alpha/2}} and {\hat t_{1-\alpha/2}} are sample bootstrap quantiles that I will describe below. Before I explain this in more detail, notice two things. First, there is a minus sign in both the lower and upper endpoint. Second, the {\alpha/2} and {1-\alpha/2} quantiles are in the upper and lower endpoints, the reverse of what you might expect. The reason for the strange-looking interval will be clear when we derive the interval.

Now for some details. Think of the parameter of interest {\theta} as a function of the unknown distribution, which is why we write it as {\theta = T(P)}. Let {P_n} denote the empirical distribution:

\displaystyle  P_n(A) = \frac{1}{n}\sum_{i=1}^n I_A(X_i).

In other words, {P_n} is the distribution that puts mass {1/n} at each {X_i}.

The estimator is just the function {T} applied to {P_n}, that is, {\hat\theta = T(P_n)}. For example, if {\theta} is the median of {P} then {\hat\theta} is the median of {P_n} which is just the sample median.

Now let

\displaystyle  R_n = \sqrt{n}(\hat\theta - \theta).

We use {R_n} because typically it converges in distribution to some well-defined distribution (such as a Normal). Now let {H_n} denote the (unknown) distribution of {R_n}:

\displaystyle  H_n(t) = \mathbb{P}(R_n \leq t).

Suppose, for a moment, that we did know {H_n}. We could then find the {\alpha/2} quantile {t_{\alpha/2}} and the {1-\alpha/2} quantile {t_{1-\alpha/2}}, namely,

\displaystyle  t_{\alpha/2} = H_n^{-1}(\alpha/2)\ \ \ \mbox{and}\ \ \ t_{1-\alpha/2} = H_n^{-1}(1-\alpha/2).

It follows that

\displaystyle  \mathbb{P}( t_{\alpha/2} \leq R_n \leq t_{1-\alpha/2}) = 1- \alpha.

Continuing with the fantasy that we know {H_n}, define

\displaystyle  \Omega_n = \left[ \hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}},\ \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}}\right].

Now I will show you that {\Omega_n} is an exact {1-\alpha} confidence interval. This follows since

\displaystyle  \mathbb{P}(\theta \in \Omega_n) = \mathbb{P}\left(\hat\theta - \frac{t_{1-\alpha/2}}{\sqrt{n}} \leq \theta \leq \hat\theta - \frac{t_{\alpha/2}}{\sqrt{n}}\right)= \mathbb{P}(t_{\alpha/2} \leq R_n \leq t_{1-\alpha/2}) = H_n(t_{1-\alpha/2}) - H_n(t_{\alpha/2})= (1-\alpha/2) - \alpha/2 = 1-\alpha.

We engineered {\Omega_n} so that the last line would be exactly {1-\alpha}. The strange form of {\Omega_n} is explained by the fact that we really have a probability statement for {R_n} which we then manipulate into the form of an interval for {\theta}. (You can check that if {H_n} were standard Gaussian, then using the symmetry of the Gaussian, the interval could be re-written in the more familiar-looking form {\hat\theta\pm z_{\alpha/2}/\sqrt{n}} where {z_{\alpha/2}} is the upper-tail quantile of a Normal.)

The problem is that we don’t know {H_n} and hence we don’t know {t_{\alpha/2}} or {t_{1-\alpha/2}}. The bootstrap is a method for approximating {H_n}. Let {B} be a large number (for example, {B=100,000}). Now do this:

  1. Draw {n} observations {X_1^*,\ldots, X_n^*} from {P_n} and compute {\hat\theta^*} from these new data.
  2. Repeat step 1 {B} times yielding values {\hat\theta_1^*,\ldots, \hat\theta_B^*}.
  3. Approximate {H_n} with

    \displaystyle  \hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}(\hat\theta_j^* - \hat\theta)\leq t\Bigr)

    where {I} denotes the indicator function.

  4. Find the quantiles {\hat t_{\alpha/2}} and {\hat t_{1-\alpha/2}} of {\hat H_n} and construct {C_n} as defined earlier.

The interval {C_n} is the same as {\Omega_n} except we use the estimated quantiles for {C_n}. What we are doing here is estimating {H_n} by using {P_n} as an estimate of {P}. (That’s why we draw {X_1^*,\ldots, X_n^*} from {P_n}.) If {\hat H_n} is close to {H_n} then {\hat t_{\alpha/2} \approx t_{\alpha/2}} and {\hat t_{1-\alpha/2} \approx t_{1-\alpha/2}} and then {C_n \approx \Omega_n}.
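
In code, the whole procedure is only a few lines. This is a minimal sketch (mine, with an arbitrary estimator and made-up data for illustration):

import numpy as np

def bootstrap_ci(x, estimator, alpha=0.05, B=10_000, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    n = len(x)
    theta_hat = estimator(x)
    # Steps 1-2: draw n observations from P_n, i.e. resample with replacement.
    boot = np.array([estimator(rng.choice(x, n, replace=True)) for _ in range(B)])
    # Step 3: hat H_n is the empirical cdf of sqrt(n) (theta_j* - theta_hat).
    roots = np.sqrt(n) * (boot - theta_hat)
    # Step 4: plug the estimated quantiles into the interval C_n.
    t_lo, t_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return theta_hat - t_hi / np.sqrt(n), theta_hat - t_lo / np.sqrt(n)

x = np.random.default_rng(3).exponential(2.0, 200)
print(bootstrap_ci(x, np.median))   # true median is 2 log 2, about 1.39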

There are two sources of error. First we approximate

\displaystyle  H_n(t)=\mathbb{P}\bigl(\sqrt{n} (\hat\theta - \theta) \leq t\bigr)

with

\displaystyle  H_n^*(t)=\mathbb{P}_n\bigl(\sqrt{n} (\hat\theta^* - \hat\theta) \leq t\bigr).

Essentially, we are replacing {P} with {P_n}. Second, we are approximating {H_n^*(t)} with

\displaystyle  \hat H_n(t) = \frac{1}{B}\sum_{j=1}^B I\Bigl( \sqrt{n}(\hat\theta_j^* - \hat\theta)\leq t\Bigr).

This second source of error is negligible because we can make {B} as large as we want.

Remark: A moment’s reflection should convince you that drawing a sample of size {n} from {P_n} is the same as drawing {n} points with replacement from the original data. This is how the bootstrap is often described but I think it is clearer to describe it as drawing {n} observations from {P_n}.

2. Why Does It Work?

If {\hat H_n} is close to {H_n} then the bootstrap confidence interval will have coverage close to {1-\alpha}. Formally, one has to show that

\displaystyle  \sup_t |\hat H_n(t) - H_n(t)| \stackrel{P}{\rightarrow} 0

in which case

\displaystyle  \mathbb{P}(\theta\in C_n) \rightarrow 1-\alpha

as {n\rightarrow\infty}.

It is non-trivial to show that {\sup_t |\hat H_n(t) - H_n(t)| \stackrel{P}{\rightarrow} 0} but it has been shown in some generality. See Chapter 23 of van der Vaart (1998) for example.

3. Why Does It Fail?

The bootstrap does not always work. It can fail for a variety of reasons such as when the dimension is high or when {T} is poorly behaved.

An example of a bootstrap failure is in the problem of estimating phylogenetic trees. The problem here is that {T(P)} is an extremely complex object and the regularity conditions needed to make the bootstrap work are unlikely to hold.

In fact, this is a general problem with the bootstrap: it is most useful in complex situations, but these are often the situations where the theory breaks down.

4. What Do We Do?

So what do we do when the bootstrap fails? One answer is: subsampling. This is a variant of the bootstrap that works under much weaker conditions than the bootstrap. Interestingly, the theory behind subsampling is much simpler than the theory behind the bootstrap. The former involves little more than a simple concentration inequality while the latter uses high-powered techniques from empirical process theory.

So what is subsampling?

Stay tuned. I will describe it in my next post. In the meantime, I’d be interested to hear about your experiences with the bootstrap. Also, why do you think the bootstrap is not more popular in machine learning?

References

Efron, Bradley. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 1-26.

Efron, Bradley, and Tibshirani, R. (1994). An Introduction to the Bootstrap. Chapman and Hall.

Holmes, Susan. (2003). Bootstrapping phylogenetic trees: Theory and methods. Statistical Science, 241-255.

van der Vaart, A. (1998). Asymptotic Statistics. Cambridge University Press.

P.S. See also Cosma’s post here.

Review of “Antifragile” by Nassim Taleb (a.k.a. Doc Savage)

I have not yet finished reading Antifragile by Nassim Taleb. I am at page 214 of this 519-page book. Isn’t it sacrilege to write a review before finishing the entire book? Normally, yes. But you’ll see why this is an exception.

Taleb is well-known for his previous books such as Fooled By Randomness and The Black Swan.

I read Fooled By Randomness a long time ago and, as best I can recall, I liked it. I think the message was: “all traders except for me are idiots.” The message of The Black Swan was that outliers matter. It may sound trite but it is an important point; he is right that many people use models and then forget that they are just approximations. The Black Swan made a lot of people mad. He gave the impression that all statisticians were idiots and didn’t know about outliers and model violations. Nevertheless, I think he did have interesting things to say.

Antifragile continues the Black Swan theme but the arrogant tone has been taken up a notch.

As with Taleb’s other books, there are interesting ideas here. His main point is this: there is no word in the English language to mean the opposite of fragile. You might think that “resilient” or “robust” is the opposite of fragile but that’s not right. A system is fragile if it is sensitive to errors. A system is resilient if it is insensitive to errors. A system is antifragile if it improves with errors.

To understand antifragility, think of things that lead to improvement by trial-and-error. Evolution is an example. Entrepreneurship is another.

Generally, top-down, bureaucratic things tend to be fragile. Bottom-up, decentralized things tend to be antifragile. He refers to meddlers who want to impose centralized — and hence fragile — decision making on people as “fragilistas.” I love that word.

I like his ideas about antifragility. I share his dislike for centralized decision-making and bureaucrats (as well as his dislike of Paul Krugman and Thomas Friedman). So I really wanted to like this book.

The problem is the tone. The somewhat arrogant tone of his previous books has evolved into a kind of belligerent yelling. The “I am smart and everyone else is an idiot” shtick gets tiresome. Having dinner with your know-it-all uncle is tolerable. But spend too much time with him and you’ll go mad.

The book is full of bragging; there are continuous references to his amazing wonderful travels, all the cafés he has been in around the world, zillions of references to historical and philosophical texts and a steady stream of his likes and dislikes. He particularly dislikes academics, business schools, and especially Harvard. He often talks about the Harvard-Soviet empire. He got an MBA from Wharton where he credits an unnamed professor for teaching him about options. But, of course, the professor did not really understand what he was teaching.

We find out that Taleb hates TV, air-conditioning, sissies, and most economists. He has taken up weightlifting and, using a training technique he learned from “Lenny Cake” he can deadlift 350 pounds. He is now so strong that people mistake him for a bodyguard. You can’t make this stuff up.

I think that Taleb is Doc Savage (the Man of Bronze). For those who don’t know, Doc Savage was the hero in a series of books in the 1960s. Doc Savage was amazing. He was brilliant and strong. He was a scientist, detective, surgeon, inventor, martial arts expert and had a photographic memory. It was hilarious reading those books when I was young because they were so over the top.

Antifragile is “The Black Swan meets Doc Savage.” This is a real shame because, as I said, there are interesting ideas here. But I doubt many serious readers will be able to stomach the whole book. I read some Amazon reviews and I noticed that they were quite bimodal. Many people seem to believe he is the gigantic genius he claims to be and they love the book. More thoughtful readers are put off by the tone and give negative reviews.

If an editor had forced him to tone it down (and make the writing a little more organized) then I think this book would have been good. But he puts editors in the fragilista category. I can imagine the editor trying to give Taleb some suggestions only to be slapped around by the author.

Which is why I decided to write a review before finishing the book. You see, only sissy fragilistas finish a book before reviewing it.

Disclosure: I am an Amazon associate. This means that if you click on any links to Amazon in my posts, then I get credit towards a gift certificate.

To Condition, or Not to Condition, That Is the Question

Between the completely conditional world of Bayesian inference and the completely unconditional world of frequentist inference lies the hazy world of conditional inference.

The extremes are easy. In Bayesian-land you condition on all of the data. In Frequentist-land, you condition on nothing. If your feet are firmly planted in either of these idyllic places, read no further! Because, conditional inference is:

The undiscovered Country, from whose bourn
No Traveller returns, Puzzles the will,
And makes us rather bear those ills we have,
Than fly to others that we know not of.

1. The Extremes

As I said above, the extremes are easy. Let’s start with a concrete example. Let {Y_1,\ldots, Y_n} be a sample from {P\in {\cal P}}. Suppose we want to estimate {\theta = T(P)}; for example, {T(P)} could be the mean of {P}.

Bayesian Approach: Put a prior {\pi} on {P}. After observing the data {Y_1,\ldots, Y_n} compute the posterior for {P}. This induces a posterior for {\theta} given {Y_1,\ldots, Y_n}. We can then make statements like

\displaystyle  \pi( \theta\in A|Y_1,\ldots, Y_n) = 1-\alpha.

The statements are conditional on {Y_1,\ldots, Y_n}. There is no question about what to condition on; we condition on all the data.

Frequentist Approach: Construct a set {C_n = C(Y_1,\ldots, Y_n)}. We require that

\displaystyle  \inf_{P\in {\cal P}} P^n \Bigl( T(P)\in C_n \Bigr) \geq 1-\alpha

where {P^n = P\times \cdots \times P} is the distribution corresponding to taking {n} samples from {P}. We then call {C_n} a {1-\alpha} confidence set. No conditioning takes place. (Of course, we might want more than just the guarantee in the above equation, like some sort of optimality; but let’s not worry about that here.)

(I notice that Andrew often says that frequentists “condition on {\theta}”. I think he means, they do calculations for each fixed {P}. At the risk of being pedantic, this is not conditioning. To condition on {P} requires that {P} be a random variable, which it is in the Bayesian framework but not in the frequentist framework. But I am probably just nitpicking here.)

2. So Why Condition?

Suppose you are taking the frequentist route. Why would you be enticed to condition? Consider the following example from Berger and Wolpert (1988).

I write down a real number {\theta}. I then generate two random variables {Y_1, Y_2} as follows:

\displaystyle  Y_1 = \theta + \epsilon_1,\ \ \ Y_2 = \theta + \epsilon_2

where {\epsilon_1} and {\epsilon_2} are iid and

\displaystyle  P(\epsilon_i = 1) = P(\epsilon_i = -1) = \frac{1}{2}.

Let {P_\theta} denote the distribution of {Y_i}. The set of distributions is {{\cal P} = \{ P_\theta:\ \theta\in\mathbb{R}\}}.

I show Fred the frequentist {Y_1} and {Y_2} and he has to infer {\theta}. Fred comes up with the following confidence set:

\displaystyle  C(Y_1,Y_2) = \begin{cases} \left\{ \frac{Y_1+Y_2}{2} \right\} & \mbox{if}\ Y_1 \neq Y_2\\ \left\{ Y_1-1 \right\} & \mbox{if}\ Y_1 = Y_2. \end{cases}

Now, it is easy to check that, no matter what value {\theta} takes, we have that

\displaystyle  P_\theta\Bigl(\theta\in C(Y_1,Y_2)\Bigr) = \frac{3}{4}\ \ \ \mbox{for every}\ \theta\in \mathbb{R}.

Fred is happy. {C(Y_1,Y_2)} is a 75 percent confidence interval.

To be clear: if I play this game with Fred every day, and I use a different value of {\theta} every day, we will find that Fred traps the true value 75 percent of the time.

Now suppose the data are {(Y_1,Y_2) = (17,19)}. Fred reports that his 75 percent confidence interval is {\{18\}}. Fred is correct that his procedure has 75 percent coverage. But in this case, many people are troubled by reporting that {\{18\}} is a 75 percent confidence interval, because with these data we know that {\theta} must be 18. Indeed, if we did a Bayesian analysis with a prior that puts positive density on each {\theta}, we would find that {\pi(\theta=18|Y_1=17,Y_2=19) = 1}.

So, we are 100 percent certain that {\theta = 18} and yet we are reporting {\{18\}} as a 75 percent confidence interval.

There is nothing wrong with the confidence interval. It is a procedure, and the procedure comes with a frequency guarantee: it will trap the truth 75 percent of the time. It does not agree with our degrees of belief but no one said it should.

And yet Fred thinks he can retain his frequentist credentials and still do something which intuitively feels better. This is where conditioning comes in.

Let

\displaystyle  A = \begin{cases} 1 & \mbox{if}\ Y_1 \neq Y_2\\ 0 & \mbox{if}\ Y_1 = Y_2. \end{cases}

The statistic {A} is an ancillary: it has a distribution that does not depend on {\theta}. In particular, {P_\theta(A=1) =P_\theta(A=0) =1/2} for every {\theta}. The idea now is to report confidence, conditional on {A}. Our new procedure is:

If {A=1} report {C=\{ (Y_1 + Y_2)/2\}} with confidence level 1.
If {A=0} report {C=\{ Y_1-1 \}} with confidence level 1/2.

This is indeed a valid conditional confidence interval. Again, imagine we play the game over a long sequence of trials. On the subsequence for which {A=1}, our interval contains the true value 100 percent of the time. On the subsequence for which {A=0}, our interval contains the true value 50 percent of the time.
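
A quick simulation (my own check, with integer-valued {\theta} so the arithmetic is exact) confirms both the unconditional and the conditional claims:

import numpy as np

rng = np.random.default_rng(4)
reps = 100_000
theta = rng.integers(-1000, 1000, reps).astype(float)   # a new theta each day
e1, e2 = rng.choice([-1.0, 1.0], reps), rng.choice([-1.0, 1.0], reps)
y1, y2 = theta + e1, theta + e2

guess = np.where(y1 != y2, (y1 + y2) / 2, y1 - 1)   # Fred's interval C(Y1, Y2)
hit = guess == theta
A = y1 != y2                                        # the ancillary statistic

print(hit.mean())        # unconditional coverage, about 0.75
print(hit[A].mean())     # coverage given A = 1, exactly 1
print(hit[~A].mean())    # coverage given A = 0, about 0.5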

We still have valid coverage and a more intuitive confidence interval. Our result is identical to the Bayesian answer if the Bayesian uses a flat prior. It is nearly equal to the Bayesian answer if the Bayesian uses a proper but very flat prior.

(This is an example where the Bayesian has the upper hand. I’ve had other examples on this blog where the frequentist does better than the Bayesian. To readers who attach themselves to either camp: remember, there is plenty of ammunition in terms of counterexamples on BOTH sides.)

Another famous example is from Cox (1958). Here is a modified version of that example. I flip a coin. If the coin is HEADS I give Fred {Y \sim N(\theta,\sigma_1^2)}. If the coin is TAILS I give Fred {Y \sim N(\theta,\sigma_2^2)} where {\sigma_1^2 > \sigma_2^2}. What should Fred’s confidence interval for {\theta} be?

We can condition on the coin, and report the usual confidence interval corresponding to the appropriate Normal distribution. But if we look unconditionally, over replications of the whole experiment, and minimize the expected length of the interval, we get an interval that has coverage less than {1-\alpha} for HEADS and greater than {1-\alpha} for TAILS. So optimizing unconditionally pulls us away from what seems to be the correct conditional answer.

3. The Problem With Conditioning

There are lots of simple examples like the ones above where, psychologically, it just feels right to condition on something. But simple intuition is misleading. We would still be using Newtonian physics if we went by our gut feelings.

In complex situations, it is far from obvious if we should condition or what we should condition on. Let me review a simplified version of Larry Brown’s (1990) example that I discussed here. You observe
{(X_1,Y_1), \ldots, (X_n,Y_n)} where

\displaystyle  Y_i = \beta^T X_i + \epsilon_i,

{\epsilon_i \sim N(0,1)}, {n=100} and each {X_i = (X_{i1},\ldots, X_{id})} is a vector of length {d=100,000}. Suppose further that the {d} covariates are independent. We want to estimate {\beta_1}.

The “best” estimator (the maximum likelihood estimator) is obtained by conditioning on all the data. This means we should estimate the vector {\beta} by least squares. But, the least squares estimator is useless when {d> n}.

From the Bayesian point of view we compute the posterior

\displaystyle  \pi\Bigl(\beta_1 \Bigm| (X_1,Y_1),\ldots, (X_n,Y_n)\Bigr)

which, for such a large {d}, will be useless (completely dominated by the prior).

These estimators have terrible behavior compared to the following “anti-conditioning” estimator. Throw away all the covariates except the first one. Now do linear regression using only {Y} and the first covariate. The resulting estimator {\hat\beta_1} is then tightly concentrated around {\beta_1} with high probability. In this example, throwing away data is much better than conditioning on the data. There are some papers on “forgetful Bayesian inference” where one conditions on only part of the data. This is fine but then we are back to the original question: what do we condition on?
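
Here is a scaled-down sketch of the phenomenon (my own toy version: {d=2,000} rather than {100,000}, and a {\beta} that is zero except in its first coordinate, so it is an illustration rather than Brown's exact setup):

import numpy as np

rng = np.random.default_rng(5)
n, d = 100, 2000                       # d >> n
beta = np.zeros(d)
beta[0] = 2.0                          # the target beta_1 is 2
X = rng.normal(size=(n, d))            # independent covariates
y = X @ beta + rng.normal(size=n)

# "Anti-conditioning": throw away all covariates but the first.
b1_marginal = (X[:, 0] @ y) / (X[:, 0] @ X[:, 0])

# Conditioning on all the data: least squares (minimum-norm when d > n).
b1_full = np.linalg.lstsq(X, y, rcond=None)[0][0]

print(b1_marginal)   # close to 2
print(b1_full)       # badly shrunk toward 0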

There are many other examples like this one.

4. The Answer

It would be nice if there were a clear answer such as “you should always condition” or “you should never condition.” But there isn’t. Do a Google Scholar search on conditional inference and you will find an enormous literature. What started as a simple, compelling idea evolved into a complex research area. Many of these conditional methods are very sophisticated and rely on second order asymptotics. But it is rare to see anyone use conditional inference in complex problems, with the exception of Bayesian inference, which some will argue goes for a definite, psychologically satisfying answer at the expense of thinking hard about the properties of the resulting procedures.

Unconditional inference is simple and avoids disasters. The cost is that we can sometimes get psychologically unsatisfying answers. Conditional inference yields more psychologically satisfying answers but can lead to procedures with disastrous behavior.

There is no substitute for thinking. Be skeptical of easy answers.

Thus Conscience does make Cowards of us all,
And thus the Native hue of Resolution
Is sicklied o’er, with the pale cast of Thought,

References

Berger, J.O. and Wolpert, R.L. (1988). The likelihood principle, IMS.

Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18, 471-493.

Cox, D.R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29, 357-372.