## Monthly Archives: February 2013

### The Other B-Word

Christian has a fun post about the rise of the B-word (Bayesian). “Bayesian” kills “frequentist.”

Well, how about the other B-word, “Bootstrap.” Look at this Google-trends plot:

The bootstrap demolishes Bayes!

Actually, Christian’s post was tongue-in-cheek. As he points out, “frequentist” is … not a qualification used by frequentists to describe their methods. In other words (!), “frequentist” does not occur very often in frequentist papers.

But all joking aside, that does raise an interesting question. Why do Bayesians put the word “Bayesian” in the title of their papers? For example, you might see a paper with a title like

“A Bayesian Analysis of Respiratory Diseases in Children”

but you would be unlikely to see a paper with a title like

“A Frequentist Analysis of Respiratory Diseases in Children.”

In fact, I think you are doing a disservice to Bayesian inference if you include “Bayesian” in the title. Allow me to explain.

The great Bayesian statistician Dennis Lindley argued strongly against creating a Bayesian journal. He argued that if Bayesian inference is to be successful and become part of the mainstream of statistics, then it should not be treated as novel. Having a Bayesian journal comes across as defensive. Be bold and publish your papers in our best journals, he argued. In other words, if you really believe in the power of Bayesian statistics, then remove the word Bayesian and just think of it as statistics.

I think the same argument applies to paper titles. If you think Bayesian inference is the right way to analyze respiratory diseases in children, then write a paper entitled:

“A Statistical Analysis of Respiratory Diseases in Children.”

Qualifying the title with the word “Bayesian” suggests that there is something novel or weird about using Bayes. If you believe in Bayes, have the courage to leave it out of the title.

### Rise of the Machines

The Committee of Presidents of Statistical Societies (COPSS) is celebrating its 50th Anniversary. They have decided to publish a collection and I was honored to be invited to contribute. The theme of the book is Past, Present and Future of Statistical Science.

My paper, entitled Rise of the Machines, can be found here.

To whet your appetite, here is the beginning of the paper.

RISE OF THE MACHINES
Larry Wasserman

On the 50th anniversary of the Committee of Presidents of Statistical Societies I reflect on the rise of the field of Machine Learning and what it means for Statistics. Machine Learning offers a plethora of new research areas, new applications areas and new colleagues to work with. Our students now compete with Machine Learning students for jobs. I am optimistic that visionary Statistics departments will embrace this emerging field; those that ignore or eschew Machine Learning do so at their own risk and may find themselves in the rubble of an outdated, antiquated field.

1. Introduction

Statistics is the science of learning from data. Machine Learning (ML) is the science of learning from data. These fields are identical in intent although they differ in their history, conventions, emphasis and culture.

There is no denying the success and importance of the field of Statistics for science and, more generally, for society. I’m proud to be a part of the field. The focus of this essay is on one challenge (and opportunity) to our field: the rise of Machine Learning.

During my twenty-five year career I have seen Machine Learning evolve from a collection of rather primitive (yet clever) methods for classification into a sophisticated science that is rich in theory and applications.

A quick glance at The Journal of Machine Learning Research (mlr.csail.mit.edu) and NIPS (books.nips.cc) reveals papers on a variety of topics that will be familiar to Statisticians such as: conditional likelihood, sequential design, reproducing kernel Hilbert spaces, clustering, bioinformatics, minimax theory, sparse regression, estimating large covariance matrices, model selection, density estimation, graphical models, wavelets, nonparametric regression. These could just as well be papers in our flagship statistics journals.

This sampling of topics should make it clear that researchers in Machine Learning — who were at one time somewhat unaware of mainstream statistical methods and theory — are now not only aware of, but actively engaged in, cutting edge research on these topics.

On the other hand, there are statistical topics that are active areas of research in Machine Learning but are virtually ignored in Statistics. To avoid becoming irrelevant, we Statisticians need to (i) stay current on research areas in ML, (ii) change our outdated model for disseminating knowledge and (iii) revamp our graduate programs.

The rest of the paper can be found here.

### Statistics Declares War on Machine Learning!

Well I hope the dramatic title caught your attention. Now I can get to the real topic of the post, which is: finite sample bounds versus asymptotic approximations.

In my last post I discussed Normal limiting approximations. One commenter, Csaba Szepesvari, wrote the following interesting comment:

What still surprises me about statistics or the way statisticians do their business is the following: The Berry-Esseen theorem says that a confidence interval chosen based on the CLT is possibly shorter by a good amount of ${c/\sqrt{n}}$. Despite this, statisticians keep telling me that they prefer their “shorter” CLT-based confidence intervals to ones derived by using finite-sample tail inequalities that we, “machine learning people prefer” (lies vs. honesty?). I could never understood the logic behind this reasoning and I am wondering if I am missing something. One possible answer is that the Berry-Esseen result could be oftentimes loose. Indeed, if the summands are normally distributed, the difference will be zero. Thus, an interesting question is the study of the behavior of the worst-case deviation under assumptions (or: for what class of distributions is the Berry-Esseen result tight?). It would be interesting to read about this, or in general, why statisticians prefer to possibly make big mistakes to saying “I don’t know” more often.

I completely agree with Csaba (if I may call you by your first name).

A good example is the binomial. There are simple confidence intervals for the parameter of a binomial that have correct, finite sample coverage. These finite sample confidence sets ${C_n}$ have the property that

$\displaystyle P(\theta\in C_n) \geq 1-\alpha$

for every sample size ${n}$. There is no reason to use the Normal approximation.
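To make the contrast concrete, here is a minimal sketch that computes exact coverage for two binomial intervals. The post does not say which finite sample interval is meant; the one below, built from Hoeffding's inequality, is my own choice for illustration, and the function names and the values of `n` and `theta` are assumptions. The second interval is the usual Normal-approximation (Wald) interval.

```python
import math
from statistics import NormalDist


def hoeffding_interval(successes, n, alpha=0.05):
    """Finite-sample interval from Hoeffding's inequality, clipped to [0, 1]."""
    p_hat = successes / n
    half_width = math.sqrt(math.log(2 / alpha) / (2 * n))
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


def wald_interval(successes, n, alpha=0.05):
    """Asymptotic interval based on the Normal approximation."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


def exact_coverage(interval, n, theta, alpha=0.05):
    """Exact P(theta in C_n): sum the Binomial(n, theta) pmf over covering outcomes."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n, alpha)
        if lo <= theta <= hi:
            total += math.comb(n, k) * theta**k * (1 - theta) ** (n - k)
    return total


# Illustrative values only: a small sample and a parameter near the boundary.
n, theta = 20, 0.05
hoeffding_cov = exact_coverage(hoeffding_interval, n, theta)
wald_cov = exact_coverage(wald_interval, n, theta)
print(f"Hoeffding coverage: {hoeffding_cov:.3f}")
print(f"Wald coverage:      {wald_cov:.3f}")
```

The Hoeffding interval is wide, which is the price of a guarantee that holds for every ${n}$ and every ${\theta}$; the Wald interval is shorter but can undercover badly when ${n}$ is small and ${\theta}$ is near 0 or 1.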

And yet …

I want to put on my Statistics hat and remove my Machine Learning hat and defend my fellow statisticians for using Normal approximations.

Open up a random issue of Science or Nature. There are papers on physics, biology, chemistry and so on, filled with results that look like, for example:

${114.21 \pm 3.68}$

These are usually point estimates with standard errors based on asymptotic Normal approximations.

The fact is, science would come to a screeching halt without Normal approximations. The reason is that finite sample intervals are available in only a tiny fraction of statistical problems. I think that Machine Learners (remember, I am wearing my Statistics hat today) have a skewed view of the universe of statistical problems. The certitude of finite sample intervals is reassuring but it rules out the vast majority of statistics as applied to real scientific problems.

Here is a random list, off the top of my head, of problems where all we have are confidence intervals based on asymptotic Normal approximations:

(Most) maximum likelihood estimators, robust estimators, time series, nonlinear regression, generalized linear models, survival analysis (Cox proportional hazards model), partial correlations, longitudinal models, semiparametric inference, confidence bands for function estimators, population genetics, etc.

Biostatistics and cancer research would not exist without confidence intervals based on Normal approximations. In fact, I would guess that every medication or treatment you have ever had was based on medical research that used these approximations.

When we use approximate confidence intervals, we always proceed knowing that they are approximations. In fact, the entire scientific process is based on numerous, hard to quantify, approximations.

Even finite sample intervals are only approximations. They assume that the data are iid draws from a distribution which is a convenient fiction. It is rarely exactly true.

Let me repeat: I prefer to use finite sample intervals when they are available. And I agree with Csaba that it would be interesting to state more clearly the conditions under which we can tighten the Berry-Esseen bounds. But approximate confidence intervals are a successful, indispensable part of science.

P.S. We’re comparing two frequentist ideas here: finite sample confidence intervals versus asymptotic confidence intervals. Die-hard Bayesians will say: Bayes is always a finite sample method. Frequentists will reply: no, you’re just sweeping the complications under the rug. Let’s refrain from the Bayes vs Frequentist arguments for this post. The question being discussed is: given that you are doing frequentist inference, what are your thoughts on finite sample versus asymptotic intervals?

### How Close Is The Normal Distribution?

One of the first things you learn in probability is that the average ${\overline{X}_n}$ has a distribution that is approximately Normal. More precisely, if ${X_1,\ldots, X_n}$ are iid with mean ${\mu}$ and variance ${\sigma^2}$ then

$\displaystyle Z_n \rightsquigarrow N(0,1)$

where

$\displaystyle Z_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}$

and ${\rightsquigarrow}$ means “convergence in distribution.”

1. How Close?

But how close is the distribution of ${Z_n}$ to the Normal? The usual answer is given by the Berry-Esseen theorem which says that

$\displaystyle \sup_t |P(Z_n \leq t) - \Phi(t)| \leq \frac{0.4784 \,\beta_3}{\sigma^3 \sqrt{n}}$

where ${\Phi}$ is the cdf of a Normal(0,1) and ${\beta_3 = \mathbb{E}(|X_i - \mu|^3)}$ is the third absolute central moment. This is good news; the Normal approximation is accurate and so, for example, confidence intervals based on the Normal approximation can be expected to be accurate too.
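As a rough check on how tight the bound is, here is a small sketch for Bernoulli summands, where the distribution of ${Z_n}$ can be computed exactly from the binomial pmf. The function names and the values ${p=0.1}$, ${n=100}$ are my own illustrative choices, and the code uses the centered third absolute moment ${\mathbb{E}|X_i-\mu|^3}$.

```python
import math
from statistics import NormalDist


def berry_esseen_bound(p, n, constant=0.4784):
    """Bound constant * beta_3 / (sigma^3 sqrt(n)) for Bernoulli(p) summands."""
    sigma = math.sqrt(p * (1 - p))
    # Centered third absolute moment E|X - p|^3 of a Bernoulli(p) variable.
    beta_3 = p * (1 - p) ** 3 + (1 - p) * p**3
    return constant * beta_3 / (sigma**3 * math.sqrt(n))


def exact_sup_distance(p, n):
    """sup_t |P(Z_n <= t) - Phi(t)| when the sum of the X_i is Binomial(n, p)."""
    phi = NormalDist().cdf
    sigma = math.sqrt(p * (1 - p))
    cdf_before = 0.0  # P(Z_n < t) just to the left of the current jump
    worst = 0.0
    for k in range(n + 1):
        t = (k - n * p) / (sigma * math.sqrt(n))  # jump location of Z_n
        cdf_after = cdf_before + math.comb(n, k) * p**k * (1 - p) ** (n - k)
        # A step cdf minus a continuous cdf is extremal at a jump,
        # so checking the left and right limits at each jump suffices.
        worst = max(worst, abs(cdf_before - phi(t)), abs(cdf_after - phi(t)))
        cdf_before = cdf_after
    return worst


p, n = 0.1, 100  # illustrative values
bound = berry_esseen_bound(p, n)
exact = exact_sup_distance(p, n)
print(f"Berry-Esseen bound: {bound:.4f}")
print(f"Exact sup distance: {exact:.4f}")
```

Running this for a few values of ${p}$ and ${n}$ gives a feel for how much slack the bound has in a case where the exact answer is available.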

But these days we are often interested in high dimensional problems. In that case, we might be interested, not in one mean, but in many means. Is there still a good guarantee for closeness to the Normal limit?

Consider random vectors ${X_1,\ldots, X_n\in \mathbb{R}^d}$ with mean vector ${\mu}$ and covariance matrix ${\Sigma}$. We’d like to say that ${\mathbb{P}(Z_n \in A)}$ is close to ${\mathbb{P}(Z \in A)}$ where ${Z_n = \sqrt{n}\,\Sigma^{-1/2}(\overline{X}_n - \mu)}$ and ${Z\sim N(0,I)}$. We allow the dimension ${d=d_n}$ to grow with ${n}$.

One of the best results I know of is due to Bentkus (2003) who proved that

$\displaystyle \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \leq \frac{400\, d^{1/4} \beta}{\sqrt{n}}$

where ${{\cal A}}$ is the class of convex sets and ${\beta = \mathbb{E} ||\Sigma^{-1/2}(X - \mu)||^3}$ is the third moment of the norm of the standardized vector. We expect that ${\beta = C d^{3/2}}$ so the error is of order ${O(d^{7/4}/\sqrt{n})}$. This means that we must have ${d = o(n^{2/7})}$ to make the error go to 0 as ${n\rightarrow\infty}$.
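
Spelling out the arithmetic behind that rate, under the same heuristic ${\beta = C d^{3/2}}$ (with ${C}$ assumed not to depend on ${d}$ or ${n}$):

$\displaystyle \frac{400\, d^{1/4} \beta}{\sqrt{n}} = \frac{400\, C\, d^{1/4}\, d^{3/2}}{\sqrt{n}} = \frac{400\, C\, d^{7/4}}{\sqrt{n}}, \qquad \frac{d^{7/4}}{\sqrt{n}} \rightarrow 0 \iff \frac{d^{7/2}}{n} \rightarrow 0 \iff d = o(n^{2/7}).$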

2. Ramping Up The Dimension

So far we need ${d^{7/2}/n \rightarrow 0}$ to justify the Normal approximation, which is a serious restriction. Most of the current results in high dimensional inference, such as the lasso, do not place such a severe restriction on the dimension. Can we do better than this?

Yes. Right now we are witnessing a revolution in Normal approximations thanks to Stein’s method. This is a method for bounding the distance from Normal approximations invented by Charles Stein in 1972.

Although the method is 40 years old, there has recently been an explosion of interest in it. Two excellent references are the book by Chen, Goldstein and Shao (2010) and the review article by Nathan Ross which can be found here.

An example of the power of this method is the very recent paper by Victor Chernozhukov, Denis Chetverikov and Kengo Kato. They showed that, if we restrict ${{\cal A}}$ to rectangles rather than convex sets, then

$\displaystyle \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \rightarrow 0$

as long as ${(\log d)^7/n \rightarrow 0}$. (In fact, they use a lot of tricks besides Stein’s method but Stein’s method plays a key role.)

This is an astounding improvement. We only need ${d}$ to be smaller than ${e^{n^{1/7}}}$ instead of ${n^{2/7}}$.
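To get a feel for the gap, here is a quick sketch that evaluates the two growth rates at a few sample sizes. These are orders of magnitude only: constants are ignored, the function names are mine, and the sample sizes are arbitrary.

```python
import math


def max_dim_convex(n):
    """Order of the largest dimension allowed by d = o(n^(2/7))."""
    return n ** (2 / 7)


def max_dim_rectangles(n):
    """Order of the largest dimension allowed by (log d)^7 / n -> 0."""
    return math.exp(n ** (1 / 7))


# Each row holds (n, n^(2/7), exp(n^(1/7))).
rows = [(n, max_dim_convex(n), max_dim_rectangles(n)) for n in (10**3, 10**6, 10**9)]
for n, d_convex, d_rect in rows:
    print(f"n = {n:>10}: n^(2/7) = {d_convex:12.1f}, exp(n^(1/7)) = {d_rect:14.1f}")
```

The polynomial rate grows slowly with ${n}$ while the exponential rate takes off, which is the sense in which the improvement is so striking.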

The restriction to rectangles is not so bad; it leads immediately to a confidence rectangle for the mean, for example. The authors show that their results can be used to derive further results for bootstrapping, for high-dimensional regression and for hypothesis testing.

I think we are seeing the beginning of a new wave of results on high dimensional Berry-Esseen theorems. I will do a post in the future on Stein’s method.

References

Bentkus, Vidmantas. (2003). On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference, 385-402.

Chen, Louis, Goldstein, Larry and Shao, Qi-Man. (2010). Normal approximation by Stein’s method. Springer.

Chernozhukov, Victor, Chetverikov, Denis and Kato, Kengo. (2012). Central Limit Theorems and Multiplier Bootstrap when p is much larger than n. http://arxiv.org/abs/1212.6906.

Ross, Nathan. (2011). Fundamentals of Stein’s method. Probability Surveys, 8, 210-293.

Stein, Charles. (1986). Approximate computation of expectations. Lecture Notes-Monograph Series 7.