## Statistics Declares War on Machine Learning!

Well, I hope the dramatic title caught your attention. Now I can get to the real topic of the post, which is: finite sample bounds versus asymptotic approximations.

In my last post I discussed Normal limiting approximations. One commenter, Csaba Szepesvari, wrote the following interesting comment:

What still surprises me about statistics or the way statisticians do their business is the following: The Berry-Esseen theorem says that a confidence interval chosen based on the CLT is possibly shorter by a good amount of ${c/\sqrt{n}}$. Despite this, statisticians keep telling me that they prefer their “shorter” CLT-based confidence intervals to ones derived by using finite-sample tail inequalities that we, “machine learning people prefer” (lies vs. honesty?). I could never understand the logic behind this reasoning and I am wondering if I am missing something. One possible answer is that the Berry-Esseen result could be oftentimes loose. Indeed, if the summands are normally distributed, the difference will be zero. Thus, an interesting question is the study of the behavior of the worst-case deviation under assumptions (or: for what class of distributions is the Berry-Esseen result tight?). It would be interesting to read about this, or in general, why statisticians prefer to possibly make big mistakes to saying “I don’t know” more often.

I completely agree with Csaba (if I may call you by your first name).

A good example is the binomial. There are simple confidence intervals for the parameter of a binomial that have correct, finite sample coverage. These finite sample confidence sets ${C_n}$ have the property that

$\displaystyle P(\theta\in C_n) \geq 1-\alpha$

for every sample size ${n}$. There is no reason to use the Normal approximation.
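To make the binomial case concrete, here is a minimal sketch in Python that computes the exact coverage of two intervals by summing the binomial probabilities. The choice of Hoeffding's inequality as the finite sample interval is an assumption made for illustration (Clopper-Pearson would be another option), and all function names are hypothetical.

```python
import math

def binom_pmf(k, n, theta):
    """P(X = k) for X ~ Binomial(n, theta)."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def hoeffding_interval(k, n, alpha=0.05):
    """Finite sample interval from Hoeffding's inequality, clipped to [0, 1]."""
    p_hat = k / n
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

def wald_interval(k, n, z=1.959964):
    """Normal-approximation (Wald) interval; z is the 97.5% Normal quantile,
    so this corresponds to alpha = 0.05."""
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def coverage(interval, n, theta):
    """Exact coverage: total probability of outcomes whose interval contains theta."""
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= theta <= hi:
            total += binom_pmf(k, n, theta)
    return total

if __name__ == "__main__":
    n = 50
    for theta in (0.05, 0.2, 0.5):
        print(theta,
              coverage(hoeffding_interval, n, theta),
              coverage(wald_interval, n, theta))
```

With n = 50 and θ = 0.05, the Wald interval's exact coverage falls below 0.95 (it collapses to a single point when no successes are observed), while the Hoeffding interval stays above 0.95 at the price of being much wider.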

And yet …

I want to put on my Statistics hat and remove my Machine Learning hat and defend my fellow statisticians for using Normal approximations.

Open up a random issue of Science or Nature. There are papers on physics, biology, chemistry and so on, filled with results that look like, for example:

${114.21 \pm 3.68}$

These are usually point estimates with standard errors based on asymptotic Normal approximations.

The fact is, science would come to a screeching halt without Normal approximations. The reason is that finite sample intervals are available in only a tiny fraction of statistical problems. I think that Machine Learners (remember, I am wearing my Statistics hat today) have a skewed view of the universe of statistical problems. The certitude of finite sample intervals is reassuring but it rules out the vast majority of statistics as applied to real scientific problems.

Here is a random list, off the top of my head, of problems where all we have are confidence intervals based on asymptotic Normal approximations:

(Most) maximum likelihood estimators, robust estimators, time series, nonlinear regression, generalized linear models, survival analysis (Cox proportional hazards model), partial correlations, longitudinal models, semiparametric inference, confidence bands for function estimators, population genetics, etc.

Biostatistics and cancer research would not exist without confidence intervals based on Normal approximations. In fact, I would guess that every medication or treatment you have ever had was based on medical research that used these approximations.

When we use approximate confidence intervals, we always proceed knowing that they are approximations. In fact, the entire scientific process is based on numerous, hard to quantify, approximations.

Even finite sample intervals are only approximations. They assume that the data are iid draws from a distribution which is a convenient fiction. It is rarely exactly true.

Let me repeat: I prefer to use finite sample intervals when they are available. And I agree with Csaba that it would be interesting to state more clearly the conditions under which we can tighten the Berry-Esseen bounds. But approximate confidence intervals are a successful, indispensable part of science.
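One rough way to see how loose the Berry-Esseen bound can be in a particular case is to compare it with the exact distance between the standardized binomial distribution function and the Normal one. The sketch below does this for Bernoulli summands. The constant `c = 0.56` is an assumed value for the universal constant (treat it as a placeholder, not a sharp figure), and the function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard Normal distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def kolmogorov_distance(n, p):
    """sup_x |P(Z_n <= x) - Phi(x)| where Z_n is the standardized Binomial(n, p).

    The CDF is a step function, so the supremum is attained at a jump point,
    comparing Phi with the CDF value just before and just after the jump.
    Plain floating point is used, so keep n to about 1000 or less.
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    cdf_left = 0.0  # P(X < k), the left limit of the CDF at k
    worst = 0.0
    for k in range(n + 1):
        phi = normal_cdf((k - mu) / sigma)
        cdf_right = cdf_left + math.comb(n, k) * p**k * (1 - p)**(n - k)
        worst = max(worst, abs(cdf_left - phi), abs(cdf_right - phi))
        cdf_left = cdf_right
    return worst

def berry_esseen_bound(n, p, c=0.56):
    """c * rho / (sigma^3 * sqrt(n)) for Bernoulli(p) summands.

    c is an ASSUMED value for the universal constant.
    """
    sigma = math.sqrt(p * (1 - p))
    rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)  # E|X - p|^3
    return c * rho / (sigma ** 3 * math.sqrt(n))

if __name__ == "__main__":
    for n, p in ((100, 0.5), (1000, 0.001)):
        print(n, p, kolmogorov_distance(n, p), berry_esseen_bound(n, p))
```

For a symmetric case such as n = 100, p = 0.5 the exact distance and the bound are of the same order, whereas for a very skewed case such as n = 1000, p = 0.001 both are large, which is the regime where the Normal approximation deserves suspicion.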

P.S. We’re comparing two frequentist ideas here: finite sample confidence intervals versus asymptotic confidence intervals. Die-hard Bayesians will say: Bayes is always a finite sample method. Frequentists will reply: no, you’re just sweeping the complications under the rug. Let’s refrain from the Bayes vs Frequentist arguments for this post. The question being discussed is: given that you are doing frequentist inference, what are your thoughts on finite sample versus asymptotic intervals?

1. Posted February 9, 2013 at 11:11 am | Permalink

Good reality check there. By the way, typo in there: “..treatment you have every had..” (every => ever)

• Posted February 9, 2013 at 11:30 am | Permalink

thanks

2. Posted February 9, 2013 at 12:12 pm | Permalink

Hah! No frequentist versus Bayesian arguing? You are wearing your frequentist hat? I don’t believe you. Twenty times bitten, now shy. The statistics blogosphere has been corrupted by those ideologically fickle, relativist econo-bloggers. Their contentious debate has sullied English language statistical blogs. Thank G_d for German statisticians! THEY know the importance and continuing relevancy of frequentist methods for the physical sciences, manufacturing, reliability analyses and so forth. THEY are respectful, formal and polite.

English-language statistical blogs have become obsessed with social science applications of statistical methods in a futile attempt to forecast human behavior and complex systems that control theory was never intended for. When human behavior and human behavior-driven systems such as the global economy or U.S. elections have illogical outcomes, everyone blames statisticians, especially frequentists. We “lack the honesty of Bayesians who acknowledge their priors”. As if to say that frequentists are morally bankrupt and deceptive. We are NOT!

Sorry. Professor Andrew Gelman was my hero. And Professor Cosma Shalizi. How does that nice Professor Deborah Mayo remain so tolerant and patient? I trust her.

3. Posted February 9, 2013 at 12:27 pm | Permalink

So for an approximate confidence interval, is there an easy way to communicate what it means? It doesn’t mean “probability 0.05 that this interval would fail to cover the true parameter”. I guess what we’d like it to mean is “well, it’s pretty close to probability 0.05 that it fails to cover.” How often is this statement really true in the uses mentioned in the post? And when it is false, is the problem the normal approximation, or some other cause? (My guess would be that a larger culprit is model mismatch: it’s a somewhat meaningless statement that a regression coefficient falls in some interval if the data don’t follow the claimed model anyway.)

• Posted February 9, 2013 at 1:44 pm | Permalink

Well, as I say, science is full of approximations. We just live with it. You are pretty sure that the Gates building won’t collapse when you go to work. How sure? Hard to put an exact number on it. But you go to work anyway.

• Posted February 9, 2013 at 5:20 pm | Permalink

I agree with the need to make approximations; I’m just asking how one can tell when it’s relatively safe vs. dangerous to make them. The chief thing that convinces me that GHC will likely stand is that modern buildings don’t fall down all that often, and I agree that I don’t have to put an exact number on it as long as I’m convinced it’s very small. (Perhaps my subconscious is using the Hoeffding lemma?) Our record with confidence intervals is pretty bad by comparison — if buildings fell down as often as the historical amount by which confidence intervals have exceeded their nominal failure probabilities, going to work would be a much riskier adventure. So, it makes sense to ask what are the causes of the approximation failures. (My guess above is that CLT is a relatively small culprit.)

• Posted February 9, 2013 at 5:34 pm | Permalink

I agree, the CLT is probably not the culprit. The biggest culprit, I think, is unknown biases.

4. Posted February 9, 2013 at 2:11 pm | Permalink

Perhaps an even stronger argument for CLT-based confidence intervals is that they are simple to understand (at a superficial level) and easy to compute. You can teach them to a wide range of researchers, who will make not-outrageously-bad conclusions based on them.

• Posted February 16, 2013 at 10:41 pm | Permalink

Yes, perhaps this is the reason.

5. Ken
Posted February 9, 2013 at 5:51 pm | Permalink

Anyone who has seen the Terminator movies will know that this will end badly. Seriously, profile likelihood works very well for many problems and is not hard to calculate. I’m surprised that the machine learning people are that worried: they usually have large n and, as mentioned, violations of assumptions are much more important.

• Posted February 16, 2013 at 10:44 pm | Permalink

Ken,
See my comments below: What you may think is large may not be so large. But I am not sure that machine learning people are indeed that worried. It just seems to be a cultural difference. Oh, and the bounds derived by machine learning people often have a different purpose: to characterize minimax rates for some class of problems or some procedure.

6. Posted February 9, 2013 at 8:11 pm | Permalink

Larry, thanks for another excellent post. Will you follow up with one on asymptotic refinement and the bootstrap?

• Posted February 9, 2013 at 8:12 pm | Permalink

I’d add that to the list!

• Anon
Posted February 12, 2013 at 1:13 pm | Permalink

Yes, I’ve wondered if we can provide similar bounds for the approximations made by bootstrap-based confidence intervals?

7. Christian Hennig
Posted February 11, 2013 at 8:16 am | Permalink

I may have missed something, but does the machine learning community indeed have proper finite-n theory for all kinds of stuff? OK, there are some general inequalities that give you very very large intervals for small to moderate n that are rather useless for applications, and, OK, with a few million observations these may be just fine because precision *given the model* is not the problem anymore anyway, neither for the statistician…

• Posted February 16, 2013 at 10:14 pm | Permalink

If the goal is to have small intervals, can’t I just divide my intervals by f(n) (n is the sample size), where f(n) is huge for n small and f(n) goes to one as n goes to infinity? I will have very small intervals, which are asymptotically correct..

8. CJ
Posted February 13, 2013 at 2:46 pm | Permalink

For those not in the know: Any chance someone could provide a reference/example about finite-sample confidence intervals? Being someone with a more pure stats background I’ve only seen the CLT derived asymptotic normal confidence intervals.

• Posted February 16, 2013 at 10:50 pm | Permalink

9. Posted February 16, 2013 at 10:40 pm | Permalink

Sure, you can call me by my first name, especially now that I have started a war with your help LOL.
Concerning the point that every method uses assumptions: true. However, fewer assumptions make the conclusions more robust. Robustness may be hard to quantify, though.
Another comment said that in machine learning there is plenty of data, so the asymptotics should really “kick in”. This is true sometimes, but not always. The amount of data is not an absolute quantity: garbage in, garbage out. Also, nonparametric rates with minimal assumptions in high dimensions predict that the asymptotics will never happen. Or the distribution that you start with could be really skewed. Here is an illustration of how badly the CLT can fail for sample sizes that some might think are “large”. This is an example I learned from my wife when she was teaching risk theory (in a statistics department!): Imagine that n=1000 young fellows take out a life insurance policy for a period of one year. The probability of dying within the said year is p=0.001. The payment for every death is one dollar. What is the probability that the total payment is at least four? The exact number is something like 0.02, while the CLT approximation is 0.006 (I am looking this up in a book). That is a pretty big difference; no insurance company would want to go with a CLT approximation in this case. Of course, “everyone knows” that this is the case of small p in a binomial(n,p) distribution, so one would be better off using a Poisson approximation (indeed, the Poisson approximation will be pretty good). Anyhow, my point is that fewer assumptions are better than more, and if someone uses an approximation, they had better think about the error it introduces. The error could be large.
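The insurance example above can be checked numerically. What follows is a minimal sketch using only the Python standard library; the function names are hypothetical, and the use of a continuity correction is an assumption about how the quoted CLT figure of 0.006 was obtained.

```python
import math

def binom_tail(n, p, k):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k))
    return 1.0 - below

def poisson_tail(lam, k):
    """P(Y >= k) for Y ~ Poisson(lam)."""
    below = sum(math.exp(-lam) * lam**j / math.factorial(j) for j in range(k))
    return 1.0 - below

def normal_tail(n, p, k, continuity=True):
    """CLT approximation to P(X >= k) for X ~ Binomial(n, p).

    ASSUMPTION: continuity=True shifts the cutoff to k - 0.5.
    """
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    cutoff = k - 0.5 if continuity else k
    z = (cutoff - mu) / sigma
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard Normal

if __name__ == "__main__":
    n, p, k = 1000, 0.001, 4
    print("exact binomial :", binom_tail(n, p, k))
    print("Poisson approx :", poisson_tail(n * p, k))
    print("Normal (cc)    :", normal_tail(n, p, k))
    print("Normal (no cc) :", normal_tail(n, p, k, continuity=False))
```

Under these assumptions the exact tail and the Poisson approximation both come out near 0.019, the Normal approximation with continuity correction near 0.006, and the uncorrected Normal approximation smaller still, which agrees with the figures quoted in the comment.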

• anon
Posted February 17, 2013 at 2:59 pm | Permalink

but why would one use an approximation when it is known how to calculate exactly?

• Posted February 17, 2013 at 3:27 pm | Permalink

because often it isn’t known how to calculate it exactly

10. Posted February 18, 2013 at 8:23 pm | Permalink

Reblogged this on Stats in the Wild.

11. Posted February 18, 2013 at 10:52 pm | Permalink

Reblogged this on lava kafle kathmandu nepal.

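1. […] The well-known clickbait headline writer Larry Wasserman (professor of statistics and machine learning at Carnegie Mellon) published a post titled “Statistics Declares War on Machine Learning.” It really just explains “why on earth those damned statisticians always use Normal approximations for interval estimates.” […]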

2. By (Bi-)weekly links for February 18 « God plays dice on February 18, 2013 at 8:01 pm

[…] Larry Wasserman: statistics declares war on machine learning. […]

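3. […] The well-known clickbait headline writer Larry Wasserman (professor of statistics and machine learning at Carnegie Mellon) published a post titled “Statistics Declares War on Machine Learning.” It really just explains “why on earth those damned statisticians always use Normal approximations for interval estimates.” […]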

4. By Somewhere else, part 38 | Freakonometrics on March 15, 2013 at 10:07 pm

[…] Declares War on Machine Learning!” https://normaldeviate.wordpress.com/ … by […]

5. By Lies, damn lies | bluedeckshoe.com on March 21, 2013 at 4:25 pm

[…] Statistics Declares War on Machine Learning! (normaldeviate.wordpress.com) […]