– by Larry Wasserman

What is the biggest open problem in statistics or machine learning?

This question has come up a number of times in discussions I have had

with various people. (Most recently, it came up when I met with students and post-docs at CSML

where I had a very pleasant recent visit.)

In many fields it’s easy to come up with candidates. Some examples are:

Computer Science: Prove that P is not equal to NP.

Pure Math: Prove the Riemann hypothesis.

Applied Math: Solve the Navier-Stokes existence and smoothness problem.

Physics: Unify general relativity and quantum mechanics.

Or better yet … explain why there is something instead of nothing.

You can probably think of other choices, but my point is that there ARE some obvious choices.

There are plenty of unsolved problems in statistics and machine

learning. Indeed, wikipedia has a page called Unsolved problems in Statistics.

I admire the author of this wikipedia page for even attempting this. It is not his or her fault that the list is pathetic. Some examples

on the list include: the admissibility of the Graybill-Deal estimator (boring), how to detect and correct for systematic errors (important

but vague), the sunrise problem: What is the probability that the sun will rise tomorrow? (you’ve got to be kidding). I’m sure we could

come up with more interesting problems of a more technical nature. But I can’t think of one problem that satisfies the following

criteria:

1. It can be stated succinctly.

2. You can quickly explain it to a non-specialist and they will get the gist

of the problem.

3. It is sexy. In other words, when you explain it to someone they

think it is cool, even if they don’t know what you are talking about.

Having sexy open problems is more than just an amusement; it helps raise

the profile of the field.

Can anyone suggest a good open problem that satisfies the three criteria?

### Like this:

Like Loading...

*Related*

## 14 Comments

How can we know we are saying anything about the world when we know all models are wrong? (Or more concretely, is there a way to construct models of our models, such that we can make statistical statements about how wrong our models can be, under bounded but general modelling failures?)

… or more generally, post-selection inference. Trying to do this as a frequentist makes for very cumbersome interpretations – as in sequential clinical trials; trying to do it as a Bayesian one runs into claims that Bayesians are immune, which (in modern situations featuring stringent selection, e.g. GWAS) is not tenable.

Perhaps this is a silly one, but I’d like to have a general purpose algorithm for translating between models stated using regularization and models stated using priors. This is easy for models like ridge regression, but it doesn’t seem as easy for models using, for example, CRP or IBP priors.

I’m also interested in this. A reasonable approach is to (1) study how variational inference works with these priors, (2) to see the variational inference updates as some sort of gradient descent (3) and then use this to infer what kind of function these gradients would minimize. For instance, it’s not really hard to see IBP’s as some sort of ridge regression + logistic regression if you actually look at the variational inference updates.

We often get useful answers from models that make assumptions which we know to be wrong (e.g., linearity, independence, Gaussian-distributed). Linear regression, PCA, and Kalman filtering are some well-known examples. These models are useful in the sense that we can often use them to make accurate predictions. Conversely, there are many models that make more reasonable assumptions but often give the wrong answer. Is there a theory that tells us when to make wrong assumptions and when to make “less wrong” assumptions? A hierarchical Bayesian model won’t help here, since it is making its own assumptions.

Sam,

We actually discuss this in Section 9.3 of Bayesian Data Analysis. We offer no general results, but, in the example we consider, we are able to pin down the effectiveness of the crude-and-wrong-but-it-works method to a particular unstated aspect of the problem (in this case, the wrong-but-effective method was a normal distribution, and the key unstated aspect of the problem was that the space was bounded so there was no need to worry about extreme tail events), and we were able to rule out the reasonable-and-more-general-but-gives-wrong-answer method using posterior predictive checks.

So here’s the short answer:

1. If a simple and wrong method works, it can be helpful to try and understand what aspects of the problem make it work. Don’t be so quick to assume that just because a method is quick and dirty, that it will work.

2. Check fit of your models. Some of the worst things won’t happen if the problems are caught in the model-checking stage.

I really like this post. Though I never have consciously thought about it, this has been there somewhere at the back of my mind.

Michael Jordan (the Statistician one) proposed a list of open problems in Bayesian statistics in the 2011 ISBA Bulletin. This is the link

http://www4.stat.ncsu.edu/~reich/st740/Bayesian_open_problems.pdf

…ok, none of the following can really be labeled as the “biggest unsolved stat problem” of the century…lets just say they bug me a bit…

1. For every single stat problem that comes to my mind, there’re plenty of (adaptive-) minimax-rate-optimal procedures…theoretically indistinguishable but quite different in practice. Can we do any better than this? On a related note: in this kind of analysis we usually need to bound some free/smoothing parameter to get an estimator with the correct rate. Have you ever tried to use this “optimal” values in practice??? They (always?) suck! To me, this just means there’s a nonignorable gap between current theory and practice…maybe I’m missing something…

2. **Practically** useful nonparametric **credible** sets with guaranteed frequentist coverage…where are you??? In a sense, this is my very-personal-statistical-incarnation of the Higgs boson… =)

3. …is there life beyond James-Stein? I mean…to some extend (and of course IMHO), the main (only?) paradigm at work in today theory and practice can be summarized in one word: shrink…in a way (priors) or another (penalties/”regularizers/etc”). Anything else at the horizon?

…I think I had something else in mind but…well, way better if I stop here…

Ciao Larry

–P

If you are looking for theoretical previously unsolved problems, then I have no idea. But I think the most interesting problem in science is “the file drawer problem” and the related “multiple comparison problem” and their relation to public policy.

1) Succinct Statement. There is little to no public information on failed experiments, because they are rarely submitted to be published. Much time is wasted trying to replicate the unreplicateable because we have no idea how many attempts were made before the scientist who rejected a Null finally got a result she/he wanted. There should be rules on this..

2) Explanation to non specialist. Imagine you have coin and you are trying to determine if it is a fair coin (50-50 heads or tails) or weighted toward heads. A scientist says he will test the coin for you. An hour later he comes back to you and says he flipped the coin 10 times and got 9 heads. He tells you the odds of that happening by chance is 1% if it were a fair coin, which is true. Therefore it is a weighted coin he tells you.

What he did not tell you is that he tried the experiment 100 times and only one was as high as 9 heads. In reality, the odds were not 1% but as high as 99% one of those experiments would get 9 heads if it were a fair coin.Therefore, the coin was not really weighted toward heads.

While this is an exaggerated example, many scientific experiments are the equivalent of this example, either by ignorance or design. We are often not told how many times different or multiple tests were conducted before a given experiment was successful.

3) Why is this interesting? Many models which portray themselves as science are really the equivalent of this coin experiment. They include almost all fields of study in the social and other sciences, including macroeconmics, politics, climate science, nutrition, and pharmaceuticals. These bad studies are ecouraged because grants are rewarded for finding discoveries, not rejecting them. This leads to a disproportional share of grants being driven by those who incentivised to be biased. My rule of thumb on surprising results is they are wrong due to this problem.

What a fun question!

For machine learning here’s an easy one: how can we create a machine that learns like a human?

For statistics/physics: which processes in this universe are purely deterministic, and which are fundamentally unknowable?

Let me suggest a few more characteristics that are generally common to the biggest open problems of those other fields. The problems are so difficult that very few researchers actually work on solving them. The problems themselves are rather well understood, and in some cases we even have a good understanding of the nature of the gap between our current knowledge and the solution to the problem (yet we don’t know how to bridge the gap). The majority of research in the field can carry on without really worrying about these open problems (e.g. explaining why there is something instead of nothing will not change the fact that there is something; and computer algorithms today are designed for NP-complete problems without too much concern about the possibility that such problems are actually in P).

In this spirit, my nomination for the biggest open problem in statistics is causal inference.

And for machine learning my nomination is synthesis of learning (using synthesis in the Bloom’s taxonomy sense that an educator would use about human learning). This problem has been explored in a limited form as lifelong learning or even more limited as transfer learning. One big problem with current machine learning is that learning occurs in isolated domains/contexts where given enough data, learning saturates. A new domain or context can be introduced and learning starts over (at best with a small headstart if the new problem has at least the same type of input features as the previously learned problem).

On the computational learning theory side, PAC learning DNFs in polynomial time has been open since Valiant’s original paper “A Theory of the Learnable”, and has motivated a lot of work in the field. I think this should satisfy your criteria, as long as one can succinctly justify the importance of the concept class of DNFs (Valiant does a pretty good job himself, saying: “Such expressions appear particularly easy for humans to comprehend. Hence we expect that any practical learning system would have to allow for them.”)

if you randomly choose an option to this question wht is chance tht it will be right?? options are a.25% b.50% c.75% d.25%