Monthly Archives: June 2012

Self-Repairing Bayesian Inference

Peter Grunwald gave a talk in the statistics department on Monday. Peter does very interesting work and the material he spoke about is no exception. Here are my recollections from the talk.

The summary is this: Peter and John Langford have a very cool example of Bayesian inconsistency, much different than the usual examples of inconsistency. In the talk, Peter explained the inconsistency and then he talked about a way to fix the inconsistency.

All previous examples of inconsistency in Bayesian inference that I know of have two things in common: the parameter space is complicated and the prior does not put enough mass around the true distribution. The Grunwald-Langford example is much different.

Let {{\cal P}} be a countable parameter space. We start with the very realistic assumption that the model {{\cal P}} is wrong. That is, the true distribution {P_*} is not in {{\cal P}}. It is generally believed in this case that the posterior concentrates near {Q_*}, the distribution in {{\cal P}} closest (in Kullback-Leibler distance) to {P_*}. In Peter and John’s example, the posterior edoes not concentrate around {Q_*}. What is surprising, is that this inconsistency holds, even though the space is countable and even though the prior puts positive mass on {Q_*}. If this doesn’t surprise you, it should.

On the other hand, there are papers like Kleijn and van der Vaart (The Annals of Statistics, 2006, pages 837–877) that show that the posterior does indeed concentrate around {Q_*}. So what is going on?

The key is that in the Grunwald-Langford example, the space {{\cal P}} is not convex. (More precisely, the projection of {P_*} onto {{\cal P}} does not equal the projection of {P_*} onto the convex hull of {{\cal P}}.)

You can fix the problem by replacing {{\cal P}} with its convex hull. But this is not a good fix. To see why, suppose that each {p\in {\cal P}} corresponds to some classifier {h} in some set {{\cal H}}. If {p_1,p_2\in {\cal P}} correspond to two different classifiers {h_1, h_2}, the mixture {(p_1 + p_2)/2} might not correspond to any classifier in {{\cal H}}. Forming mixtures might take you out of the class you are interested in.

Instead, Peter has a better fix. Instead of using the posterior distribution, use the generalized posterior {g(\theta)\propto \pi(\theta) L(\theta)^\eta} where {\pi} is the prior, {L} is the likelihood and {0\leq \eta \leq 1} is a constant that can change with sample size {n}. It turns out that there is a constant {\eta_{\rm crit}} such that the generalized posterior is consistent as long as {\eta < \eta_{\sf crit}}.

The problem is that we don’t know {\eta_{\sf crit}}. Now comes a key observation. Suppose you wanted to predict a new observation. In Bayesian inference, one usually uses the predictive distribution {p(y|{\rm data}) = \int p(y|\theta) p(\theta|{\rm data})d\theta} which is a mixture with weights based on the posterior. Consider instead predicting using {p(y|\theta)} where {\theta} is randomly chosen from the posterior. Peter calls these “mixture prediction” and “randomized prediction.” If the posterior is concentrated, then these two types of prediction are similar. He uses the difference between the mixture prediction and the randomized prediction to estimate {\eta}. (More precisely, he builds a procedure that mimics a generalized posterior based on a good {\eta}.)

The result is a “fixed-up” Bayesian procedure that automatically repairs itself to avoid inconsistency. This is quite remarkable.

Unfortunately, Peter has not finished the paper yet so we will have to wait a while for all the details. (Peter says he might start is own blog so perhaps we’ll be able to read about the paper on his blog.)

Postscript: Congratulations to Cosma for winning second place in the best science blog contest.

— Larry Wasserman

Ockham’s Razor

From Friday to Sunday I attended a Philosophy conference on the
Foundations for Ockham’s Razor
. Fellow bloggers Deborah Mayo (a.k.a. the
frequentist in exile) and Cosma Shalizi The conference is organized by Kevin Kelly.

Here is a conference photo (due to B. Kliban):

Cosma and Deborah are giving blow by blow details on their blogs. I’ll just make a few general comments.

The idea of the workshop is to bring together philosophers, statisticians, computer scientists, mathematicians etc to talk about
one of the “principles” that comes up in statistics (and more generally in science), namely, simplicity. In particular, we were
there to discuss the principle called Ockham’s razor.

There was plenty of agreement (at least at this workshop) that nature
can be very complex and yet it is still useful (in many cases) to bias
statistical procedures towards simpler models. Making this precise
usually involves invoking measures of complexity such as: VC
dimension, covering numbers, Rademacher complexity, effective degrees
of freedom and so on. And there are many methods for choosing the
degree of complexity, including: AIC, BIC, cross-validation,
structural risk minimization, Bayesian methods, etc. Nevertheless,
providing a philosophical foundation for all this can be challenging.

Here is a brief summary of the talks. (The full lectures will be
posted online.)

The conference began with a talk by Vladimir
Vapnik.
It is hard to overstate Professor Vapnik’s influence in
machine learning and statistics. His fundamental work on uniform laws
of large numbers, as well as support vector machines and
kernelization, is legendary. He talked about an extension of support
vector machines but there were also hints of eastern mysticism and poetry. Next
was Vladimir Cherkassky. (This was the “Russians named Vladimir”
session.) He also talked about VC-based ideas.

Peter Grunwald was next. Since Peter is a statistician (as well as a computer
scientist) I found his talk easier to understand. He talked about
probability-free inference (i.e. uniform over all data sequences).
These are procedures that work well if your hypothesized model (or
prior) is right. But they still have guarantees if your model is
wrong. In fact, they satisfy guarantees even if the data are not
generated by any stochastic mechanism. (I’ll blog about this type of
inference in detail at some point.)

My talk was on examples where standard tuning parameter methods
like cross-validation fail, because they choose overly complex models.
My claim was that, in many cases, we sill don’t have provably good methods
of choosing models with the right degree of complexity.
(Density estimation when the support of the distributions is
concentrated near a low dimensional manifold is an example.)

Elliot Sober talked about notions of parsimony in constructing
phylogenetic trees. There seems to be some disagreement about whether
simplicity of these trees shold be defined in terms of likelihood or
in terms of other measures (like how many muttaions must occur to get
a particular tree).

Unfortunately I missed the talks Saturday morning but I caught the
tail end of Kevin Kelly’s talk. Kevin has spent years developing a
methematical, probability-free account, of the properties of methods
that select “good” models. His work is quite difficult; there are
lots of mathematical details. I think he has some novel ideas but
they are in a language that is hard for statisticians to understand.
I urged Kevin to write a paper for a statistics journal expaining his
theory to statisticians.

Next was Cosma on complex stochastic processes; see his papers for the details.

Hannes Leeb talked about the lack of uniform consistency in model
selection procedures. (The reason for this behavior is familiar if
you remember the Hodges estimator; basically, if a parameter
is of order O(1/\sqrt{n}) from 0, then it is statistically
difficult to distinguish from 0. This screws up uniform consistency.)

Malcolm Forster gave a historical perspective, discussing Ptolemy, Copernicus, Kepler and Tycho Brahe.

Digression: Tycho Brahe is one of my favorite historical characters.
Kevin tells me that the name is pronounced “Tooko Bra.”
Other sources claim it is “Teeko Bra.”
As you may recall, Tycho had a silver nose, having lost his real nose in a duel.
He died from a burst bladder during a drinking binge. He owned a pet
elk (or moose?) which died from drinking too much beer. For more on
this colorful character see here.

The last talk was Deborah Mayo, the frequentist in exile, on “Error
Correction, Severity, and Truth: What’s Simplicity Got to Do With
it?” Deborah argued that scientific inference is about exploring and
probing various hypotheses and cannot be reduced to something as
simple as “choose the simplest model consistent with the data.”
To undertsand her viewpoint, check out her blog or her book.

The conference ended with a round-table discussion. Professor Vapnik
started with long comment about the need to focus on quanities that
can be estimated well (i.e. with uniform probability bounds). He
suggested that statisticians were led astray by Fisher and maximum
likelihood because many things that we are trying to esimate in
statistics are, in a sense, ill-posed. At least, I think that was his
message. I am sympathetic to this viewpoint but felt it was a bit
over-stated. Several of us responded and tried to defend traditional
statistical theory. But is was all very cordial and it was great fun
getting to have a panel discussion with Vladmir Vapik.

We were all supposed to answer the question
“what is Ockham’s razor.”
To me it is just the vague (but useful)
notion that we should not, generally speaking,
fit models that are too complicated.
In other words, don’t overfit.
I don’t think there is anything controversial about that.
But if you try to view it as a more formal
principle, then there are bound to be
differences of opinion.

How would you define Ockham’s razor?

—Larry Wasserman

The Biggest Unsolved Problem

– by Larry Wasserman

What is the biggest open problem in statistics or machine learning?

This question has come up a number of times in discussions I have had
with various people. (Most recently, it came up when I met with students and post-docs at CSML
where I had a very pleasant recent visit.)

In many fields it’s easy to come up with candidates. Some examples are:

Computer Science: Prove that P is not equal to NP.

Pure Math: Prove the Riemann hypothesis.

Applied Math: Solve the Navier-Stokes existence and smoothness problem.

Physics: Unify general relativity and quantum mechanics.
Or better yet … explain why there is something instead of nothing.

You can probably think of other choices, but my point is that there ARE some obvious choices.

There are plenty of unsolved problems in statistics and machine
learning. Indeed, wikipedia has a page called Unsolved problems in Statistics.

I admire the author of this wikipedia page for even attempting this. It is not his or her fault that the list is pathetic. Some examples
on the list include: the admissibility of the Graybill-Deal estimator (boring), how to detect and correct for systematic errors (important
but vague), the sunrise problem: What is the probability that the sun will rise tomorrow? (you’ve got to be kidding). I’m sure we could
come up with more interesting problems of a more technical nature. But I can’t think of one problem that satisfies the following
criteria:

1. It can be stated succinctly.

2. You can quickly explain it to a non-specialist and they will get the gist
of the problem.

3. It is sexy. In other words, when you explain it to someone they
think it is cool, even if they don’t know what you are talking about.

Having sexy open problems is more than just an amusement; it helps raise
the profile of the field.

Can anyone suggest a good open problem that satisfies the three criteria?

George Casella

Today I learned that my dear friend George Casella has passed away.

George is well known for many things: his influential textbooks, his
research on decision theory, his research on MCMC, his good
statistical sense, his service work, his crystal clear lectures, … I
could go on and on. But there was something else about George that
made him very special. It’s very hard to put into words but, put
simply, he was probably the most positive, optimistic person I have
ever known.

I first met George about 24 years. He was visiting Purdue at the time
and I went there to give a talk. About five minutes after I met him, I
felt like I had known him for years. Shortly after that, he invited me
to a workshop he had organized at Cornell. To this day, those of us
who attended the workshop still refer to it as “Camp Casella.” We
did work. We gave talks. We listened to talks. We drank. We ate. We
had fun and we learned a lot. I remember one morning he knocked on my
door very early in the morning. I was surely hung over. But George
insisted we go for a five mile run. It was a grueling run involving
some of Ithaca’s steep hills. I was destroyed. George barely seemed
winded.

I saw George last Fall when I visited the University of Florida. He had
just finished a round of chemotherapy. But that did not stop coming
in to work. And he was, as always, brimming with enthusiasm. When he
told me the many things he was doing (including supervising and large
number of students) I felt like a loafer. In January, he visited us in
Pittsburgh (with his wonderful daughter Sarah) and we had a great
time. Talking about statistics while eating, drinking wine, and
drinking scotch. It doesn’t get better than that.

I have known George since 1988. It is hard to imagine the field of
statistics without him. We will miss you my friend.

My deepest sympathies to Anne, Sarah and Benjamin.

(Note: more about George’s inestimable influence at Christian’s blog. )

Causation

—Larry Wasserman

In this post, I will discuss something elementary yet important. Causation versus Association.

Although it is well-worn territory, the topic of causation still causes enormous confusion. The media confuse correlation and causation all the time. In fact, it is common to see a reporter discuss a study, warn the listener that the result is only an association and has not been proved to be causal, and then go on to discuss the finding as if it is causal. It usually goes something like this:

“A study reports that those who sleep less are more likely to have health problems.”

So far so good.

“Researchers emphasize that they have not established a causal connection”

Even better, no claim of causation. But then:

“So make sure you get sufficient sleep to avoid these nasty health problems.”

Ouch, there it is. The leap from association to causation.

What’s worse is that even people trained to know better, namely statisticians and ML people, make the same mistake all the time. They will teach the difference between the two in class and then, a minute after leaving class, fall back into the same fog of confusion as the hypothetical reporter above. (I am guilty of this too.) This just shows how hard-wired our brains are for making the causal leap.

There are (at least) two formal ways to discuss causation rigorously: one is based on counterfactuals and the other is based on casual directed acyclic graphs (DAG’s). They are essentially equivalent. Some things are more easily discussed in one langauge than the other. I will use the language of DAG’s here.

Consider a putative cause {X} and a response {Y}. Let {U} represent all variables that could affect {X} or {Y}. To be concrete, let’s say {X} is stress and {Y} is getting a cold. The variables {U} is a very high-dimensional vector including, genetic variables, environmental variables, etc. The elements of {U} are called confounding variables.

The causal DAG looks like this:

Suppose we only observe {X} and {Y} on a large number of people. {U} is unobserved.

The DAG has several implications. First, the distribution {p(u,x,y)} factors as {p(u)p(x|u)p(y|x,u)}. Well, that’s a pretty vacuous statement but let’s keep going. The association between {Y} and {X} is described by the conditional distribution {p(y|x)}. This distribution can be consistently estimated from the observed data. No doubt we will see an association between {Y} and {X} (that is, {p(y|x)} will indeed be a function of {x}). As usual,

\displaystyle p(y|x) = \frac{p(y,x)}{p(x)}.

The causal distribution is the distribution we get — not by conditioning — but by intervening and changing the graph. Specifically, we break the arrow into {X} and we fix {X} at a value {x}. The new graph is

The joint distribution for this graph is

\displaystyle p(u)\delta_X(x) p(y|x,u)

where {p(x|u)} has been replaced with a point mass distribution at {X=x}. The causal distribution is the marginal distribution of {Y} in the new graph, which is,

\displaystyle \int p(y|x,u) p(u) du .

Using the language of Spirtes, Glymour and Scheines and Pearl, we can summarize this as:

\displaystyle p(y|{\rm Observe}\, X=x) = \frac{p(y,x)}{p(x)}

while

\displaystyle p(y|{\sf Set}\,X=x) = \int p(y|x,u) p(u) du .

We immediately deduce the following:

1. They are different. This is just the formalization of the fact that causation is not association.

2. If there is no arrow from {X} to {Y} in the original graph we will find that {p(y|{\rm Observe}\, X=x)} depends on {x} but {p(y|{\rm Set}\, X=x)} does not depend on {x}. This is the common case where there is no causal relationship between {Y} and {X} yet we see a predictive relationship between {X} and {Y}. This is grist for many bogus stories in CNN and the NY Times.

3. The causal distribution {p(y|{\rm Set}\, X=x)} is not estimable. It depends on {p(y|x,u)} and {p(u)} but {U} is not observed.

4. The reason why epidemiologists collect data on lots of other variables is that they are trying to measure {U} or at least, measure some elements of {U}. Then they can estimate

\displaystyle p(y|{\sf Set}\,X=x) = \int p(y|x,u) p(u) du .

This is called, adjusting for confounders. Of course, they will never measure all of {U} which is why observational studies, though useful, must be taken with a grain of salt.

5. In a randomized study, where we assign the value of {X} to subjects randomly, we break the arrow from {U} to {X}. In this case, it is easy to check that

\displaystyle p(y|{\sf Set}\,X=x) = p(y|x).

In other words, we force association to equal causation. That’s why we spend millions of dollars doing randomized studies and it’s why randomized studies are the gold standard.

This raises an interesting question: have there been randomized studies to followup results from observaional studies? In fact, there have. In a recent article, (Young and Karr, 2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked, of 52 claims, how many were verified by the randomized studies? The answer is depressing: zero.

That’s right, 0 out of 52 effects turned out to be real.

We should be careful not to over-generalize here because these studies are certainly not a representative sample of all studies. Still, 0/52 is sobering.

In a future post, I will discuss Simpsons paradox. Here is a preview. Suppose {U} is observed. If there is an arrow from {U} to {X} then {p(y|{\sf Set}\,X=x) \neq p(y|x)} but when there is no arrow from {U} to {X} then {p(y|{\sf Set}\,X=x) = p(y|x)}. Nothing complicated. But when we change the math into words, people— including most statisticians and computer scientists— get very confused. I’ll save the details for a future post.

Freedman’s Neglected Theorem

—Larry Wasserman

In this post I want to review an interesting result by David Freedman (Annals of Mathematical Statistics, Volume 36, Number 2 (1965), 454-456) available at projecteuclid.org.

The result gets very little attention. Most researchers in statistics and machine learning seem to be unaware of the result. The result says that, “almost all” Bayesian prior distributions yield inconsistent posteriors, in a sense we’ll make precise below. The math is uncontroversial but, as you might imagine, the intepretation of the result is likely to be controversial.

Actually, I had planned to avoid all “Bayesian versus frequentist” stuff on this blog because it has been argued to death. But this particular result is so neat and clean (and under-appreciated) that I couldn’t resist. I will, however, resist drawing any philosophical conclusions from the result. I will merely tell you what the result is. Don’t shoot the messenger!

The paper is very short, barely more than two pages. My summary will be even shorter. (I’ll use slightly different notation.)

Let {X_1,\ldots, X_n} be an iid sample from a distribution {P} on the natural numbers {I=\{1,2,3,\ldots, \}}. Let {\Omega} be the set of all such distributions. We endow {\Omega} with the {{\rm weak}^*} topology. Hence, {P_n \rightarrow P} iff {P_n(i) \rightarrow P(i)} for all {i}.

Let {\mu} denote a prior distribution on {\Omega}. (More precisely, a prior on an appropriate {\sigma}-field, namely the Borel sets generated by the discrete topology.) Let {\Pi} be all priors. We endow the set of priors with the {{\rm weak}^*} topology. Thus {\mu_n\rightarrow \mu} iff {\int f d\mu_n \rightarrow \int f d\mu} for all bounded, continuous, real functions {f}.

Let {\mu_n} be the posterior corresponding to the prior {\mu} after {n} observations. We will say that the pair {(P,\mu)} is consistent if

\displaystyle  P^\infty ( \lim_{n\rightarrow\infty} \mu_n = \delta_P)=1

where {P^\infty} is the product measure corresponding to {P}, and {\delta_P} is a point mass at {P}.

Now we need to recall some topology. A set is nowhere dense if its closure has an empty interior. A set is meager (or first category) if it is a countable union of nowehere dense sets. Meager sets are small; think of a meager set as the topological version of a null set in measure theory.

Freedman’s theorem is: the sets of consistent pairs {(P,\mu)} is meager.

This means that, in a topological sense, consistency is rare for Bayesian procedures. From this result, it can also be shown that most pairs of priors lead to inferences that disagree. (The agreeing pairs are meager.) Or as Freedman says in his paper:

“ … it is easy to prove that for essentially any pair of Bayesians, each thinks the other is crazy.”

On the frequentist side, convergence is straightforward here.
Indeed, if p denotes the mass function and p_n the empirical mass function then

\sup_P P^n(||p-p_n||_\infty > \epsilon) \leq 4 e^{-n\epsilon^2/2}

In fact, even stronger statements can be made; see the recent paper by
Daniel Berend and Leo (Aryeh) Kontotovich
(paper is here).

As a postscript, let me add that David Freedman, who died in 2008, was a statistician at Berkeley. He was an impressive person whose work spanned from the very theoretical to the very applied. He was a bit of a curmudgeon, which perhaps lessened his influence a little. But he was a deep thinker with a healthy skepticism about the limits of statistical models, and I encourage any students reading this blog to seek out his work.

Statistics Versus Machine Learning

—Larry Wasserman

Welcome to my blog, which will discuss topics in Statistics and Machine Learning. Some posts will be technical  and others will be non-technical. Since this blog is about topics in both Statistics and Machine Learning, perhaps I should address the question: What is the difference between these two fields?

The short answer is: None. They are both concerned with the same question: how do we learn from data?

But a more nuanced view reveals that there are differences due to historical and sociological reasons. Statistics is an older field than Machine Learning (but young compared to Math, Physics etc). Thus, ideas about collecting and analyzing data in Statistics are rooted in the times before computers even existed. Of course, the field has adapted as times have changed but history matters and the result is that the way Statisticians think, teach, approach problems and choose research topics is often different than their colleagues in Machine Learning. I am fortunate to be at an institution (Carnegie Mellon) which is active in both (and I have appointments in both departments) so I get to see the similarities and differences.

If I had to summarize the main difference between the two fields I would say:

Statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low dimensional problems.

Machine Learning emphasizes high dimensional prediction problems.

But this is a gross over-simplification. Perhaps it is better to list some topics that receive more attention from one field rather than the other. For example:

Statistics: survival analysis, spatial analysis, multiple testing, minimax theory, deconvolution, semiparametric inference, bootstrapping, time series.

Machine Learning: online learning, semisupervised learning, manifold learning, active learning, boosting.

But the differences become blurrier all the time. Check out two flagship journals:

The Annals of Statistics and
The Journal of Machine Learning Research.

The overlap in topics is striking. And many topics get started in one field and then are developed further in the other. For example, Reproducing Kernel Hilbert Space (RKHS) methods are hot in Machine Learning but they began in Statistics (thanks to Manny Parzen and Grace Wahba). Similarly, much of online learning has its roots in the work of the statisticians David Blackwell and Jim Hannan. And of course there are topics that are highly active in both areas such as concentration of measure, sparsity and convex optimization. There are also differences in terminology. Here are some examples:

Statistics       Machine Learning

———————————–

Estimation     Learning

Classifier      Hypothesis

Data point    Example/Instance

Regression   Supervised Learning

Classification Supervised Learning

Covariate      Feature

Response    Label

 

and of course:

Statisticians use R.

Machine Learners use Matlab.

 

Overall, the the two fields are blending together more and more and I think this is a good thing.

 

Follow

Get every new post delivered to your Inbox.

Join 959 other followers