## Self-Repairing Bayesian Inference

Peter Grunwald gave a talk in the statistics department on Monday. Peter does very interesting work and the material he spoke about is no exception. Here are my recollections from the talk.

The summary is this: Peter and John Langford have a very cool example of Bayesian inconsistency, quite different from the usual examples of inconsistency. In the talk, Peter explained the inconsistency and then described a way to fix it.

All previous examples of inconsistency in Bayesian inference that I know of have two things in common: the parameter space is complicated and the prior does not put enough mass around the true distribution. The Grunwald-Langford example is much different.

Let ${{\cal P}}$ be a countable parameter space. We start with the very realistic assumption that the model ${{\cal P}}$ is wrong. That is, the true distribution ${P_*}$ is not in ${{\cal P}}$. It is generally believed in this case that the posterior concentrates near ${Q_*}$, the distribution in ${{\cal P}}$ closest (in Kullback-Leibler divergence) to ${P_*}$. In Peter and John’s example, the posterior does not concentrate around ${Q_*}$. What is surprising is that this inconsistency holds even though the space is countable and even though the prior puts positive mass on ${Q_*}$. If this doesn’t surprise you, it should.

On the other hand, there are papers like Kleijn and van der Vaart (The Annals of Statistics, 2006, pages 837–877) that show that the posterior does indeed concentrate around ${Q_*}$. So what is going on?

The key is that in the Grunwald-Langford example, the space ${{\cal P}}$ is not convex. (More precisely, the projection of ${P_*}$ onto ${{\cal P}}$ does not equal the projection of ${P_*}$ onto the convex hull of ${{\cal P}}$.)

You can fix the problem by replacing ${{\cal P}}$ with its convex hull. But this is not a good fix. To see why, suppose that each ${p\in {\cal P}}$ corresponds to some classifier ${h}$ in some set ${{\cal H}}$. If ${p_1,p_2\in {\cal P}}$ correspond to two different classifiers ${h_1, h_2}$, the mixture ${(p_1 + p_2)/2}$ might not correspond to any classifier in ${{\cal H}}$. Forming mixtures might take you out of the class you are interested in.

Peter has a better fix. Instead of using the posterior distribution, use the generalized posterior ${g(\theta)\propto \pi(\theta) L(\theta)^\eta}$ where ${\pi}$ is the prior, ${L}$ is the likelihood and ${0\leq \eta \leq 1}$ is a constant that can change with sample size ${n}$. It turns out that there is a constant ${\eta_{\sf crit}}$ such that the generalized posterior is consistent as long as ${\eta < \eta_{\sf crit}}$.
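To make the generalized posterior concrete, here is a minimal Python sketch for a finite parameter set. The function name `generalized_posterior`, the three Bernoulli models, the uniform prior and the data are all made up for illustration; none of them come from the talk. The only thing taken from the text is the formula ${g(\theta)\propto \pi(\theta) L(\theta)^\eta}$.

```python
import math

def generalized_posterior(log_prior, log_lik, eta):
    """Normalized weights proportional to prior(theta) * likelihood(theta)**eta."""
    # Work on the log scale and subtract the max for numerical stability.
    scores = [lp + eta * ll for lp, ll in zip(log_prior, log_lik)]
    top = max(scores)
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical toy setup: three Bernoulli(theta) models with a uniform prior.
thetas = [0.2, 0.5, 0.8]
data = [1, 1, 0, 1, 1, 1, 0, 1]
log_prior = [math.log(1.0 / len(thetas))] * len(thetas)
log_lik = [sum(math.log(t if y == 1 else 1.0 - t) for y in data) for t in thetas]

standard = generalized_posterior(log_prior, log_lik, eta=1.0)   # ordinary posterior
tempered = generalized_posterior(log_prior, log_lik, eta=0.25)  # flatter than standard
prior_only = generalized_posterior(log_prior, log_lik, eta=0.0) # data ignored
```

Setting ${\eta = 1}$ recovers the ordinary posterior and ${\eta = 0}$ returns the prior. Values in between down-weight the likelihood, so the data move the weights more slowly.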

The problem is that we don’t know ${\eta_{\sf crit}}$. Now comes a key observation. Suppose you wanted to predict a new observation. In Bayesian inference, one usually uses the predictive distribution ${p(y|{\rm data}) = \int p(y|\theta) p(\theta|{\rm data})d\theta}$ which is a mixture with weights based on the posterior. Consider instead predicting using ${p(y|\theta)}$ where ${\theta}$ is randomly chosen from the posterior. Peter calls these “mixture prediction” and “randomized prediction.” If the posterior is concentrated, then these two types of prediction are similar. He uses the difference between the mixture prediction and the randomized prediction to estimate ${\eta}$. (More precisely, he builds a procedure that mimics a generalized posterior based on a good ${\eta}$.)
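The contrast between the two kinds of prediction can be sketched numerically. The Python below compares the log loss of the mixture prediction with the expected log loss of the randomized prediction for a single new observation. The weights and probabilities are invented toy values, and using log loss as the yardstick is an assumption on my part. This is not Peter's procedure for choosing ${\eta}$, only an illustration of the quantity it is built on.

```python
import math

def mixture_log_loss(weights, probs):
    """Log loss of predicting y with the posterior mixture sum_theta w(theta) p(y|theta)."""
    return -math.log(sum(w * p for w, p in zip(weights, probs)))

def randomized_log_loss(weights, probs):
    """Expected log loss when theta is drawn from the posterior and p(y|theta) is used."""
    return sum(-w * math.log(p) for w, p in zip(weights, probs) if w > 0)

def prediction_gap(weights, probs):
    """Randomized loss minus mixture loss; nonnegative by Jensen's inequality."""
    return randomized_log_loss(weights, probs) - mixture_log_loss(weights, probs)

# Hypothetical values of p(y | theta) for one observed y under three models.
probs = [0.2, 0.5, 0.8]

spread_out = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]  # posterior not concentrated
concentrated = [0.0, 0.0, 1.0]                  # all mass on one model

gap_spread = prediction_gap(spread_out, probs)
gap_concentrated = prediction_gap(concentrated, probs)
```

When the weights sit on a single model the two predictions coincide and the gap is zero. When the weights are spread over models that disagree, the gap is strictly positive.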

The result is a “fixed-up” Bayesian procedure that automatically repairs itself to avoid inconsistency. This is quite remarkable.

Unfortunately, Peter has not finished the paper yet so we will have to wait a while for all the details. (Peter says he might start his own blog so perhaps we’ll be able to read about the paper on his blog.)

Postscript: Congratulations to Cosma for winning second place in the best science blog contest.

— Larry Wasserman

1. Keith O'Rourke
Posted June 27, 2012 at 9:02 pm | Permalink

Larry, I have thought about this for a long time – but this comment is from the hip.

I think it was called “Looking for the Jaborwocki” but the idea I encountered before entering biostatistics was that a model (or a representation of something) may well imply a Jaborwocki (no cognizant being should or even could doubt this if they understood the representation) but they should not be disappointed if they could not find the Jaborwocki in what was being represented.

Assuming the universe is finite, implications of non-finite models need not apply to anything that will happen in the particular universe I happen to inhabit. This could be put as “if it can’t be simulated (a necessarily finite approximation of a probability model) one need not worry about it in any brute realities they may need to address.”

If I was convinced I was not in some sense wrong, I would not post this – but I am yet to be convinced this sort of thing must concern me.

Cheers
Keith

• Posted July 12, 2012 at 4:53 pm | Permalink

Dear Keith,

This is of course an important issue.
In the example Larry refers to, things actually do go terribly wrong in small samples as well (in fact, I did some simulations):
you have a reasonable but not perfect approximation to the true distribution with a very high prior (say, 1/2),
and you have many much worse approximations with much smaller priors. Yet these bad approximations keep getting almost all of the posterior mass.

The extension to countably infinite models is only there to state the result in a way that says: ‘no matter how many data you observe, the phenomenon will never go away’. But it’s certainly relevant for small samples as well (other, non-Bayesian methods do pick up the best approximation to the ‘truth’ very fast).

The point of my new paper is to have the best of both worlds – perform as well as the Bayesian methods when the model is correct, and as well as these other methods if the model is wrong.

Best wishes,

Peter

• Keith O'Rourke
Posted July 13, 2012 at 11:26 am | Permalink

Thanks for the clarification and the especially clear picture.

For those who may be interested, I also found a paper that elaborates on concerns related to the ones I was raising:
“Asymptotics of Maximum Likelihood without the LLN or CLT or Sample Size Going to Infinity” Charles J. Geyer

Thanks to arxiv: http://arxiv.org/abs/1206.4762

2. Posted July 12, 2012 at 4:48 pm | Permalink

Hi Larry and others,

Thanks for this 100% accurate (!) recollection of my talk.
A preliminary version of part of this work was just accepted for the ALT (Algorithmic Learning Theory) Conference 2012.
I just put the paper, called

The Safe Bayesian: learning the learning rate via the mixability gap

on my webpage: http://homepages.cwi.nl/~pdg/ftp/alt12longer.pdf

The paper includes the ‘picture that says it all’.
I’m still struggling with writing a longer version explaining all the connections to Tsybakov exponents etc.

• Posted July 12, 2012 at 5:48 pm | Permalink

Thanks Peter

—LW