Peter Grunwald gave a talk in the statistics department on Monday. Peter does very interesting work and the material he spoke about is no exception. Here are my recollections from the talk.
The summary is this: Peter and John Langford have a very cool example of Bayesian inconsistency, quite different from the usual examples. In the talk, Peter explained the inconsistency and then talked about a way to fix it.
All previous examples of inconsistency in Bayesian inference that I know of have two things in common: the parameter space is complicated and the prior does not put enough mass around the true distribution. The Grunwald-Langford example is much different.
Let $\mathcal{P}$ be a countable parameter space. We start with the very realistic assumption that the model is wrong. That is, the true distribution $p^*$ is not in $\mathcal{P}$. It is generally believed in this case that the posterior concentrates near $q^*$, the distribution in $\mathcal{P}$ closest (in Kullback-Leibler distance) to $p^*$. In Peter and John’s example, the posterior does not concentrate around $q^*$. What is surprising is that this inconsistency holds even though the space is countable and even though the prior puts positive mass on $q^*$. If this doesn’t surprise you, it should.
On the other hand, there are papers like Kleijn and van der Vaart (The Annals of Statistics, 2006, pages 837–877) that show that the posterior does indeed concentrate around the Kullback-Leibler projection $q^*$. So what is going on?
The key is that in the Grunwald-Langford example, the space $\mathcal{P}$ is not convex. (More precisely, the projection of the true distribution $p^*$ onto $\mathcal{P}$ does not equal the projection of $p^*$ onto the convex hull of $\mathcal{P}$.)
You can fix the problem by replacing $\mathcal{P}$ with its convex hull. But this is not a good fix. To see why, suppose that each $p \in \mathcal{P}$ corresponds to some classifier $h$ in some set $\mathcal{H}$. If $p_1, p_2$ correspond to two different classifiers $h_1, h_2 \in \mathcal{H}$, the mixture $\alpha p_1 + (1-\alpha) p_2$ might not correspond to any classifier in $\mathcal{H}$. Forming mixtures might take you out of the class you are interested in.
Peter has a better fix. Instead of using the posterior distribution, use the generalized posterior $\pi_\eta(p \mid \text{data}) \propto \pi(p)\, L(p)^{\eta}$ where $\pi$ is the prior, $L$ is the likelihood and $\eta$ is a constant that can change with sample size $n$. It turns out that there is a constant $\eta_{\mathrm{crit}}$ such that the generalized posterior is consistent as long as $\eta < \eta_{\mathrm{crit}}$.
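To make the generalized posterior concrete, here is a minimal sketch that raises the likelihood to a power $\eta$ before normalizing. The three-point Bernoulli model, the data, and the function names are my own assumptions for illustration; none of this is from Peter's talk. Setting `eta=1.0` gives the ordinary posterior, and smaller values flatten it toward the prior.

```python
import math

def generalized_posterior(prior, log_liks, eta):
    """Return weights proportional to prior[k] * exp(eta * log_liks[k])."""
    # Work in log space and subtract the max for numerical stability.
    log_w = [math.log(p) + eta * ll for p, ll in zip(prior, log_liks)]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def bernoulli_log_lik(theta, data):
    """Log-likelihood of 0/1 data under a Bernoulli(theta) model."""
    ones = sum(data)
    zeros = len(data) - ones
    return ones * math.log(theta) + zeros * math.log(1.0 - theta)

# Hypothetical toy model: three candidate success probabilities.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]
data = [1, 1, 0, 1, 1, 1, 0, 1]
log_liks = [bernoulli_log_lik(t, data) for t in thetas]

standard = generalized_posterior(prior, log_liks, eta=1.0)   # ordinary Bayes
tempered = generalized_posterior(prior, log_liks, eta=0.25)  # flatter weights
```

The sketch only shows the mechanics of tempering; it says nothing about which $\eta$ is small enough for consistency.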
The problem is that we don’t know the critical value $\eta_{\mathrm{crit}}$. Now comes a key observation. Suppose you wanted to predict a new observation. In Bayesian inference, one usually uses the predictive distribution $\bar{p}(y) = \sum_{p} p(y)\, \pi(p \mid \text{data})$, which is a mixture with weights based on the posterior. Consider instead predicting using $\tilde{p}$ where $\tilde{p}$ is randomly chosen from the posterior. Peter calls these “mixture prediction” and “randomized prediction.” If the posterior is concentrated, then these two types of prediction are similar. He uses the difference between the mixture prediction and the randomized prediction to estimate $\eta_{\mathrm{crit}}$. (More precisely, he builds a procedure that mimics a generalized posterior based on a good $\eta$.)
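Here is a rough sketch of the two kinds of prediction and of one way to measure how far apart they are. The choice of log loss, the Bernoulli toy model, and all names are my assumptions; this is not Peter's actual procedure, which is not spelled out here. The gap below is the expected log loss of the randomized prediction minus the log loss of the mixture prediction, which is non-negative by Jensen's inequality and shrinks as the posterior concentrates.

```python
import math
import random

def mixture_prob(posterior, thetas, y):
    """Posterior-weighted mixture probability of outcome y (0 or 1)."""
    return sum(w * (t if y == 1 else 1.0 - t) for w, t in zip(posterior, thetas))

def randomized_prob(posterior, thetas, y, rng):
    """Probability of y under a single model drawn from the posterior."""
    t = rng.choices(thetas, weights=posterior, k=1)[0]
    return t if y == 1 else 1.0 - t

def prediction_gap(posterior, thetas, y):
    """Expected log loss of randomized prediction minus log loss of the mixture."""
    expected_randomized = sum(
        w * -math.log(t if y == 1 else 1.0 - t) for w, t in zip(posterior, thetas)
    )
    return expected_randomized + math.log(mixture_prob(posterior, thetas, y))

# Hypothetical toy model: three candidate success probabilities.
thetas = [0.2, 0.5, 0.8]
spread = [1 / 3, 1 / 3, 1 / 3]        # posterior far from concentrated
concentrated = [0.001, 0.001, 0.998]  # posterior nearly a point mass

gap_spread = prediction_gap(spread, thetas, y=1)
gap_concentrated = prediction_gap(concentrated, thetas, y=1)

# One randomized prediction, with a fixed seed for reproducibility.
one_draw = randomized_prob(spread, thetas, 1, random.Random(0))
```

With a spread-out posterior the gap is clearly positive; with a nearly degenerate posterior it is close to zero, which matches the remark that the two predictions agree when the posterior is concentrated.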
The result is a “fixed-up” Bayesian procedure that automatically repairs itself to avoid inconsistency. This is quite remarkable.
Unfortunately, Peter has not finished the paper yet so we will have to wait a while for all the details. (Peter says he might start his own blog so perhaps we’ll be able to read about the paper on his blog.)
— Larry Wasserman