The Robins-Ritov Example: A Post-Mortem
This post is a follow-up to the two earlier posts on the Robins-Ritov example. We don’t want to prolong the debate but, rather, just summarize our main points.
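- The Horvitz-Thompson estimator $\hat\psi$ satisfies the following condition: for every $\epsilon > 0$,
$$\sup_{\theta\in\Theta} P\big(|\hat\psi - \psi| > \epsilon\big) \le 2\exp(-2n\epsilon^2\delta^2) \tag{1}$$
where $\Theta$, the parameter space, is the set of all functions $\theta: [0,1]^d \to [0,1]$, and $\delta > 0$ is the lower bound on the sampling probabilities $\pi$ from the earlier posts. (There are practical improvements to the Horvitz-Thompson estimator that we discussed in our earlier posts but we won’t revisit those here.)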
- A Bayes estimator requires a prior $W(\theta)$ for $\theta$. In general, if $W(\theta)$ is not a function of $\pi$ then (1) will not hold. (And in our earlier post we argued that in realistic settings, the prior would in fact not depend on $\pi$.)
- If you let $W$ be a function of $\pi$, (1) still, in general, does not hold.
- If you make $W$ a function of $\pi$ in just the right way, then (1) will hold. Stefan Harmeling and Marc Toussaint have a nice paper which shows one way to do this. And we showed an improved Bayesian estimator that depends on $\pi$ in our earlier post. There is nothing wrong with doing this, but in our opinion this is not in the spirit of Bayesian inference. Constructing a Bayesian estimator to have good frequentist properties is really just frequentist inference.
- Chris Sims pointed out in his notes that the Bayes estimator does well in the parametric case. We agree: we never said otherwise. To quote from Chris’ notes: “I think probably the arguments Robins and Wasserman want to make do depend fundamentally on infinite-dimensionality – that is, on considering a situation where $\theta$ lies in an infinite-dimensional space and we want to avoid restricting ourselves to a topologically small subset of that space in advance.” That’s exactly correct. The problem we are discussing is the nonparametric case.
- The supremum in (1) is important. When we say that the estimator concentrates around the truth uniformly, we are referring to the presence of the supremum. A Bayes estimator can converge in the non-uniform sense. That is, it can satisfy
$$P\big(|\hat\psi_{\rm Bayes} - \psi| > \epsilon\big) \le 2\exp(-2n\epsilon^2\delta^2) \tag{2}$$
for some $\theta$’s in $\Theta$. In particular, if the prior $W$ is highly concentrated around some function $\theta_0$ and if $\theta_0$ happens to be the true function, then of course something like (2) will hold. But if the prior is not concentrated around the truth, (1) won’t hold.
- This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property like (1).
- This example was presented in a simplified form to make it clear. In an observational study, the function $\pi$ is also unknown. In that case, when $X$ is high-dimensional, the best that can be hoped for is a “doubly robust” (DR) estimator that performs well if either (but not necessarily both) $\pi$ or $\theta$ is accurately modelled. The locally semiparametric efficient regression of our original post with estimated $\pi$ is an example. DR estimators are now routinely used in biostatistics. They have also caught the attention of researchers at Google (Lambert and Pregibon 2007; Chan, Ge, Gershony, Hesterberg and Lambert 2010) and Yahoo! (Dudik, Langford and Li 2011). Bayesian approaches to modelling $\pi$ and $\theta$ have been used in the construction of the DR estimator (Cefalu, Dominici, and Parmigiani 2012).
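For readers who want to see the uniform bound (1) in action, here is a minimal simulation sketch. It assumes the setup of the earlier posts: $X$ uniform on $[0,1]^d$, $R \sim \mathrm{Bernoulli}(\pi(X))$ with known $\pi \ge \delta$, $Y \sim \mathrm{Bernoulli}(\theta(X))$ observed only when $R = 1$, and $\psi = \int \theta(x)\,dx$. Each term $R_i Y_i/\pi(X_i)$ lies in $[0, 1/\delta]$, so Hoeffding's inequality gives the right-hand side of (1) whatever $\theta$ is. The particular `theta` and `pi` below, and all function names, are our own illustrative choices and not part of the original example.

```python
import math
import random


def theta(x):
    # Hypothetical outcome function theta(x) = P(Y=1 | X=x); integrates to 0.5.
    return 0.5 + 0.4 * math.sin(2 * math.pi * x[0])


def pi(x, delta):
    # Hypothetical known sampling probability P(R=1 | X=x), bounded below by delta.
    # Uses the second coordinate, so this sketch needs d >= 2.
    return delta + (1 - delta) * x[1]


def horvitz_thompson(n, d, delta, rng):
    """One draw of psi_hat = (1/n) * sum_i R_i * Y_i / pi(X_i)."""
    total = 0.0
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        p = pi(x, delta)
        r = rng.random() < p          # R ~ Bernoulli(pi(X))
        y = rng.random() < theta(x)   # Y ~ Bernoulli(theta(X)), used only if R = 1
        if r and y:
            total += 1.0 / p
    return total / n


def exceedance_rate(n, d, delta, eps, psi, reps, seed=0):
    """Monte Carlo estimate of P(|psi_hat - psi| > eps) for this one theta."""
    rng = random.Random(seed)
    hits = sum(
        abs(horvitz_thompson(n, d, delta, rng) - psi) > eps for _ in range(reps)
    )
    return hits / reps


def hoeffding_bound(n, eps, delta):
    """Right-hand side of (1): 2 * exp(-2 * n * eps^2 * delta^2)."""
    return 2 * math.exp(-2 * n * eps**2 * delta**2)


if __name__ == "__main__":
    n, d, delta, eps = 2000, 2, 0.5, 0.05
    rate = exceedance_rate(n, d, delta, eps, psi=0.5, reps=200)
    print("empirical exceedance rate:", rate)
    print("Hoeffding bound from (1): ", hoeffding_bound(n, eps, delta))
```

Swapping in a different `theta` changes the data but not the bound, which is the point of the supremum in (1). A single simulation can only probe one $\theta$ at a time, so this illustrates the bound and does not verify it uniformly.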
2. A Sociological Comment
We are surprised by how defensive Bayesians are when we present this example. Consider the following (true) story.
One day, Professor X showed LW an example where maximum likelihood does not do well. LW’s response was to shrug his shoulders and say: “That’s interesting. I won’t use maximum likelihood for that example.”
Professor X was surprised. He felt that by showing one example where maximum likelihood fails, he had discredited maximum likelihood. This is absurd. We use maximum likelihood when it works well and we don’t use maximum likelihood when it doesn’t work well.
When Bayesians see the Robins-Ritov example (or other similar examples), why don’t they just shrug their shoulders and say: “That’s interesting. I won’t use Bayesian inference for that example”? Some do. But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.
References
Cefalu, M., Dominici, F. and Parmigiani, G. (2012). Model Averaged Double Robust Estimation. Harvard University Biostatistics Working Paper Series.
Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: causal models at scale. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 7-16.
Dudik, M., Langford, J. and Li, L. (2011). Doubly Robust Policy Evaluation and Learning. arXiv preprint arXiv:1103.4601.
Harmeling, S. and Toussaint, M. (2007). Bayesian Estimators for Robins-Ritov’s Problem. Technical Report. University of Edinburgh, School of Informatics.
Lambert, D. and Pregibon, D. (2007). More bang for their bucks: assessing new features for online advertisers. Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, 7-15.