The Robins-Ritov Example: A Post-Mortem
This post is a follow-up to our two earlier posts on the Robins-Ritov example. We don't want to prolong the debate but, rather, just summarize our main points.
1. Summary
- The Horwitz-Thompson estimator $\hat{\psi} = \frac{1}{n}\sum_{i=1}^n \frac{Y_i R_i}{\pi(X_i)}$ satisfies the following condition: for every $\epsilon > 0$,
$$\sup_{\theta \in \Theta} P_\theta\bigl(|\hat{\psi} - \psi(\theta)| > \epsilon\bigr) \rightarrow 0 \quad \text{as } n \rightarrow \infty, \ \ \ \ \ (1)$$
where $\psi(\theta) = \int_{[0,1]^d} \theta(x)\, dx$ is the target of estimation and $\Theta$ — the parameter space — is the set of all functions $\theta \colon [0,1]^d \rightarrow [0,1]$. (There are practical improvements to the Horwitz-Thompson estimator that we discussed in our earlier posts but we won't revisit those here.)
- A Bayes estimator requires a prior for $\theta$. In general, if the prior is not a function of $\pi$ then (1) will not hold. (And in our earlier post we argued that in realistic settings, the prior would in fact not depend on $\pi$.) A small simulation sketch illustrating why the known $\pi$ matters is given after this list.
- If you let the prior be a function of $\pi$, (1) still, in general, does not hold.
- If you make the prior a function of $\pi$ in just the right way, then (1) will hold. Stefan Harmeling and Marc Toussaint have a nice paper which shows one way to do this. And we showed an improved Bayesian estimator that depends on $\pi$ in our earlier post. There is nothing wrong with doing this, but in our opinion it is not in the spirit of Bayesian inference. Constructing a Bayesian estimator to have good frequentist properties is really just frequentist inference.
- Chris Sims pointed out in his notes that the Bayes estimator does well in the parametric case. We agree: we never said otherwise. To quote from Chris' notes: "I think probably the arguments Robins and Wasserman want to make do depend fundamentally on infinite-dimensionality – that is, on considering a situation where $\theta$ lies in an infinite-dimensional space and we want to avoid restricting ourselves to a topologically small subset of that space in advance." That's exactly correct. The problem we are discussing is the nonparametric case.
- The supremum in (1) is important. When we say that the estimator concentrates around the truth uniformly, we are referring to the presence of the supremum. A Bayes estimator can converge in the non-uniform sense. That is, it can satisfy
$$P_\theta\bigl(|\hat{\psi} - \psi(\theta)| > \epsilon\bigr) \rightarrow 0 \ \ \ \ \ (2)$$
for some $\theta$'s in $\Theta$. In particular, if the prior is highly concentrated around some function $\theta_0$ and if $\theta_0$ happens to be the true function, then of course something like (2) will hold. But if the prior is not concentrated around the truth, (1) won't hold.
- This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property like (1).
- This example was presented in a simplified form to make it clear. In an observational study, the function $\pi$ is also unknown. In that case, when $X$ is high dimensional, the best that can be hoped for is a "doubly robust" (DR) estimator that performs well if either (but not necessarily both) $\theta$ or $\pi$ is accurately modelled. The locally semiparametric efficient regression of our original post with $\pi$ estimated is an example. (A standard form of a DR estimator is written out after this list.) DR estimators are now routinely used in biostatistics. They have also caught the attention of researchers at Google (Lambert and Pregibon 2007; Chan, Ge, Gershony, Hesterberg and Lambert 2010) and Yahoo! (Dudik, Langford and Li 2011). Bayesian approaches to modelling $\theta$ and $\pi$ have been used in the construction of DR estimators (Cefalu, Dominici and Parmigiani 2012).
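For readers unfamiliar with DR estimators, a standard form — the augmented inverse-probability-weighted (AIPW) estimator, which we give here only as a representative example and not as the specific estimator used in any of the cited papers — is
$$\hat{\psi}_{DR} = \frac{1}{n}\sum_{i=1}^n \left[ \frac{R_i\{Y_i - \hat{\theta}(X_i)\}}{\hat{\pi}(X_i)} + \hat{\theta}(X_i) \right],$$
where $\hat{\theta}$ and $\hat{\pi}$ are estimates of the regression function and the selection probabilities. If either $\hat{\theta}$ or $\hat{\pi}$ converges to the truth (not necessarily both), $\hat{\psi}_{DR}$ is consistent for $\psi$. And when $\pi$ is known, using the true $\pi$ in place of $\hat{\pi}$ gives an estimator that is consistent no matter how badly $\hat{\theta}$ is specified, while a good $\hat{\theta}$ reduces the variance relative to plain Horwitz-Thompson.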
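To make the role of the known $\pi$ in the first two points concrete, here is a minimal simulation sketch. Everything specific in it — the sample size, the particular $\theta$ and $\pi$, and the use of a one-dimensional $X$ — is our own illustrative choice rather than the construction from the original example, and the "ignore $\pi$" estimator below is simply a stand-in for any procedure that never uses the known selection probabilities (the real force of the example is, of course, the high-dimensional case):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (our own choices, not the exact Robins-Ritov construction):
# theta(x) = E[Y | X = x] is unknown to the analyst,
# pi(x) = P(R = 1 | X = x) is KNOWN and bounded away from zero,
# and the target is psi = integral of theta(x) dx.
n = 200_000
X = rng.uniform(size=n)
theta = lambda x: x                       # "true" regression function; psi = 0.5
pi = lambda x: 0.1 + 0.8 * (x > 0.5)      # known selection probabilities

Y = rng.binomial(1, theta(X))             # outcomes (all generated, few observed)
R = rng.binomial(1, pi(X))                # R = 1 means Y is observed

# Horwitz-Thompson estimator: uses the known pi and never models theta,
# so its behaviour does not depend on which theta generated the data.
psi_ht = np.mean(R * Y / pi(X))

# An estimator that ignores pi: the mean of the observed Y's.
# Selection is informative here (pi and theta both increase with X),
# so this stays biased no matter how large n is.
psi_ignore_pi = Y[R == 1].mean()

print("true psi               0.500")
print(f"Horwitz-Thompson       {psi_ht:.3f}")
print(f"ignoring the known pi  {psi_ignore_pi:.3f}")
```

On a run like this, the Horwitz-Thompson estimate lands near 0.5 while the estimator that ignores $\pi$ settles near 0.7, the selection-biased value. The point is not that a Bayesian must behave like the second estimator; it is that unless $\pi$ enters the procedure somewhere, nothing forces the uniform guarantee (1).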
2. A Sociological Comment
We are surprised by how defensive Bayesians are when we present this example. Consider the following (true) story.
One day, professor X showed LW an example where maximum likelihood does not do well. LW’s response was to shrug his shoulders and say: “that’s interesting. I won’t use maximum likelihood for that example.”
Professor X was surprised. He felt that by showing one example where maximum likelihood fails, he had discredited maximum likelihood. This is absurd. We use maximum likelihood when it works well and we don’t use maximum likelihood when it doesn’t work well.
When Bayesians see the Robins-Ritov example (or other similar examples), why don't they just shrug their shoulders and say: "that's interesting. I won't use Bayesian inference for that example." Some do. But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.
3. References
Cefalu, M., Dominici, F. and Parmigiani, G. (2012). Model Averaged Double Robust Estimation. Harvard University Biostatistics Working Paper Series.
Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: causal models at scale. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 7-16.
Dudik, M., Langford, J. and Li, L. (2011). Doubly Robust Policy Evaluation and Learning. arXiv preprint arXiv:1103.4601.
Harmeling, S. and Toussaint, M. (2007). Bayesian Estimators for Robins-Ritov’s Problem. Technical Report. University of Edinburgh, School of Informatics.
Lambert, D. and Pregibon, D. (2007). More bang for their bucks: assessing new features for online advertisers. Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, 7-15.
7 Comments
Larry: On your sociological point: yes, Bayesians are oddly thin-skinned. Their philosophy calls for a global method with an underlying universal philosophical foundation. Even when they have to patch things up, as they do in practice, they want to believe it’s part of an unfalsified (if unfalsifiable), deep, overarching scheme of reasoning and learning.
It seems to me that most Bayesians have an ideological commitment to the Bayesian approach — we accept arguments that show that Bayesian inference is the unique method of learning that has certain desired properties. For subjective Bayesians the desired property is diachronic Dutch book immunity; for Jaynesians such as myself, the desired property is (essentially) consistency with classical logic. What I suppose most Bayesians overlook is that these arguments are strictly about updating and offer no protection against bad priors — GIGO still applies.
For the prior distribution, one must go further than the usual Bayesian optimality arguments. (In a previous comment I’ve briefly described what I regard as the correct principle for the prior and the implications of the Robins-Ritov result in light of that principle.)
Corey: But the Bayesians have (mostly) run away from diachronic Dutch book immunity – remember my blog on this?
I’m not sure which Bayesians have “run away” from diachronic Dutch book other than Jon Williamson. (I continue to contend that Williamson’s purported reductio of the diachronic Dutch book argument begs the question.)
Corey: I was referring, among philosophers, to the kind of example discussed by Kyburg, Howson and other philosophers from 20+ years ago. (Kyburg wasn't a Bayesian, but he emphasized and elaborated on the kind of example raised by others. See, for example, my blogpost http://errorstatistics.com/2012/04/15/3376/.)
For statisticians, see
http://errorstatistics.com/2012/05/20/betting-bookies-and-bayes-does-it-not-matter/
http://errorstatistics.blogspot.com/2012/01/you-may-believe-you-are-bayesian-but.html
Senn, S. (2011). "You May Believe You Are a Bayesian But You Are Probably Wrong." RMM, Vol. 2, 48–66.
Bayesian or not Bayesian seems irrelevant here. From a machine learning viewpoint, the goal is to identify general principles for inference that can be systematized. Without that, every problem requires a human "in the loop", and at least one goal of machine learning research is to minimize the need for such art.
“by showing one example where method X fails, he had discredited method X”
Larry, that was the strategy I found most peculiar when I entered graduate school in Biostatistics.
This academic _boxing match_ has been interesting and has provided a case study on communicating complex statistical arguments in the blog medium. If there is a next time, perhaps a referee could clarify and enforce the rules (the sets of assumptions the antagonists will stick to), as this is hard to do unilaterally.
My sense is that although this may be a statistical problem, it is not a biostatistical problem, in that it would not apply to a biological (genetically based) population but only to virtual populations, for instance as in the Sims game (where I should clarify that I am referring to the strategic life-simulation video game).
The model is of a population where almost everyone is an island unto themselves (with respect to a risk) and uniformly distributed, but the purposefulness of estimating the average risk requires that such a population is stable and does not evolve (or at least not too quickly) — I think the bigger wrong. Here I am disagreeing with Stefan Harmeling and Marc Toussaint about needing hyperparameters to justify averaging different $\theta$'s (apples and oranges) – you just need a well-defined population for whom that average (fruit salad) is meaningful (and it would be purposeful if stable).