## The Robins-Ritov Example: A Post-Mortem

This post is a follow-up to the two earlier posts on the Robins-Ritov example. We don’t want to prolong the debate but, rather, just summarize our main points.

1. Summary

1. The Horvitz-Thompson estimator ${\hat \psi}$ satisfies the following condition: for every ${\epsilon>0}$,

$\displaystyle \sup_{\theta\in\Theta}\mathbb{P}(|\hat \psi - \psi| > \epsilon) \leq 2 \exp\left(- 2 n \epsilon^2 \delta^2\right) \ \ \ \ \ (1)$

where ${\Theta}$, the parameter space, is the set of all functions ${\theta: [0,1]^d \rightarrow [0,1]}$. (There are practical improvements to the Horvitz-Thompson estimator that we discussed in our earlier posts but we won’t revisit those here.)

2. A Bayes estimator requires a prior ${W(\theta)}$ for ${\theta}$. In general, if ${W(\theta)}$ is not a function of ${\pi}$ then (1) will not hold. (And in our earlier post we argued that in realistic settings, the prior would in fact not depend on ${\pi}$.)
3. If you let ${W}$ be a function of ${\pi}$, (1) still, in general, does not hold.
4. If you make ${W}$ a function of ${\pi}$ in just the right way, then (1) will hold. Stefan Harmeling and Marc Toussaint have a nice paper which shows one way to do this. And we showed an improved Bayesian estimator that depends on ${\pi}$ in our earlier post. There is nothing wrong with doing this, but in our opinion this is not in the spirit of Bayesian inference. Constructing a Bayesian estimator to have good frequentist properties is really just frequentist inference.

5. Chris Sims pointed out in his notes that the Bayes estimator does well in the parametric case. We agree: we never said otherwise. To quote from Chris’ notes: “I think probably the arguments Robins and Wasserman want to make do depend fundamentally on infinite-dimensionality – that is, on considering a situation where ${\theta(\cdot)}$ lies in an infinite-dimensional space and we want to avoid restricting ourselves to a topologically small subset of that space in advance.” That’s exactly correct. The problem we are discussing is the nonparametric case.
6. The supremum in (1) is important. When we say that the estimator concentrates around the truth uniformly, we are referring to the presence of the supremum. A Bayes estimator can converge in the non-uniform sense. That is, it can satisfy

$\displaystyle \mathbb{P}(|\hat \psi - \psi| > \epsilon) \leq 2 \exp\left(- 2 n \epsilon^2 \delta^2\right) \ \ \ \ \ (2)$

for some ${\theta}$’s in ${\Theta}$. In particular, if the prior ${W(\theta)}$ is highly concentrated around some function ${\theta_0}$ and if ${\theta_0}$ happens to be the true function, then of course something like (2) will hold. But if the prior is not concentrated around the truth, (1) won’t hold.

7. This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property like (1).
8. This example was presented in a simplified form to make it clear. In an observational study, the function ${\pi}$ is also unknown. In that case, when ${X}$ is high dimensional, the best that can be hoped for is a “doubly robust” (DR) estimator that performs well if either (but not necessarily both) ${\pi}$ or ${\theta}$ is accurately modelled. The locally semiparametric efficient regression of our original post with ${\pi}$ estimated is an example. DR estimators are now routinely used in biostatistics. They have also caught the attention of researchers at Google (Lambert and Pregibon 2007, Chan, Ge, Gershony, Hesterberg and Lambert 2010) and Yahoo! (Dudik, Langford and Li 2011). Bayesian approaches to modelling ${\pi}$ and ${\theta}$ have been used in the construction of the DR estimator (Cefalu, Dominici, and Parmigiani 2012).
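To make the uniform bound (1) in point 1 concrete, here is a minimal simulation sketch. It assumes the setup of the earlier posts as we recall it: ${X}$ is uniform on ${[0,1]^d}$, ${R}$ given ${X}$ is Bernoulli with known probability ${\pi(X) \geq \delta}$, ${Y}$ given ${X}$ is Bernoulli with probability ${\theta(X)}$, ${Y}$ is observed only when ${R=1}$, and ${\psi = \int \theta(x)\,dx}$. The particular ${\pi}$, the three choices of ${\theta}$, the value of ${\delta}$ and all function names are illustrative assumptions, not taken from the posts.

```python
import numpy as np

DELTA = 0.2  # assumed lower bound on pi(x); illustrative value


def pi(x):
    # Hypothetical known sampling probability with pi(x) >= DELTA everywhere.
    return DELTA + (1.0 - DELTA) * x[..., 1]


# Hypothetical thetas, each paired with its exact psi (the integral of theta
# over the unit cube), so the estimation error can be computed directly.
THETAS = {
    "constant": (lambda x: np.full(x.shape[:-1], 0.3), 0.3),
    "linear": (lambda x: x[..., 0], 0.5),
    "step": (lambda x: (x[..., 0] > 0.5).astype(float), 0.5),
}


def horvitz_thompson(y, r, pi_x):
    # Average of R * Y / pi(X) over the sample (the last axis).
    return np.mean(r * y / pi_x, axis=-1)


def tail_probability(theta, psi, n, d, eps, reps, rng):
    # Monte Carlo estimate of P(|psi_hat - psi| > eps) for one fixed theta.
    x = rng.uniform(size=(reps, n, d))
    pi_x = pi(x)
    r = rng.binomial(1, pi_x)      # R = 1 means Y is observed
    y = rng.binomial(1, theta(x))  # only used where R = 1
    psi_hat = horvitz_thompson(y, r, pi_x)
    return float(np.mean(np.abs(psi_hat - psi) > eps))


def hoeffding_bound(n, eps, delta):
    # Right-hand side of (1).
    return float(2.0 * np.exp(-2.0 * n * eps**2 * delta**2))


def run(n=500, d=3, eps=0.2, reps=1000, seed=0):
    rng = np.random.default_rng(seed)
    bound = hoeffding_bound(n, eps, DELTA)
    results = {
        name: tail_probability(theta, psi, n, d, eps, reps, rng)
        for name, (theta, psi) in THETAS.items()
    }
    return bound, results


if __name__ == "__main__":
    bound, results = run()
    print(f"bound from (1): {bound:.4f}")
    for name, p in results.items():
        print(f"{name:>8}: empirical tail probability = {p:.4f}")
```

The same bound is compared against every ${\theta}$ without using any knowledge of ${\theta}$; that is the role of the supremum in (1). Three functions obviously do not exhaust ${\Theta}$, so this illustrates the claim rather than verifying it.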

2. A Sociological Comment

We are surprised by how defensive Bayesians are when we present this example. Consider the following (true) story.

One day, professor X showed LW an example where maximum likelihood does not do well. LW’s response was to shrug his shoulders and say: “that’s interesting. I won’t use maximum likelihood for that example.”

Professor X was surprised. He felt that by showing one example where maximum likelihood fails, he had discredited maximum likelihood. This is absurd. We use maximum likelihood when it works well and we don’t use maximum likelihood when it doesn’t work well.

When Bayesians see the Robins-Ritov example (or other similar examples) why don’t they just shrug their shoulders and say: “that’s interesting. I won’t use Bayesian inference for that example.” Some do. But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.

3. References

Cefalu, M., Dominici, F. and Parmigiani, G. (2012). Model Averaged Double Robust Estimation. Harvard University Biostatistics Working Paper Series.

Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: causal models at scale. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 7-16.

Dudik, M., Langford, J. and Li, L. (2011). Doubly Robust Policy Evaluation and Learning. Arxiv preprint arXiv:1103.4601.

Harmeling, S. and Toussaint, M. (2007). Bayesian Estimators for Robins-Ritov’s Problem. Technical Report. University of Edinburgh, School of Informatics.

Lambert, D. and Pregibon, D. (2007). More bang for their bucks: assessing new features for online advertisers. Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, 7-15.

1. Posted October 14, 2012 at 9:09 am | Permalink

Larry: On your sociological point: yes, Bayesians are oddly thin-skinned. Their philosophy calls for a global method with an underlying universal philosophical foundation. Even when they have to patch things up, as they do in practice, they want to believe it’s part of an unfalsified (if unfalsifiable), deep, overarching scheme of reasoning and learning.

2. Corey
Posted October 14, 2012 at 3:26 pm | Permalink

It seems to me that most Bayesians have an ideological commitment to the Bayesian approach — we accept arguments that show that Bayesian is the unique method of learning that has certain desired properties. For subjective Bayesians the desired property is diachronic Dutch book immunity; for Jaynesians such as myself, the desired property is (essentially) consistency with classical logic. What I suppose most Bayesians overlook is that these arguments are strictly about updating and offer no protection against bad priors — GIGO still applies.

For the prior distribution, one must go further than the usual Bayesian optimality arguments. (In a previous comment I’ve briefly described what I regard as the correct principle for the prior and the implications of the Robins-Ritov result in light of that principle.)

3. Drew Bagnell
Posted October 15, 2012 at 10:58 am | Permalink

Bayesian or not bayesian seems irrelevant here. From a machine learning viewpoint, the goal is to identify general principles for inference that can be systematized. Without that, every problem requires a human “in the loop”, and at least one goal of machine learning research is to minimize the need for such art.

4. Keith O'Rourke
Posted October 17, 2012 at 8:37 am | Permalink

“by showing one example where method X fails, he had discredited method X”
Larry, that was the strategy I found most peculiar when I entered graduate school in Biostatistics.

This academic _boxing match_ has been interesting and provided a case study on communication of complex statistical arguments in a blog medium. If there is a next time, perhaps a referee could clarify and enforce the rules (the sets of assumptions the antagonists will stick to), as this is hard to enforce unilaterally.

My sense is that, although this may be a statistical problem, it is not a biostatistical problem, in that it would not apply to a biological (genetically based) population but just to a virtual population, for instance as in the Sims game (where I should clarify that I am referring to the strategic life-simulation video game).

The model is of a population where almost everyone is an island unto themselves (with respect to a risk) and they are uniformly distributed but the purposefulness of estimating the average risk requires that such a population is stable and does not evolve at least too quickly (I think the bigger wrong). Here I am disagreeing with Stefan Harmeling and Marc Toussaint about needing hyper parameters to justify averaging different theta (apples and oranges) – you just need a well defined population for whom that average (fruit salad) is meaningful (and it would be purposeful if stable).