Robins and Wasserman Respond to a Nobel Prize Winner Continued: A Counterexample to Bayesian Inference?
This is a response to Chris Sims’ comments on our previous blog post. Because of the length of our response, we are making this a new post rather than putting it in the comments of the last post.
Recall that we observe $latex {n}&fg=000000$ iid observations $latex {O=(X,R,RY)}&fg=000000$, where $latex {Y}&fg=000000$ and $latex {R}&fg=000000$ are Bernoulli and independent given $latex {X}&fg=000000$.
Define $latex {\theta(x)=E[Y\mid X=x]}&fg=000000$ and $latex {\pi(x)=P(R=1\mid X=x)}&fg=000000$. We assume that $latex {\pi(\cdot)}&fg=000000$ is a known function. Also the marginal
density of $latex {X}&fg=000000$ (with $latex {X\in[0,1]^d}&fg=000000$ and $latex {d=100{,}000}&fg=000000$) is known and uniform on the unit cube in $latex {\mathbb{R}^d}&fg=000000$. Our goal is estimation of
$latex \displaystyle
\psi \equiv E\left[ Y\right] =E\left\{ E\left[ Y\mid X\right] \right\} =
E\left\{E\left[ Y\mid X,R=1\right] \right\} =\int_{[0,1]^d} \theta \left( x\right) dx.
&fg=000000$
The likelihood
$latex \displaystyle
\prod_{i=1}^{n}p(X_{i})\,p(R_{i}\mid X_{i})\,p(Y_{i}\mid X_{i})^{R_{i}}=\left\{
\prod_{i}\pi (X_{i})^{R_{i}}(1-\pi (X_{i}))^{1-R_{i}}\right\} \left\{
\prod_{i}\,\theta (X_{i})^{Y_{i}R_{i}}(1-\theta
(X_{i}))^{(1-Y_{i})R_{i}}\right\}
&fg=000000$
factors into two parts: the first depending only on $latex {\pi(\cdot)}&fg=000000$ and the second only on $latex {\theta(\cdot)}&fg=000000$.
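For readers who want to experiment, here is a minimal simulation sketch of this observation model, together with the simple Horvitz-Thompson estimator discussed in our original post. The particular $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ below (and the small $latex {d}&fg=000000$) are hypothetical placeholders chosen only so the script runs quickly; nothing in our argument depends on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimension; the point of the post is that d = 100,000 in the real problem.
d, n = 10, 5000

def theta(x):
    # Hypothetical regression function E[Y | X=x]; for illustration only.
    return 0.1 + 0.8 * x.mean(axis=1)

def pi(x):
    # Hypothetical known sampling probability P(R=1 | X=x), bounded away from 0.
    return 0.1 + 0.8 * x[:, 0]

X = rng.uniform(size=(n, d))   # X uniform on the unit cube
R = rng.binomial(1, pi(X))     # R | X ~ Bernoulli(pi(X))
Y = rng.binomial(1, theta(X))  # Y | X ~ Bernoulli(theta(X)); observed only when R = 1

# Horvitz-Thompson estimator of psi = E[Y]: unbiased because pi is known.
psi_ht = np.mean(R * Y / pi(X))

# Naive complete-case average: biased whenever Cov{theta(X), pi(X)} != 0.
psi_naive = Y[R == 1].mean()

print(psi_ht, psi_naive)
```

Here $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ are positively correlated by construction, so the complete-case average overestimates $latex {\psi}&fg=000000$ while the Horvitz-Thompson estimator does not.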
1. Selection Bias
This is a point of agreement. There is selection bias if and only if
$latex \displaystyle
Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}\neq 0.
&fg=000000$
Note that
$latex \displaystyle
E\left[ Y\right] =E\left[ Y\mid R=1\right] -
\frac{Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} }{E[R]}.
&fg=000000$
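To see why, note that because $latex {Y}&fg=000000$ and $latex {R}&fg=000000$ are independent given $latex {X}&fg=000000$,

$latex \displaystyle
E\left[ Y\mid R=1\right] =\frac{E\left[ YR\right] }{E\left[ R\right] }
=\frac{E\left[ \theta \left( X\right) \pi \left( X\right) \right] }{E\left[ R\right] }
=\frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} +E\left[ \theta \left( X\right) \right] E\left[ \pi \left( X\right) \right] }{E\left[ R\right] },
&fg=000000$

and $latex {E\left[ \theta(X)\right] =E[Y]}&fg=000000$ while $latex {E\left[ \pi(X)\right] =E[R]}&fg=000000$; rearranging gives the identity above.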
Hence, if $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$, then
the sample average of $latex {Y}&fg=000000$ in the subset of subjects whose $latex {Y}&fg=000000$ is observed (i.e., those with $latex {R=1}&fg=000000$) is unbiased for
$latex \displaystyle
\psi \equiv E\left[ Y\right] =\int_{[0,1]^d}\ \theta \left( x\right) dx.
&fg=000000$
In this case, inference is easy for the Bayesian and the frequentist and there is no issue. So we all agree that the interesting case is where there is selection bias,
that is, where $latex {Cov\left\{ \theta(X),\pi(X)\right\} \neq 0}&fg=000000$.
2. Posterior Dependence on $latex {\pi(\cdot)}&fg=000000$
If the prior $latex {W}&fg=000000$ on the functions $latex {\pi(\cdot)}&fg=000000$ and $latex {\theta(\cdot)}&fg=000000$ is such that $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$, then the posterior does not depend on $latex {\pi(\cdot)}&fg=000000$, and the posterior for $latex {\psi}&fg=000000$ will not concentrate around the true value of $latex {\psi}&fg=000000$. Again, we believe we all agree on this point.
We note that no one, Bayesian or frequentist, has ever proposed using an estimator that does not depend on $latex {\pi(\cdot)}&fg=000000$ in the selection bias case, i.e., when $latex {Cov\left\{ \theta(X),\pi(X)\right\} }&fg=000000$ is nonzero. (See the addendum for more on this point.)
3. Prior Independence Versus Selection Bias
Reading Chris’ comments, the reader might get the impression that prior independence rules out selection bias, that is,
$latex \displaystyle W(\pi,\theta) =W(\pi)W(\theta)\ \ \ \ {\rm implies \ that\ }\ \ \
Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.&fg=000000$
Therefore, one might conclude that if we want to discuss the interesting case where there is selection bias, then we cannot have $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$.
But this is incorrect: $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$ does not imply that $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$. To see this, consider the following example.
Suppose that $latex {X}&fg=000000$ is one dimensional and a Bayesian’s prior $latex {W}&fg=000000$ for $latex {(\theta(\cdot),\pi(\cdot))}&fg=000000$ depends only on the two parameters $latex {(\alpha_\theta ,\alpha_\pi )}&fg=000000$ as follows:
$latex \displaystyle
\theta \left( x\right) =\alpha _{\theta }x,\ \ \
\pi \left( x\right) =\alpha _{\pi}x+1/10 \ \ \
\text{with}\ \alpha _{\theta }\text{ and }\alpha _{\pi }\text{ a\ priori\ independent, }
&fg=000000$
where $latex {\alpha_\theta}&fg=000000$ is uniform on $latex {(0,1)}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ is uniform on $latex {(0,9/10)}&fg=000000$ (so that $latex {\pi(x)\in[1/10,1]}&fg=000000$).
Then, clearly $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ are independent under $latex {W}&fg=000000$. However, recalling that $latex {X}&fg=000000$ is uniform, we have that for any fixed $latex {(\alpha_\theta ,\alpha_\pi )}&fg=000000$,
$latex \displaystyle
Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\}
=\int_{0}^{1}\theta \left( x\right) \pi \left( x\right) dx-\int_{0}^{1}\pi
\left( x\right) dx\int_{0}^{1}\theta \left( x\right) dx \\
=\alpha _{\theta }\alpha _{\pi }\left( \int_{0}^{1}x^{2}dx-\left\{
\int_{0}^{1}xdx\right\} ^{2}\right) =\alpha _{\theta }\alpha _{\pi }/12 .
&fg=000000$
Hence
$latex \displaystyle
W({\rm there\ exists\ selection\ bias})=W\Biggl( Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} >0 \Biggr) =1
&fg=000000$
since $latex {\alpha_\theta}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ are both positive with probability 1.
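This computation is easy to verify numerically. Here is a short Monte Carlo sketch (using the uniform ranges for $latex {\alpha_\theta}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ assumed above):

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw from the prior: independent alpha_theta and alpha_pi.
alpha_theta = rng.uniform(0, 1)
alpha_pi = rng.uniform(0, 9 / 10)

x = rng.uniform(size=1_000_000)  # X uniform on [0, 1]
theta_x = alpha_theta * x
pi_x = alpha_pi * x + 1 / 10

print(np.cov(theta_x, pi_x)[0, 1])  # Monte Carlo Cov{theta(X), pi(X)}
print(alpha_theta * alpha_pi / 12)  # closed form: positive with probability 1
```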
4. Other Justifications For Prior Dependence?
Since prior independence of $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ does not imply “no selection bias,” one might instead argue that it is practically unrealistic to have $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$. But we now show that it is realistic.
Suppose a new HMO needs to estimate the fraction $latex {\psi}&fg=000000$ of its patient population that will have a myocardial infarction (MI) in the next year, so as to determine the number of cardiac unit beds needed. Each HMO member has had 300 potential risk factors $latex {X}&fg=000000$ measured: age, weight, height, blood pressure, multiple tests of liver, renal, pulmonary, and cardiac function, good and bad cholesterol, packs per day smoked, years smoked, etc. (We will get to $latex {d=100{,}000}&fg=000000$ once routine genomic testing becomes feasible.) A general epidemiologist had earlier studied risk factors for MI
by following 5000 of the 50,000 HMO members for a year. Because MI is a rare event, he oversampled subjects whose $latex {X}&fg=000000$, in his opinion, indicated a
smaller probability of an MI (i.e., a smaller $latex {\theta(X)}&fg=000000$). Hence the
sampling fraction $latex {\pi(x)}&fg=000000$ was a known, but complex, function chosen so as to try to make $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ negatively correlated.
The world’s leading heart expert, our Bayesian, was hired to estimate $latex {\psi}&fg=000000$ based on the distribution of $latex {X}&fg=000000$ in the HMO members and the data from the epidemiologist’s study.
As the world’s expert, his beliefs about the risk function $latex {\theta(\cdot)}&fg=000000$ would not change upon learning $latex {\pi(\cdot)}&fg=000000$, as $latex {\pi(\cdot)}&fg=000000$ only reflects a nonexpert’s beliefs about $latex {\theta(\cdot)}&fg=000000$. Hence $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ are a priori independent. Nonetheless, knowing that the epidemiologist had carefully read the expert literature on risk factors for MI, he also believes with high probability that the epidemiologist succeeded in making the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ negatively correlated.
What’s more, Robins and Ritov (1997) showed that if, before seeing the data, any Bayesian, cardiac expert or not, thoroughly queries the epidemiologist
(who selected $latex {\pi(\cdot)}&fg=000000$) about the epidemiologist’s reasoned opinions concerning $latex {\theta(\cdot)}&fg=000000$ (but not about $latex {\pi(\cdot)}&fg=000000$), the Bayesian will then have independent priors. The idea is that once you are satisfied that you have learned from the epidemiologist all he knows about $latex {\theta(\cdot)}&fg=000000$ that you did not, you will have an updated prior for
$latex {\theta(\cdot)}&fg=000000$. Your prior for $latex {\theta \left( \cdot \right)
\ (}&fg=000000$now updated) cannot then change if you subsequently are told $latex {\pi \left(
\cdot \right) .}&fg=000000$ Hence, we could take as many Bayesians as you please and arrange it so that all had $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ a priori independent. This last argument is quite general, applying to many settings.
5. Alternative Interpretation
An alternative reading of Chris’s third response and his subsequent post is that, rather than placing a joint prior over the functions $latex {\theta(\cdot)}&fg=000000$
and $latex {\pi(\cdot)}&fg=000000$ as above,
his prior is placed over the joint distribution of the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$.
If so, he is then correct that making $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ independent with prior probability one
also implies $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$ and thus no selection bias.
However, it appears that from this he concludes that selection bias, in itself, licenses the dependence of his posterior on $latex {\pi(\cdot)}&fg=000000$.
This is incorrect. As noted above, it is prior dependence of the functions $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ that licenses posterior dependence on $latex {\pi(\cdot)}&fg=000000$, not prior dependence of the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$. Were he correct, our Bayesian cardiac expert’s prior on $latex {\theta(\cdot)}&fg=000000$ could have changed upon learning the epidemiologist’s $latex {\pi(\cdot)}&fg=000000$.
6. What If We Do Use a Prior That Depends on $latex {\pi(\cdot)}&fg=000000$?
In the above scenario, $latex {W(\theta)}&fg=000000$ should not depend on $latex {\pi(\cdot)}&fg=000000$. But suppose, for whatever reason, one insists on letting $latex {W(\theta)}&fg=000000$ depend on $latex {\pi(\cdot)}&fg=000000$.
That still does not mean the posterior will concentrate. Having an estimator that depends on $latex {\pi(\cdot)}&fg=000000$ is necessary, but not sufficient, for consistency and fast rates. It is not enough to use a prior that is a function of $latex {\pi(\cdot)}&fg=000000$. The prior still has to be carefully engineered to ensure that the posterior for $latex {\psi}&fg=000000$ will concentrate around the truth.
Chris hints that he can construct such a prior, but he provides neither an explicit algorithm nor an argument as to why the resulting estimator would be expected to be locally semiparametric efficient. However, it is simple to construct a consistent,
locally semiparametric efficient Bayes estimator as follows.
We tentatively model $latex {\theta(\cdot)}&fg=000000$ as a finite dimensional parametric function $latex {b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right)}&fg=000000$
with either a smooth or noninformative prior on the parameters $latex {(\eta_{1},\ldots ,\eta _{k},\omega)}&fg=000000$, where we take
$latex \displaystyle b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) = \mathrm{expit}\left( \sum_{m=1}^{k}\eta_{m}\phi _{m}(x)+\frac{\omega }{\pi (x)}\right) ,
&fg=000000$
$latex {\mathrm{expit}(u)=e^{u}/(1+e^{u})}&fg=000000$, and the $latex {\phi_{m}(\cdot)}&fg=000000$ are basis functions. Then the posterior mean
of $latex {\psi =\int_{[0,1]^d} b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) dx}&fg=000000$ will have the same asymptotic distribution as the locally semiparametric efficient regression estimator of Scharfstein et al. (1999) described in our original post. Note that the estimator is consistent even if the model $latex {b}&fg=000000$ is wrong.
Of course, this estimator is a clear case of frequentist pursuit Bayes.
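For concreteness, here is a minimal sketch of this construction. We use the maximum likelihood fit in place of the posterior mean (asymptotically equivalent under a smooth prior), and the data-generating $latex {\theta(\cdot)}&fg=000000$, $latex {\pi(\cdot)}&fg=000000$ and the basis functions $latex {\phi_m}&fg=000000$ are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from a placeholder instance of the model.
d, n = 10, 5000
X = rng.uniform(size=(n, d))
pi_x = 0.1 + 0.8 * X[:, 0]            # known sampling probabilities pi(X_i)
theta_x = 0.1 + 0.8 * X.mean(axis=1)  # unknown truth, used only to simulate
R = rng.binomial(1, pi_x)
Y = rng.binomial(1, theta_x)

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

# Design: intercept, a few (crudely chosen) basis functions phi_m, and the
# crucial extra covariate 1/pi(x).
Phi = np.column_stack([np.ones(n), X[:, :3], 1.0 / pi_x])

# Fit b(x) = expit(sum_m eta_m*phi_m(x) + omega/pi(x)) by maximum likelihood
# on the complete cases (R = 1), using Newton-Raphson.
A, y = Phi[R == 1], Y[R == 1]
beta = np.zeros(A.shape[1])
for _ in range(25):
    p = expit(A @ beta)
    grad = A.T @ (y - p)
    hess = A.T @ (A * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

# Estimate psi by averaging the fitted b over the known distribution of X,
# here approximated by the empirical average over all n subjects.
psi_hat = expit(Phi @ beta).mean()
print(psi_hat, theta_x.mean())
```

Note that the model $latex {b}&fg=000000$ is in fact misspecified here, yet $latex {\widehat{\psi}}&fg=000000$ remains consistent: the score equation for $latex {\omega}&fg=000000$ forces $latex {\sum_{i:R_i=1}\left( Y_i-b(X_i)\right) /\pi (X_i)=0}&fg=000000$, and it is this Horvitz-Thompson-type constraint that delivers consistency even when the model is wrong.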
7. Conclusion
Here are the main points:
• If $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$, then the posterior will not concentrate.
Thus, if a Bayesian wants the posterior for $latex {\psi}&fg=000000$ to concentrate around the true value,
he must justify using a prior $latex {W(\theta)}&fg=000000$ that is a function of $latex {\pi(\cdot)}&fg=000000$.
• $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$ does not imply an absence of selection bias.
Therefore, an argument of the form “we want selection bias so we cannot have prior independence” fails.
• One can try to argue that prior independence is unrealistic. But as we have shown, this is not the case.
• But if, after all this, we do insist on letting $latex {W(\theta)}&fg=000000$ depend on $latex {\pi(\cdot)}&fg=000000$,
it is still not enough. Dependence on $latex {\pi(\cdot)}&fg=000000$ is necessary but not sufficient for the posterior to concentrate.
We conclude that Bayes fails in our example unless one uses a special prior designed just to mimic the frequentist estimator.
8. Addendum: What Happens If the Estimator Does Not Depend on $latex {\pi(\cdot)}&fg=000000$?
The theorem of Robins and Ritov, quoted in our initial post, says that no uniformly consistent estimator that does not depend on $latex {\pi
\left( \cdot \right) }&fg=000000$ can exist in the model which contains all measurable $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ subject
to $latex {\pi(x)\geq \delta >0}&fg=000000$ with probability 1. Take $latex {\delta =1/10}&fg=000000$ for
concreteness. In fact, even when we assume $latex {\theta(\cdot)}&fg=000000$
and $latex {\pi(\cdot)}&fg=000000$ are quite smooth, there will be little
improvement in performance.
Given that $latex {X}&fg=000000$ has $latex {d=100{,}000}&fg=000000$ dimensions, we can ask how many derivatives $latex {\alpha}&fg=000000$ and $latex {\beta}&fg=000000$ must
$latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ have so that it is possible to construct an estimator of $latex {\psi}&fg=000000$, not depending on $latex {\pi(\cdot)}&fg=000000$, that
converges at rate $latex {n^{-1/2}}&fg=000000$ uniformly to $latex {\psi}&fg=000000$ over the submodel. Robins et al. (2008) show that it is necessary and sufficient that $latex {\alpha +\beta \geq d/2}&fg=000000$ and provide an
explicit estimator. More generally, with $latex {\alpha +\beta <d/2}&fg=000000$ derivatives, the optimal rate is $latex {n^{-\frac{2(\alpha +\beta )/d}{1+2(\alpha +\beta )/d}}}&fg=000000$, which is approximately $latex {n^{-2(\alpha +\beta )/d}}&fg=000000$ when $latex {(\alpha +\beta )/d}&fg=000000$ is small compared to 1. An explicit estimator attaining this rate
is constructed in Robins et al. (2008); Robins et al. (2009) prove that the rate cannot be improved on. Given these asymptotic mathematical results, we
doubt any reader can exhibit an estimator, not depending on $latex {\pi(\cdot)}&fg=000000$, that will have reasonable finite sample performance under the unrestricted
model, or even under a smooth submodel with, say, $latex {d=100{,}000}&fg=000000$
and a sample size of 5,000. By reasonable finite sample performance, we mean an interval estimator that will cover the true $latex {\psi}&fg=000000$ at least 95% of the time and whose average length is less than or equal to that of interval estimators
centered on the improved Horvitz-Thompson (HT) estimators. Nonetheless, we
await any candidate estimators, accompanied by at least some simulation
evidence backing up the claim.
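To get a quantitative feel for how slow these rates are when $latex {d=100{,}000}&fg=000000$, the arithmetic takes a few lines (a sketch; the values of $latex {\alpha +\beta}&fg=000000$ below are hypothetical smoothness levels):

```python
# Rate exponent from Robins et al. (2008, 2009): psi is estimable at rate
# n**(-r) with r = (2s/d) / (1 + 2s/d), where s = alpha + beta < d/2.
d, n = 100_000, 5_000

for s in (2, 20, 200, 2_000):
    r = (2 * s / d) / (1 + 2 * s / d)
    print(f"alpha+beta = {s:>5}: rate n^-{r:.4f}; at n = 5000, n^-r = {n ** -r:.3f}")
```

Even assuming 2,000 derivatives in total, the guaranteed error at $latex {n=5{,}000}&fg=000000$ is still of order one, which is why we doubt any $latex {\pi}&fg=000000$-free estimator can perform reasonably here.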
9. References
• Robins JM, Tchetgen E, Li L, van der Vaart A. (2009). Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321.
• Robins JM, Li L, Tchetgen E, van der Vaart A. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, 2:335–421.
• Robins JM, Ritov Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine, 16:285–319.
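• Scharfstein DO, Rotnitzky A, Robins JM. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94:1096–1120.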