Robins and Wasserman Respond to a Nobel Prize Winner Continued: A Counterexample to Bayesian Inference?
This is a response to Chris Sims’ comments on our previous blog post. Because of the length of our response, we are making this a new post rather than putting it in the comments of the last post.
Recall that we observe $latex {n}&fg=000000$ iid observations $latex {O=(X,R,RY)}&fg=000000$, where $latex {Y}&fg=000000$ and $latex {R}&fg=000000$ are Bernoulli and independent given $latex {X}&fg=000000$.
Define $latex {\theta(x)=E[Y\mid X=x]}&fg=000000$ and $latex {\pi(x)=P(R=1\mid X=x)}&fg=000000$. We assume that $latex {\pi(\cdot)}&fg=000000$ is a known function. Also the marginal
density of $latex {X}&fg=000000$ (with $latex {X\in[0,1]^d}&fg=000000$ and $latex {d=100{,}000}&fg=000000$) is known and uniform on the unit cube in $latex {\mathbb{R}^d}&fg=000000$. Our goal is estimation of
$latex \displaystyle
\psi \equiv E\left[ Y\right] =E\left\{ E\left[ Y\mid X\right] \right\} =
E\left\{E\left[ Y\mid X,R=1\right] \right\} =\int_{[0,1]^d} \theta \left( x\right) dx.
&fg=000000$
The likelihood
$latex \displaystyle
\prod_{i=1}^{n}p(X_{i})\,p(R_{i}\mid X_{i})\,p(Y_{i}\mid X_{i})^{R_{i}}=\left\{
\prod_{i}\pi (X_{i})^{R_{i}}(1-\pi (X_{i}))^{1-R_{i}}\right\} \left\{
\prod_{i}\,\theta (X_{i})^{Y_{i}R_{i}}(1-\theta
(X_{i}))^{(1-Y_{i})R_{i}}\right\}
&fg=000000$
factors into two parts: the first depending only on $latex {\pi(\cdot)}&fg=000000$ and the second only on $latex {\theta(\cdot)}&fg=000000$.
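For readers who want to experiment, here is a minimal simulation sketch of this observation model, together with the simple Horvitz-Thompson estimator discussed in our original post. The particular $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ below (and the small $latex {d}&fg=000000$) are hypothetical placeholders chosen only so the script runs quickly; nothing in our argument depends on them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dimension; the point of the post is that d = 100,000 in the real problem.
d, n = 10, 5000

def theta(x):
    # Hypothetical regression function E[Y | X=x]; for illustration only.
    return 0.1 + 0.8 * x.mean(axis=1)

def pi(x):
    # Hypothetical known sampling probability P(R=1 | X=x), bounded away from 0.
    return 0.1 + 0.8 * x[:, 0]

X = rng.uniform(size=(n, d))   # X uniform on the unit cube
R = rng.binomial(1, pi(X))     # R | X ~ Bernoulli(pi(X))
Y = rng.binomial(1, theta(X))  # Y | X ~ Bernoulli(theta(X)); observed only when R = 1

# Horvitz-Thompson estimator of psi = E[Y]: unbiased because pi is known.
psi_ht = np.mean(R * Y / pi(X))

# Naive complete-case average: biased whenever Cov{theta(X), pi(X)} != 0.
psi_naive = Y[R == 1].mean()

print(psi_ht, psi_naive)
```

Here $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ are positively correlated by construction, so the complete-case average overestimates $latex {\psi}&fg=000000$ while the Horvitz-Thompson estimator does not.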
1. Selection Bias
This is a point of agreement. There is selection bias if and only if
$latex \displaystyle
Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}\neq 0.
&fg=000000$
Note that
$latex \displaystyle
E\left[ Y\right] =E\left[ Y\mid R=1\right] -
\frac{Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} }{E[R]}.
&fg=000000$
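To see why, note that because $latex {Y}&fg=000000$ and $latex {R}&fg=000000$ are independent given $latex {X}&fg=000000$,

$latex \displaystyle
E\left[ Y\mid R=1\right] =\frac{E\left[ YR\right] }{E\left[ R\right] }
=\frac{E\left[ \theta \left( X\right) \pi \left( X\right) \right] }{E\left[ R\right] }
=\frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} +E\left[ \theta \left( X\right) \right] E\left[ \pi \left( X\right) \right] }{E\left[ R\right] },
&fg=000000$

and $latex {E\left[ \theta(X)\right] =E[Y]}&fg=000000$ while $latex {E\left[ \pi(X)\right] =E[R]}&fg=000000$; rearranging gives the identity above.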
Hence, if $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$, then
the sample average of $latex {Y}&fg=000000$ in the subset of subjects whose $latex {Y}&fg=000000$ is observed (i.e., those with $latex {R=1}&fg=000000$) is unbiased for
$latex \displaystyle
\psi \equiv E\left[ Y\right] =\int_{[0,1]^d}\ \theta \left( x\right) dx.
&fg=000000$
In this case, inference is easy for the Bayesian and the frequentist and there is no issue. So we all agree that the interesting case is where there is selection bias,
that is, where $latex {Cov\left\{ \theta(X),\pi(X)\right\} \neq 0}&fg=000000$.
2. Posterior Dependence on $latex {\pi(\cdot)}&fg=000000$
If the prior $latex {W}&fg=000000$ on the functions $latex {\pi(\cdot)}&fg=000000$ and $latex {\theta(\cdot)}&fg=000000$ is such that $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$, then the posterior does not depend on $latex {\pi(\cdot)}&fg=000000$, and the posterior for $latex {\psi}&fg=000000$ will not concentrate around the true value of $latex {\psi}&fg=000000$. Again, we believe we all agree on this point.
We note that no one, Bayesian or frequentist, has ever proposed using an estimator that does not depend on $latex {\pi(\cdot)}&fg=000000$ in the selection bias case, i.e., when $latex {Cov\left\{ \theta(X),\pi(X)\right\} }&fg=000000$ is nonzero. (See the addendum for more on this point.)
3. Prior Independence Versus Selection Bias
Reading Chris’ comments, the reader might get the impression that prior independence rules out selection bias, that is,
$latex \displaystyle W(\pi,\theta) =W(\pi)W(\theta)\ \ \ \ {\rm implies \ that\ }\ \ \
Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.&fg=000000$
Therefore, one might conclude that if we want to discuss the interesting case where there is selection bias, then we cannot have $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$.
But this is incorrect: $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$ does not imply that $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$. To see this, consider the following example.
Suppose that $latex {X}&fg=000000$ is one dimensional and a Bayesian’s prior $latex {W}&fg=000000$ for $latex {(\theta(\cdot),\pi(\cdot))}&fg=000000$ depends only on the two parameters $latex {(\alpha_\theta ,\alpha_\pi )}&fg=000000$ as follows:
$latex \displaystyle
\theta \left( x\right) =\alpha _{\theta }x,\ \ \
\pi \left( x\right) =\alpha _{\pi}x+1/10 \ \ \
\text{with}\ \alpha _{\theta }\text{ and }\alpha _{\pi }\text{ a\ priori\ independent, }
&fg=000000$
where $latex {\alpha_\theta}&fg=000000$ is uniform on $latex {(0,1)}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ is uniform on $latex {(0,9/10)}&fg=000000$ (so that $latex {\pi(x)\in[1/10,1]}&fg=000000$).
Then, clearly $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ are independent under $latex {W}&fg=000000$. However, recalling that $latex {X}&fg=000000$ is uniform, we have that for any fixed $latex {(\alpha_\theta ,\alpha_\pi )}&fg=000000$,
$latex \displaystyle
Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\}
=\int_{0}^{1}\theta \left( x\right) \pi \left( x\right) dx-\int_{0}^{1}\pi
\left( x\right) dx\int_{0}^{1}\theta \left( x\right) dx \\
=\alpha _{\theta }\alpha _{\pi }\left( \int_{0}^{1}x^{2}dx-\left\{
\int_{0}^{1}xdx\right\} ^{2}\right) =\alpha _{\theta }\alpha _{\pi }/12 .
&fg=000000$
Hence
$latex \displaystyle
W({\rm there\ exists\ selection\ bias})=W\Biggl( Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} >0 \Biggr) =1
&fg=000000$
since $latex {\alpha_\theta}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ are both positive with probability 1.
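This computation is easy to verify numerically. Here is a short Monte Carlo sketch (using the uniform ranges for $latex {\alpha_\theta}&fg=000000$ and $latex {\alpha_\pi}&fg=000000$ assumed above):

```python
import numpy as np

rng = np.random.default_rng(1)

# One draw from the prior: independent alpha_theta and alpha_pi.
alpha_theta = rng.uniform(0, 1)
alpha_pi = rng.uniform(0, 9 / 10)

x = rng.uniform(size=1_000_000)  # X uniform on [0, 1]
theta_x = alpha_theta * x
pi_x = alpha_pi * x + 1 / 10

print(np.cov(theta_x, pi_x)[0, 1])  # Monte Carlo Cov{theta(X), pi(X)}
print(alpha_theta * alpha_pi / 12)  # closed form: positive with probability 1
```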
4. Other Justifications For Prior Dependence?
Since prior independence of $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ does not imply “no selection bias,” one might instead argue that it is practically unrealistic to have $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$. But we now show that it is realistic.
Suppose a new HMO needs to estimate the fraction $latex {\psi}&fg=000000$ of its patient population that will have a myocardial infarction (MI) in the next year, so as to determine the number of cardiac unit beds needed. Each HMO member has had 300 potential risk factors $latex {X}&fg=000000$ measured: age, weight, height, blood pressure, multiple tests of liver, renal, pulmonary, and cardiac function, good and bad cholesterol, packs per day smoked, years smoked, etc. (We will get to $latex {d=100{,}000}&fg=000000$ once routine genomic testing becomes feasible.) A general epidemiologist had earlier studied risk factors for MI
by following 5000 of the 50,000 HMO members for a year. Because MI is a rare event, he oversampled subjects whose $latex {X}&fg=000000$, in his opinion, indicated a
smaller probability of an MI (i.e., a smaller $latex {\theta(X)}&fg=000000$). Hence the
sampling fraction $latex {\pi(x)}&fg=000000$ was a known, but complex, function chosen so as to try to make $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ negatively correlated.
The world’s leading heart expert, our Bayesian, was hired to estimate $latex {\psi}&fg=000000$ based on the distribution of $latex {X}&fg=000000$ in the HMO members and the data from the epidemiologist’s study.
As the world’s expert, his beliefs about the risk function $latex {\theta(\cdot)}&fg=000000$ would not change upon learning $latex {\pi(\cdot)}&fg=000000$, as $latex {\pi(\cdot)}&fg=000000$ only reflects a nonexpert’s beliefs about $latex {\theta(\cdot)}&fg=000000$. Hence $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ are a priori independent. Nonetheless, knowing that the epidemiologist had carefully read the expert literature on risk factors for MI, he also believes with high probability that the epidemiologist succeeded in making the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ negatively correlated.
What’s more, Robins and Ritov (1997) showed that if, before seeing the data, any Bayesian, cardiac expert or not, thoroughly queries the epidemiologist
(who selected $latex {\pi(\cdot)}&fg=000000$) about the epidemiologist’s reasoned opinions concerning $latex {\theta(\cdot)}&fg=000000$ (but not about $latex {\pi(\cdot)}&fg=000000$), the Bayesian will then have independent priors. The idea is that once you are satisfied that you have learned from the epidemiologist all he knows about $latex {\theta(\cdot)}&fg=000000$ that you did not, you will have an updated prior for
$latex {\theta(\cdot)}&fg=000000$. Your prior for $latex {\theta \left( \cdot \right)
\ (}&fg=000000$now updated) cannot then change if you subsequently are told $latex {\pi \left(
\cdot \right) .}&fg=000000$ Hence, we could take as many Bayesians as you please and arrange it so that all had $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ a priori independent. This last argument is quite general, applying to many settings.
5. Alternative Interpretation
An alternative reading of Chris’s third response and his subsequent post is that, rather than placing a joint prior over the functions $latex {\theta(\cdot)}&fg=000000$
and $latex {\pi(\cdot)}&fg=000000$ as above,
his prior is placed over the joint distribution of the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$.
If so, he is then correct that making $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$ independent with prior probability one
also implies $latex {Cov\left\{ \theta(X),\pi(X)\right\} =0}&fg=000000$ and thus no selection bias.
However, it appears that from this he concludes that selection bias, in itself, licenses the dependence of his posterior on $latex {\pi(\cdot)}&fg=000000$.
This is incorrect. As noted above, it is prior dependence of the functions $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ that licenses posterior dependence on $latex {\pi(\cdot)}&fg=000000$, not prior dependence of the random variables $latex {\theta(X)}&fg=000000$ and $latex {\pi(X)}&fg=000000$. Were he correct, our Bayesian cardiac expert’s prior on $latex {\theta(\cdot)}&fg=000000$ could have changed upon learning the epidemiologist’s $latex {\pi(\cdot)}&fg=000000$.
6. What If We Do Use a Prior That Depends on $latex {\pi(\cdot)}&fg=000000$?
In the above scenario, $latex {W(\theta)}&fg=000000$ should not depend on $latex {\pi(\cdot)}&fg=000000$. But suppose, for whatever reason, one insists on letting $latex {W(\theta)}&fg=000000$ depend on $latex {\pi(\cdot)}&fg=000000$.
That still does not mean the posterior will concentrate. Having an estimator that depends on $latex {\pi(\cdot)}&fg=000000$ is necessary, but not sufficient, for consistency and fast rates. It is not enough to use a prior that is a function of $latex {\pi(\cdot)}&fg=000000$. The prior still has to be carefully engineered to ensure that the posterior for $latex {\psi}&fg=000000$ will concentrate around the truth.
Chris hints that he can construct such a prior, but he provides neither an explicit algorithm nor an argument as to why the resulting estimator would be expected to be locally semiparametric efficient. However, it is simple to construct a consistent,
locally semiparametric efficient Bayes estimator as follows.
We tentatively model $latex {\theta(\cdot)}&fg=000000$ as a finite dimensional parametric function $latex {b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right)}&fg=000000$
with either a smooth or noninformative prior on the parameters $latex {(\eta_{1},\ldots ,\eta _{k},\omega)}&fg=000000$, where we take
$latex \displaystyle b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) = \mathrm{expit}\left( \sum_{m=1}^{k}\eta_{m}\phi _{m}(x)+\frac{\omega }{\pi (x)}\right) ,
&fg=000000$
$latex {\mathrm{expit}(u)=e^{u}/(1+e^{u})}&fg=000000$, and the $latex {\phi_{m}(\cdot)}&fg=000000$ are basis functions. Then the posterior mean
of $latex {\psi =\int_{[0,1]^d} b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) dx}&fg=000000$ will have the same asymptotic distribution as the locally semiparametric efficient regression estimator of Scharfstein et al. (1999) described in our original post. Note that the estimator is consistent even if the model $latex {b}&fg=000000$ is wrong.
Of course, this estimator is a clear case of frequentist pursuit Bayes.
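For concreteness, here is a minimal sketch of this construction. We use the maximum likelihood fit in place of the posterior mean (asymptotically equivalent under a smooth prior), and the data-generating $latex {\theta(\cdot)}&fg=000000$, $latex {\pi(\cdot)}&fg=000000$ and the basis functions $latex {\phi_m}&fg=000000$ are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data from a placeholder instance of the model.
d, n = 10, 5000
X = rng.uniform(size=(n, d))
pi_x = 0.1 + 0.8 * X[:, 0]            # known sampling probabilities pi(X_i)
theta_x = 0.1 + 0.8 * X.mean(axis=1)  # unknown truth, used only to simulate
R = rng.binomial(1, pi_x)
Y = rng.binomial(1, theta_x)

def expit(u):
    return 1.0 / (1.0 + np.exp(-u))

# Design: intercept, a few (crudely chosen) basis functions phi_m, and the
# crucial extra covariate 1/pi(x).
Phi = np.column_stack([np.ones(n), X[:, :3], 1.0 / pi_x])

# Fit b(x) = expit(sum_m eta_m*phi_m(x) + omega/pi(x)) by maximum likelihood
# on the complete cases (R = 1), using Newton-Raphson.
A, y = Phi[R == 1], Y[R == 1]
beta = np.zeros(A.shape[1])
for _ in range(25):
    p = expit(A @ beta)
    grad = A.T @ (y - p)
    hess = A.T @ (A * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hess, grad)

# Estimate psi by averaging the fitted b over the known distribution of X,
# here approximated by the empirical average over all n subjects.
psi_hat = expit(Phi @ beta).mean()
print(psi_hat, theta_x.mean())
```

Note that the model $latex {b}&fg=000000$ is in fact misspecified here, yet $latex {\widehat{\psi}}&fg=000000$ remains consistent: the score equation for $latex {\omega}&fg=000000$ forces $latex {\sum_{i:R_i=1}\left( Y_i-b(X_i)\right) /\pi (X_i)=0}&fg=000000$, and it is this Horvitz-Thompson-type constraint that delivers consistency even when the model is wrong.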
7. Conclusion
Here are the main points:
• If $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$, then the posterior will not concentrate.
Thus, if a Bayesian wants the posterior for $latex {\psi}&fg=000000$ to concentrate around the true value,
he must justify using a prior $latex {W(\theta)}&fg=000000$ that is a function of $latex {\pi(\cdot)}&fg=000000$.
• $latex {W(\pi ,\theta )=W(\pi )W(\theta )}&fg=000000$ does not imply an absence of selection bias.
Therefore, an argument of the form “we want selection bias so we cannot have prior independence” fails.
• One can try to argue that prior independence is unrealistic. But as we have shown, this is not the case.
• But if, after all this, we do insist on letting $latex {W(\theta)}&fg=000000$ depend on $latex {\pi(\cdot)}&fg=000000$,
it is still not enough. Dependence on $latex {\pi(\cdot)}&fg=000000$ is necessary but not sufficient for the posterior to concentrate.
We conclude that Bayes fails in our example unless one uses a special prior designed just to mimic the frequentist estimator.
8. Addendum: What Happens If the Estimator Does Not Depend on $latex {\pi(\cdot)}&fg=000000$?
The theorem of Robins and Ritov, quoted in our initial post, says that no uniformly consistent estimator that does not depend on $latex {\pi
\left( \cdot \right) }&fg=000000$ can exist in the model which contains all measurable $latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ subject
to $latex {\pi(x)\geq \delta >0}&fg=000000$ with probability 1. Take $latex {\delta =1/10}&fg=000000$ for
concreteness. In fact, even when we assume $latex {\theta(\cdot)}&fg=000000$
and $latex {\pi(\cdot)}&fg=000000$ are quite smooth, there will be little
improvement in performance.
Given that $latex {X}&fg=000000$ has $latex {d=100{,}000}&fg=000000$ dimensions, we can ask how many derivatives $latex {\alpha}&fg=000000$ and $latex {\beta}&fg=000000$ must
$latex {\theta(\cdot)}&fg=000000$ and $latex {\pi(\cdot)}&fg=000000$ have so that it is possible to construct an estimator of $latex {\psi}&fg=000000$, not depending on $latex {\pi(\cdot)}&fg=000000$, that
converges at rate $latex {n^{-1/2}}&fg=000000$ uniformly to $latex {\psi}&fg=000000$ over the submodel. Robins et al. (2008) show that it is necessary and sufficient that $latex {\alpha +\beta \geq d/2}&fg=000000$ and provide an
explicit estimator. More generally, with $latex {\alpha +\beta <d/2}&fg=000000$ derivatives, the optimal rate is $latex {n^{-\frac{2(\alpha +\beta )/d}{1+2(\alpha +\beta )/d}}}&fg=000000$, which is approximately $latex {n^{-2(\alpha +\beta )/d}}&fg=000000$ when $latex {(\alpha +\beta )/d}&fg=000000$ is small compared to 1. An explicit estimator attaining this rate
is constructed in Robins et al. (2008); Robins et al. (2009) prove that the rate cannot be improved on. Given these asymptotic mathematical results, we
doubt any reader can exhibit an estimator, not depending on $latex {\pi(\cdot)}&fg=000000$, that will have reasonable finite sample performance under the unrestricted
model, or even under a smooth submodel with, say, $latex {d=100{,}000}&fg=000000$
and a sample size of 5,000. By reasonable finite sample performance, we mean an interval estimator that will cover the true $latex {\psi}&fg=000000$ at least 95% of the time and whose average length is less than or equal to that of interval estimators
centered on the improved Horvitz-Thompson (HT) estimators. Nonetheless, we
await any candidate estimators, accompanied by at least some simulation
evidence backing up the claim.
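To get a quantitative feel for how slow these rates are when $latex {d=100{,}000}&fg=000000$, the arithmetic takes a few lines (a sketch; the values of $latex {\alpha +\beta}&fg=000000$ below are hypothetical smoothness levels):

```python
# Rate exponent from Robins et al. (2008, 2009): psi is estimable at rate
# n**(-r) with r = (2s/d) / (1 + 2s/d), where s = alpha + beta < d/2.
d, n = 100_000, 5_000

for s in (2, 20, 200, 2_000):
    r = (2 * s / d) / (1 + 2 * s / d)
    print(f"alpha+beta = {s:>5}: rate n^-{r:.4f}; at n = 5000, n^-r = {n ** -r:.3f}")
```

Even assuming 2,000 derivatives in total, the guaranteed error at $latex {n=5{,}000}&fg=000000$ is still of order one, which is why we doubt any $latex {\pi}&fg=000000$-free estimator can perform reasonably here.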
9. References
• Robins JM, Tchetgen E, Li L, van der Vaart A. (2009). Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321.
• Robins JM, Li L, Tchetgen E, van der Vaart A. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, 2:335–421.
• Robins JM, Ritov Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine, 16:285–319.
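• Scharfstein DO, Rotnitzky A, Robins JM. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 94:1096–1120.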