Robins and Wasserman Respond to a Nobel Prize Winner Continued: A Counterexample to Bayesian Inference?

This is a response to Chris Sims’ comments on our previous blog post. Because of the length of our response, we are making this a new post rather than putting it in the comments of the last post.

Recall that we observe {n} iid observations {O=\left( X,R,RY\right)}, where {Y} and {R} are Bernoulli and independent given {X}.
Define {\theta \left( X\right) \equiv E \left[ Y|X\right]} and {\pi \left( X\right) \equiv E\left[ R|X\right]}. We assume that {\pi \left( \cdot \right) } is a known function. Also the marginal
density {p\left( x\right)} of {X=\left( X_{1},...,X_{d}\right)} (with {d=100,000}) is known and uniform on the unit cube {[0,1]^{d}}. Our goal is estimation of

\displaystyle   \psi \equiv E\left[ Y\right] =E\left\{ E\left[ Y|X\right] \right\} =  E\left\{E\left[ Y|X,R=1\right] \right\} =\int_{[0,1]^d} \theta \left( x\right) dx.

The likelihood

\displaystyle   \prod_{i=1}^{n}p(X_{i})p(R_{i}|X_{i})p(Y_{i}|X_{i})^{R_{i}}=\left\{  \prod_{i}\pi (X_{i})^{R_{i}}(1-\pi (X_{i}))^{1-R_{i}}\right\} \left\{  \prod_{i}\,\theta (X_{i})^{Y_{i}R_{i}}(1-\theta  (X_{i}))^{(1-Y_{i})R_{i}}\right\} \

factors into two parts – the first depending on {\pi \left( \cdot \right)} and the second on {\theta \left( \cdot \right)}.

1. Selection Bias

This is a point of agreement. There is selection bias if and only if

\displaystyle   Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}\neq 0.

Note that

\displaystyle   E\left[ Y\right] =E\left[ Y|R=1\right] -  \frac{Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} }{E[R]}.
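
To see where this identity comes from, use {Y \perp R\mid X} together with {E\left[ \pi \left( X\right) \right] =E\left[ R\right]} and {E\left[ \theta \left( X\right) \right] =\psi }:

\displaystyle   E\left[ Y|R=1\right] =\frac{E\left[ YR\right] }{E\left[ R\right] }=\frac{E\left[ \theta \left( X\right) \pi \left( X\right) \right] }{E\left[ R\right] }=\frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} +\psi E\left[ R\right] }{E\left[ R\right] }=\psi +\frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} }{E\left[ R\right] },

and rearranging gives the displayed identity.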

Hence, if {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0} then
the sample average of {Y} among the subjects with {R=1} (those whose {Y} is observed) is unbiased for

\displaystyle   \psi \equiv E\left[ Y\right] =\int_{[0,1]^d}\ \theta \left( x\right) dx.

In this case, inference is easy for the Bayesian and the frequentist and there is no issue. So we all agree that the interesting case is where there is selection bias,
that is, where {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} \neq 0.}
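
To make the contrast concrete, here is a minimal simulation sketch (our illustration, not from the original post; the one-dimensional {X} and the particular {\theta} and {\pi} below are assumed purely for the demo). It shows that under selection bias the raw mean of the observed {Y}'s is biased for {\psi}, while the Horvitz-Thompson estimator, which uses the known {\pi}, is not:

```python
# Minimal illustration (assumed toy setup, not from the post): when
# Cov{theta(X), pi(X)} != 0, the naive mean of the observed Y's is biased for
# psi = E[Y]; weighting by the known pi (Horvitz-Thompson) removes the bias.
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.uniform(0.0, 1.0, size=n)   # one-dimensional X for illustration
theta = 0.8 * x                     # theta(x) = P(Y = 1 | X = x)
pi = 0.8 * x + 0.1                  # pi(x) = P(R = 1 | X = x), known by design
y = rng.binomial(1, theta)
r = rng.binomial(1, pi)

psi_true = 0.4                      # integral of 0.8 * x over [0, 1]
naive = y[r == 1].mean()            # biased upward: theta and pi positively correlated
ht = np.mean(r * y / pi)            # Horvitz-Thompson estimator, mean of R*Y/pi(X)

print(f"true psi = {psi_true:.3f}, naive = {naive:.3f}, Horvitz-Thompson = {ht:.3f}")
```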

2. Posterior Dependence on {\pi}

If the prior {W} on the functions {\pi(\cdot)} and {\theta(\cdot)} is such that {W(\pi,\theta)= W(\pi)W(\theta)} then the posterior does not depend on {\pi} and the posterior for {\psi} will not concentrate around the true value of {\psi}. Again, we believe we all agree on this point.
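
To spell out why, recall the factorization of the likelihood displayed above. If {W(\pi,\theta)= W(\pi)W(\theta)}, the {\pi}-factor separates from the {\theta}-factor and we get

\displaystyle   W\left( \theta \,|\,\text{data}\right) \propto W\left( \theta \right) \prod_{i}\,\theta (X_{i})^{Y_{i}R_{i}}(1-\theta (X_{i}))^{(1-Y_{i})R_{i}},

which does not involve {\pi \left( \cdot \right)}. In particular, the posterior for {\psi =\int_{[0,1]^d} \theta \left( x\right) dx} makes no use of the known {\pi \left( \cdot \right)}.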

We note that no one, Bayesian or frequentist, has ever proposed using an estimator that does not depend on {\pi \left( \cdot \right) } in the selection bias case, i.e., when {Cov\left\{ \theta \left( X\right) ,\pi\left( X\right) \right\} } is non-zero. (See addendum for more on this point.)

3. Prior Independence Versus Selection Bias

Reading Chris’ comments, the reader might get the impression that prior independence rules out selection bias, that is,

\displaystyle W(\pi,\theta) =W(\pi)W(\theta)\ \ \ \ {\rm implies \ that\ }\ \ \   Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.

Therefore, one might conclude that if we want to discuss the interesting case where there is selection bias, then we cannot have {W(\pi,\theta) =W(\pi)W(\theta)}.

But this is incorrect. {W(\pi,\theta) =W(\pi)W(\theta)} does not imply that {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.} To see this, consider the following example.

Suppose that {X} is one dimensional and a Bayesian’s prior {W} for {\left( \theta \left( \cdot \right) ,\pi \left(\cdot \right) \right)} depends only on the two parameters {\left( \alpha_{\theta },\alpha _{\pi }\right) } as follows:

\displaystyle   \theta \left( x\right) =\alpha _{\theta }x,\ \ \   \pi \left( x\right) =\alpha _{\pi}x+1/10 \ \ \   \text{with}\ \alpha _{\theta }\text{ and }\alpha _{\pi }\text{ a\ priori\ independent, }

where {\alpha _{\theta }} is uniform on {\left( 0,1\right)} and {\alpha_{\pi }} is uniform on (0,9/10).
Then, clearly {\theta\left( \cdot \right)} and {\pi \left( \cdot \right)} are independent under {W}. However, recalling that {X} is uniform, so {p\left( x\right) \equiv 1,} we have that for any fixed {\left( \alpha _{\theta },\alpha _{\pi }\right)},

\displaystyle   Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\}  =\int_{0}^{1}\theta \left( x\right) \pi \left( x\right) dx-\int_{0}^{1}\ \pi  \left( x\right) dx\int_{0}^{1}\theta \left( x\right) dx \\  =\alpha _{\theta }\alpha _{\pi }\left( \int_{0}^{1}x^{2}dx-\left\{  \int_{0}^{1}xdx\right\} ^{2}\right) =\alpha _{\theta }\alpha _{\pi }/12 .

Hence

\displaystyle   W({\rm there\ exists\ selection\ bias})=W\Biggl( Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} >0 \Biggr) =1

since {\alpha _{\theta }} and {\alpha_{\pi }} are both positive with {W-}probability 1.
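
As a quick numerical check of this example (a sketch we add for illustration; it just redoes the covariance calculation above by Monte Carlo):

```python
# Sketch: draws from the independent prior W(theta, pi) = W(theta) W(pi) of this
# section still exhibit selection bias, Cov{theta(X), pi(X)} = alpha_theta * alpha_pi / 12 > 0.
import numpy as np

rng = np.random.default_rng(0)

for _ in range(3):                             # a few independent prior draws
    alpha_theta = rng.uniform(0.0, 1.0)        # alpha_theta ~ Uniform(0, 1)
    alpha_pi = rng.uniform(0.0, 0.9)           # alpha_pi ~ Uniform(0, 9/10)

    x = rng.uniform(0.0, 1.0, size=1_000_000)  # X ~ Uniform(0, 1)
    cov_mc = np.cov(alpha_theta * x, alpha_pi * x + 0.1)[0, 1]
    print(f"Monte Carlo Cov = {cov_mc:.5f},  alpha_theta * alpha_pi / 12 = {alpha_theta * alpha_pi / 12:.5f}")
```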

4. Other Justifications For Prior Dependence?

Since prior independence of {\pi} and {\theta} does not imply “no selection bias,” one might instead argue that it is practically unrealistic to have {W(\theta,\pi)=W(\theta)W(\pi)}. But we now show that it is realistic.

Suppose a new HMO needs to estimate the fraction {\psi } of its patient population that will have an MI {(Y)} in the next year, so as to determine the number of cardiac unit beds needed. Each HMO member has had 300 potential risk factors {X=(X_{1},...,X_{300})} measured: age, weight, height, blood pressure, multiple tests of liver, renal, pulmonary, and cardiac function, good and bad cholesterol, packs per day smoked, years smoked, etc. (We will get to 100,000 once routine genomic testing becomes feasible.) A general epidemiologist had earlier studied risk factors for MI
by following 5000 of the 50,000 HMO members for a year. Because MI is a rare event, he oversampled subjects whose {X}, in his opinion, indicated a
smaller probability {\theta \left( X\right) } of an MI {(Y=1)}. Hence the
sampling fraction {\pi \left( X\right) =P\left( R=1|X\right) } was a known, but complex function chosen so as to try to make {\theta \left( X\right) } and {\pi \left( X\right) } negatively correlated.

The world’s leading heart expert, our Bayesian, was hired to estimate {\psi =\int \theta \left( x\right) p\left( x\right) dx} based on the known distribution {p\left( x\right)} of {X} in HMO members and the data {\left(X_{i},R_{i},R_{i}Y_{i}\right) , i=1,...,5000,} from the study.
As the world’s expert, his beliefs about the risk function {\theta \left( \cdot \right) } would not change upon learning {\pi \left( \cdot \right)}, as {\pi \left( \cdot \right) } only reflects a nonexpert’s beliefs. Hence {\theta \left( \cdot \right) } and {\pi \left( \cdot \right) } are a priori independent. Nonetheless, knowing that the epidemiologist had carefully read the expert literature on risk factors for MI, he also believes with high probability that the epidemiologist succeeded in having the random variables {\theta \left( X\right) } and {\pi \left( X\right) } be negatively correlated.

What’s more, Robins and Ritov (1997) showed that, if before seeing the data, any Bayesian, cardiac expert or not, thoroughly queries the epidemiologist
(who selected {\pi \left( \cdot \right) }) about the epidemiologist’s reasoned opinions concerning {\theta (\cdot )} (but not about {\pi (\cdot )}), the Bayesian will then have independent priors. The idea is that once you are satisfied that you have learned from the epidemiologist all he knows about {\theta (\cdot )} that you did not, you will have an updated prior for
{\theta \left( \cdot \right) }. Your prior for {\theta \left( \cdot \right) } (now updated) cannot then change if you subsequently are told {\pi \left( \cdot \right) }. Hence, we could take as many Bayesians as you please and arrange it so all had {\theta \left( \cdot \right) } and {\pi \left( \cdot \right) } a priori independent. This last argument is quite general, applying to many settings.

5. Alternative Interpretation

An alternative reading of Chris’s third response and his subsequent post is that, rather than placing a joint prior {W} over the functions
{\theta\left( \cdot \right) =\left\{ \theta \left( x\right) ;x\in \left[ 0,1\right]^{d}\right\} } and {\pi \left( \cdot \right) =\left\{ \pi \left( x\right);x\in \left[ 0,1\right] ^{d}\right\}} as above,
his prior is placed over the joint distribution of the random variables {\theta \left( X\right) } and {\pi \left( X\right)}.
If so, he is then correct that making {\theta \left(X\right) } and {\pi \left( X\right) } independent with prior probability one
also implies {Cov\left\{ \theta \left( X\ \right) ,\pi \left( X\ \right)\right\} =0} and thus no selection bias.

However, it appears that from this, he concludes that selection bias, in itself, licenses the dependence of his posterior on {\pi \left( \cdot \right)}.
This is incorrect. As noted above, it is prior dependence of {\theta \left( \cdot \right) } and {\pi\left( \cdot \right) } that licenses posterior dependence on {\pi \left(\cdot \right)} – not prior dependence of {\theta \left( X\right) } and
{\pi\left( X\right)}. Were he correct, our Bayesian cardiac expert’s prior on {\theta \left( \cdot \right)} could have changed upon learning the epidemiologist’s {\pi \left( \cdot \right)}.

6. What If We Do Use a Prior That Depends on {\pi}?

In the above scenario, {W(\theta)} should not depend on {\pi}. But suppose, for whatever reason, one insists on letting {W(\theta)} depend on {\pi}.

That still does not mean the posterior will concentrate. Having an estimator that depends on {\pi} is necessary, but not sufficient, to get consistency and fast rates. It is not enough to use a prior {W(\theta)} that is a function of {\pi}. The prior still has to be carefully engineered to ensure that the posterior for {\psi} will concentrate around the truth.

Chris hints that he can construct such a prior but provides neither an explicit algorithm nor an argument as to why the resulting estimator would be expected to be locally semiparametric efficient. However, it is simple to construct a {n^{1/2}}-consistent
locally semiparametric efficient Bayes estimator {\hat{\psi}_{Bayes}} as follows.

We tentatively model {\theta(x) =P(Y=1|X=x)} as a finite dimensional parametric function {b\left( x;\eta_{1},\ldots ,\eta_{k},\omega \right) }
with either a smooth or noninformative prior on the parameters {\left( \eta_{1},\ldots ,\eta_{k},\omega \right)}, where we take

\displaystyle  b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) = \mathrm{expit}\left( \sum_{m=1}^{k}\eta_{m}\phi _{m}(x)+\frac{\omega }{\pi (x)}\right) ,

Here {\mathrm{expit}(a)=e^{a}/(1+e^{a})} and the {\phi _{m}\left( x\right)} are basis functions. Then the posterior mean
{\hat{\psi}_{\rm Bayes}} of {\psi =\int \theta \left( x\right) dx} will have the same asymptotic distribution as the locally semiparametric efficient regression estimator of Scharfstein et al. (1999) described in our original post. Note that the estimator is {n^{1/2}} consistent, even if the model {\theta(x) =P(Y=1|X=x)=b\left( x;\eta_{1},\ldots ,\eta_{k},\omega \right)} is wrong.

Of course, this estimator is a clear case of frequentist pursuit: the prior is engineered precisely so that the resulting Bayes estimator mimics the frequentist one.
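
For concreteness, here is a rough sketch of the construction (our illustration, not code from the post): we use a one-dimensional {X}, a single basis function {\phi_1(x)=x}, and plain maximum likelihood in place of the posterior mean, which the argument above says it matches asymptotically under a smooth prior.

```python
# Sketch of the Section 6 estimator: model theta(x) = expit(sum_m eta_m * phi_m(x) + omega / pi(x)),
# fit on the R = 1 subjects, then set psi_hat = integral of the fitted theta over the
# known (uniform) distribution of X. Toy one-dimensional setup assumed throughout.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(2)
n = 5_000

x = rng.uniform(0.0, 1.0, size=n)
theta_true = 0.8 * x                      # unknown to the analyst
pi_known = 0.8 * x + 0.1                  # known sampling probabilities pi(x)
y = rng.binomial(1, theta_true)
r = rng.binomial(1, pi_known)

def design(xv, piv):
    # Columns: intercept, basis function phi_1(x) = x, and the key covariate 1 / pi(x).
    return np.column_stack([np.ones_like(xv), xv, 1.0 / piv])

def neg_loglik(beta, X, yv):
    p = np.clip(expit(X @ beta), 1e-10, 1 - 1e-10)
    return -np.sum(yv * np.log(p) + (1 - yv) * np.log(1 - p))

X_obs = design(x[r == 1], pi_known[r == 1])
fit = minimize(neg_loglik, np.zeros(X_obs.shape[1]), args=(X_obs, y[r == 1]))

# psi_hat: average the fitted theta over fresh draws from the known uniform p(x).
x_new = rng.uniform(0.0, 1.0, size=200_000)
psi_hat = expit(design(x_new, 0.8 * x_new + 0.1) @ fit.x).mean()
print(f"psi_hat = {psi_hat:.3f}   (true psi = 0.4)")
```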

7. Conclusion

Here are the main points:

  1. If {W(\theta,\pi) = W(\theta)W(\pi)} then the posterior will not concentrate.
    Thus, if a Bayesian wants the posterior for {\psi} to concentrate around the true value,
    he must justify having a prior {W(\theta)} that is a function of {\pi}.

  2. {W(\theta,\pi) = W(\theta)W(\pi)} does not imply an absence of selection bias.
    Therefore, an argument of the form: “we want selection bias so we cannot have prior independence” fails.

  3. One can try to argue that prior independence is unrealistic. But as we have shown, this is not the case.

  4. But, if after all this, we do insist on letting {W(\theta)} depend on {\pi},
    it is still not enough. Dependence on {\pi} is necessary but not sufficient.

We conclude that Bayes fails in our example unless one uses a special prior designed just to mimic the frequentist estimator.

8. Addendum: What Happens If The Estimator Does Not Depend on {\pi}?

The theorem of Robins and Ritov, quoted in our initial post, says that no uniformly consistent estimator that does not depend on {\pi  \left( \cdot \right) } can exist in the model {\mathcal{P}} which contains all measurable {\pi \left( x\right) } and {\theta \left( x\right) } subject
to {\pi \left( X\right) >\delta >0} with probability 1. Take {\delta =1/8} for
concreteness. In fact, even when we assume {\theta \left( \cdot \right) }
and {\pi \left( \cdot \right) } are quite smooth, there will be little
improvement in performance.

Given that {X} has 100,000 dimensions, we can ask how many derivatives {\beta _{\theta }} and {\beta _{\pi }} must {\theta \left( \cdot \right) }
and {\pi \left( \cdot \right) } have so that it is possible to construct an estimator of {\psi }, not depending on {\pi \left( \cdot \right)}, that
converges at rate {n^{-\frac{1}{2}}} uniformly to {\psi } over a submodel {\mathcal{P}_{smooth}}. Robins et al. (2008) show that it is necessary and sufficient that {\beta _{\theta }+\beta _{\pi }\geq 50,000} and provide an
explicit estimator. More generally, if {\beta _{\theta }+\beta _{\pi }=s} with {0<s<50,000}, the optimal rate is {n^{-\frac{s/50,000}{1+s/50,000}}}, which is approximately {n^{-s/50,000}} when {s} is small compared to {50,000}. An explicit estimator
is constructed in Robins et al. (2008); Robins et al. (2009) prove that the rate cannot be improved on. Given these asymptotic mathematical results, we
doubt any reader can exhibit an estimator, not depending on {\pi \left(\cdot \right)}, that will have reasonable finite sample performance under
model {\mathcal{P}} or even {\mathcal{P}_{smooth}} with, say, {s=25,000}
and a sample size of 5,000. By reasonable finite sample performance, we mean an interval estimator that will cover the true {\psi } at least 95% of the time and that has average length less than or equal to that of interval estimators
centered on the {n^{1/2}}-consistent improved HT estimators. Nonetheless, we
await any candidate estimators, accompanied by at least some simulation
evidence backing up the claim.

9. References

  1. Robins JM, Tchetgen E, Li L, van der Vaart A. (2009). Semiparametric
    minimax rates. Electronic Journal of Statistics, 3:1305-1321.

  2. Robins JM, Li L, Tchetgen E, van der Vaart A. (2008). Higher order influence
    functions and minimax estimation of nonlinear functionals. Probability and
    Statistics: Essays in Honor of David A. Freedman 2:335-421

  3. Robins JM, Ritov Y. (1997). Toward a curse of dimensionality appropriate
    (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine,
    16:285-319.

13 Comments

  1. Konrad
    Posted September 2, 2012 at 7:09 pm | Permalink

    I’m disappointed that you didn’t address the questions I asked in response to the previous post. I tracked down the abstract of the Robins-Ritov paper, but the paper itself is behind a paywall. The abstract does not mention the concentration result – could you state it? Does it refer to the rate of convergence of the point estimate to the true value? For the Bayesian inference, which point estimate are you referring to? When you say “the posterior will fail to concentrate”, are you referring to the posterior itself or a point estimate (the posterior mean?) derived from the posterior?

    In underspecified problems such as this one, it is generally better to estimate intervals rather than work with point estimates – due to the vast underspecification, a point estimate is almost guaranteed to be far from the true value, but an interval estimate may nonetheless be informative and useful in practice; for instance, hypothesis testing (which essentially relies on interval rather than point estimation) is possible and useful even in overparameterized problems where point estimates are unreliable. Does the concentration result extend in some way to interval estimation?

    The Robins-Ritov abstract (like the original post here) refers to properties that estimators must have in order to be _uniformly_ consistent, so I’ll repeat my earlier question: why do you restrict your attention to estimators with this property? Might a pointwise consistent estimator not have better convergence rate?

    • Posted September 3, 2012 at 8:37 am | Permalink

      Yes, in the original post we showed how to get a confidence interval that shrinks
      at rate n^{-1/2}. Again, this is possible because of uniform consistency.
      —LW

  2. Posted September 2, 2012 at 7:20 pm | Permalink

    I do intend to reply. It took us several days to compose our reply to Chris.
    Robins and Ritov can be found here

    Click to access coda.pdf

    –LW

  3. Posted September 2, 2012 at 9:38 pm | Permalink

    Reply to Konrad

    Most of your questions are answered in the references, especially the
    Robins-Ritov paper. It would be difficult to discuss all the
    technical details in a blog post so I urge you to read the original
    papers.

    The main points are these: the Robins-Ritov paper proves that any
    estimator that is not a function of pi will not concentrate around the
    true value. (More precisely, it can only concentrate extremely
    slowly). This includes the Bayes estimator since the pi(X_i) terms
    drop out of the likelihood (they are known constants).

    On the other hand, the Horvitz-Thompson estimator (and its improved
    version) is proved to concentrate uniformly at a 1/sqrt{n} rate.

    > You point out that the likelihood function contains very little
    > information – this is as it should be and provides a sanity check for
    > likelihood-based approaches – since there is almost no information
    > available about theta, any approach that claims to have access to such
    > information should be distrusted.

    But the likelihood isn’t the only source of information. The
    likelihood ignores the randomization probabilities which are in fact
    very informative.

    > That said, your argument for the non-informativeness of the
    > likelihood function is based on the absence of smoothness
    > constraints. You then claim that introducing smoothness
    > constraints when the problem is high dimensional will not help
    > unless you make the function “very, very, very smooth”- but you
    > do not quantify this. The thing to be quantified is not how much
    > we know about theta (which remains very little), but how much we
    > know about psi (after theta has been marginalised out). On what
    > do you base the claim that “we have seen that response 3 doesn’t
    > work” when you have made no attempt to quantify how well it
    > actually does work or how it compares to your proposed solution?

    This is addressed in the addendum (Section 8) to our most recent post.
    The amount of smoothness you need to assume grows quickly with dimension.

    > Re priors: given the structure of the problem, and if interesting
    > prior information about theta were available to be incorporated into
    > the model, it is very feasible that such information would be highly
    > dependent on pi – if we imagine this is an experimental setup which
    > has been run before and from which we have drawn previous qualitative
    > conclusions, those conclusions will have been much more informative
    > when pi(x) was mostly large than when it was mostly small, and it
    > would make sense to use a prior that is correspondingly tighter for
    > larger pi.

    Dependence of the prior on pi is necessary but not sufficient to have
    the Bayes estimator concentrate around the true value. The prior
    needs to be very carefully engineered to get concentration.

    > But I agree that one wants results that make sense even when the
    > prior is independent of pi (from an objective Bayesian point of
    > view, the prior is just another part of the model specification,
    > which ought to be specified by the person posing the problem;
    > also when we really know nothing about the system there doesn’t
    > seem to be any reason to use a prior that depends on pi).

    Agreed.

    > Finally, it is not clear how desirable uniform consistency (or
    > any one specific frequentist metric) is, if (for instance) it is
    > obtained at the expense of efficiency. Is a uniformly consistent
    > estimator necessarily better than a pointwise consistent
    > estimator? (In my limited understanding the main advantage of
    > having uniform consistency is that it allows one to establish
    > worst case guarantees on confidence interval size for finite
    > sample sizes – this is not irrelevant, but one might be more
    > interested in optimizing expected rather than worst case error.)

    We view uniform consistency as vital. With pointwise consistency, we
    can only say that there is some sample size n at which the estimator
    becomes accurate to within, say, epsilon. But this n depends on the
    unknown theta. With uniform consistency, n depends only on epsilon.
    More importantly, without uniform consistency, it’s not possible to
    construct a finite sample confidence interval.

    I hope these comments help.
    —LW

  4. Posted September 2, 2012 at 9:44 pm | Permalink

    Reply to Enstophy:

    You’re right to point out that Bayesians will see probabilities and
    frequencies as different (although linked via de Finetti’s theorem).
    Nonetheless, we consider it reasonable to ask about the frequency
    behavior of posterior probability distributions. Perhaps it would be
    clearer if we said: one’s posterior beliefs will fail to concentrate
    around the truth, in the frequency sense.

    I didn’t find your argument that W(theta) should be a function of pi to
    be convincing. But anyway, dependence on pi is not enough.
    It is necessary but not sufficient.
    —LW

  5. Cyan
    Posted September 3, 2012 at 3:19 am | Permalink

    There’s a justification for priors carefully engineered to yield uniformly consistent posterior distributions that does not depend on frequentist pursuit per se: it is Solomonoff-induction pursuit. Solomonoff induction does not contradict Bayes and the Solomonoff posterior predictive distribution converges to the sampling distribution with sampling probability one; furthermore, this convergence is at the fastest possible rate. Alas, it is also uncomputable.

    If one takes the view that one’s prior ought to be a computable approximation to the Solomonoff prior, these kinds of Freedmanesque inconsistency arguments against Bayes don’t actually militate against Bayes. They are in fact incredibly useful — they show that vast swaths of the space of prior probability distributions can be disregarded, since they do not contain computable approximations to the Solomonoff prior.

    • Posted September 3, 2012 at 8:32 am | Permalink

      This sounds very interesting.
      I would like to see a proof that in this example
      it yields an estimator that is uniformly
      n^{1/2} consistent.

      —LW

  6. Posted September 3, 2012 at 1:34 pm | Permalink

    I’ve posted a round 4 now. I show that the simple example in the last post by Robins and Wasserman does not make the point that it claims to make. The arguments they want to make do depend fundamentally on infinite dimensionality, I think, and I should try to look at the Robins-Ritov reference and respond directly to that. But, for a while, maybe days, I have other stuff to get done. All my posts are still at http://sims.princeton.edu/yftp/WassermanExmpl.

    • Keith O'Rourke
      Posted September 6, 2012 at 8:20 am | Permalink

      For the problem to arise, it seems theta(x.i) must be not smooth enough for theta(x.i) ~ theta(x.j) i != j for any (or at least most) i , j where R = 1 (where Y is observed), the interest must be in psi, the E[Y] a _uniform_ expectation over [0,1]^d and pi(x) must be both non-informative (given x) and non-uniform from [0,1]^d when R=1. There does not seem to be a problem about the posterior of theta(x.i) which is simply a mixture of the prior of theta(x.i|x.i.obs,R=1,Y) or prior of theta(x.i|x.i.obs,R=0) ~ prior of theta(x.i|x.i.obs) (i.e. a mixture of posterior and prior with most of it being prior). The problem arises integrating this posterior over [0,1]^d for psi as the target and it is not clear in the blog post how this is done. Simply collapsing the posterior over x.i.obs [just where R=1] (non-uniform) would seem very wrong.
      This would explain why pi(x)=1/2 for all x does not cause any problem to arise, x.i.obs will be uniform and why a prior with all the mass on a linear function for theta(x.i) (or any combination of polynomials that is linear in x.i) will not cause a problem – the parameter(s) are the same (e.g. alpha) anywhere in [0,1]^d and a non-uniform sample from [0,1]^d does not create a problem if large enough (non-singular).

      As for the comment “If you want to be a Bayesian, be a Bayesian but accept the fact that, in this example, your posterior will fail to concentrate around the true value.” given the study design analogy of purposely choosing to sample with pi(x.i), for the objective of estimating psi, given non-smooth theta(x.i), the design is flawed for the usual Bayesian analysis and it perhaps should not be surprising that a fix is not easy to come up with. HT is designed to fix just this problem, and the Bayesian should perhaps not feel bad pronouncing that they can do nothing Bayesian for the patient except to pronounce them dead.

      So this goes in my bin of stuff that is ignorable for the practicing statistician (unlike Neyman-Scott).

      Or I may still not understand their example well enough….

  7. Stefan Harmeling
    Posted September 18, 2012 at 8:01 am | Permalink

    Hi there,

    in 2007, Marc Toussaint and I wrote a Technical Report about “Bayesian
    estimators for Robins-Ritov’s problem” [http://eprints.pascal-network.org/archive/00003871/01/harmeling-toussaint-07-ritov.pdf] which includes simulations
    and which also concludes that the critical point is the dependence
    or independence of theta and pi. In Section 3 and 4, we considered the
    setting X.i ~ uniform(1…C), R.i ~ bernoulli(pi.i), Y.i ~ N(R.i *
    theta.i, 1).

    (i) In Section 3 we derive a Bayesian estimator that does not assume
    dependence between theta and pi (in our report xi). A simulation
    shows that this Bayesian estimator has a smaller variance than the
    Horvitz-Thompson (HT) estimator on data that has been generated with
    independent theta and pi. Thus for such data the Bayesian estimator
    has no problems (even has lower variance).

    (ii) In Section 4 we assume that theta and pi are dependent. To
    derive a Bayesian estimator we have to model this dependence. Of
    course there are many possibilities: we choose a dependency that
    relates to the dependency used in Robins and Ritov (1997) in their
    proof that a likelihood-based estimator can not be uniformly unbiased.
    The Bayesian estimator looks quite similar to the HT estimator.
    However, on simulated data that follows this model the Bayesian
    estimator has again a lower variance. So also for the dependent case,
    the Bayesian estimator derived from the model assumptions (dependence
    of theta and pi) works. Curiously, it also weights the samples
    similar to the HT estimator.

    (iii) Section 5 shows that similar arguments hold for continuous X.

    My conclusion has three points (which might be obvious by now):

    (1) The HT estimator only works well on data where theta and pi are
    dependent. The advantage of the HT estimator might be that this
    dependence does not have to be made explicit.

    (2) If the dependence between theta and pi can be made explicit, we
    can derive a Bayesian estimator which works as well as the HT
    estimator (possibly with lower variance). The disadvantage of the
    Bayesian approach might be that the dependence has to be made
    explicit.

    (3) The third point is a question: can we exploit a possible
    dependence between theta and pi in a Bayesian estimator without making
    it explicit?

    I’d be curious to hear the experts’ opinion on these thoughts! Thanks!

    • Posted September 18, 2012 at 9:28 am | Permalink

      will take a look and get back to you
      LW

    • Keith O'Rourke
      Posted September 19, 2012 at 1:40 pm | Permalink

      Interesting, though you might be interested in looking at the Neyman-Scott discussion in Barndorff-Nielsen, O., and Cox, D. R. Inference and asymptotics. Chapman and
      Hall, London, 1994.

      Where they argue that, essentially, the approaches to salvage the likelihood
      separate into two: one is to … and the second replaces the specification of arbitrary
      non-commonness of the non-common parameter with a common distribution for that parameter [i.e. latent generative model].

      Still unclear to me, what compels one to take certain summaries of the posterior over others…

  8. Posted October 8, 2012 at 8:59 pm | Permalink

    I’ve posted a (final?) comment on this at the same place as the earlier one (sims.princeton.edu/yftp/WassermanExmpl). It’s the WassermanR4a.pdf file. It repeats some of what I’ve said earlier, but tries to build up the discussion from simple examples to the continuous case.