Robins and Wasserman Respond to a Nobel Prize Winner
James Robins and Larry Wasserman
Chris Sims is a Nobel Prize-winning economist who is well known for his work on macroeconomics, Bayesian statistics, and vector autoregressions, among other things. One of us (LW) had the good fortune to meet Chris at a conference and can attest that he is also a very nice guy.
Chris has a paper called On An Example of Larry Wasserman. This post is a response to Chris’ paper.
The example in question is actually due to Robins and Ritov (1997). A simplified version appeared in Wasserman (2004) and Robins and Wasserman (2000). The example is related to ideas from the foundations of survey sampling (Basu 1969, Godambe and Thompson 1976) and also to ancillarity paradoxes (Brown 1990, Foster and George 1996).
1. The Model
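Here is (a version of) the example. Consider iid random variables
$$(X_1, Y_1, R_1), \ldots, (X_n, Y_n, R_n).$$
The random variables take values as follows:
$$X_i \in [0,1]^d, \qquad Y_i \in \{0,1\}, \qquad R_i \in \{0,1\}.$$
Think of $d$ as being very, very large. For example, $d = 100{,}000$ and $n = 1{,}000$.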
The idea is this: we observe $X_i$. Then we flip a biased coin $R_i$. If $R_i = 1$ then you get to see $Y_i$. If $R_i = 0$ then you don’t get to see $Y_i$. The goal is to estimate
$$\psi = P(Y_i = 1).$$
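Here are the details. The distribution takes the form
$$p(x, y, r) = p_X(x)\, p_{Y|X}(y \mid x)\, p_{R|X}(r \mid x).$$
Note that $Y$ and $R$ are independent, given $X$. For simplicity, we will take $p_X(x)$ to be uniform on $[0,1]^d$. Next, let
$$\theta(x) \equiv p_{Y|X}(1 \mid x) = P(Y = 1 \mid X = x)$$
where $\theta(x)$ is a function. That is, $\theta : [0,1]^d \to [0,1]$. Of course,
$$p_{Y|X}(0 \mid x) = P(Y = 0 \mid X = x) = 1 - \theta(x).$$
Similarly, let
$$\pi(x) \equiv p_{R|X}(1 \mid x) = P(R = 1 \mid X = x)$$
where $\pi(x)$ is a function. That is, $\pi : [0,1]^d \to [0,1]$. Of course,
$$p_{R|X}(0 \mid x) = P(R = 0 \mid X = x) = 1 - \pi(x).$$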
The function $\pi$ is known. We construct it. Remember that $\pi(x) = P(R = 1 \mid X = x)$ is the probability that we get to observe $Y$ given that $X = x$. Think of $Y$ as something that is expensive to measure. We don’t always want to measure it. So we make a random decision about whether to measure it. And we let the probability of measuring $Y$ be a function $\pi(x)$ of $x$. And we get to construct this function.
Let $\delta > 0$ be a known, small, positive number. We will assume that
$$\pi(x) \geq \delta$$
for all $x$.
The only thing in the model we don’t know is the function $\theta(x)$. Again, we will assume that
$$\delta \leq \theta(x) \leq 1 - \delta.$$
Let $\Theta$ denote all measurable functions on $[0,1]^d$ that satisfy the above conditions. The parameter space is the set of functions $\Theta$.
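Let $\mathcal{P}$ be the set of joint distributions of the form
$$p(x)\, \pi(x)^{r} (1 - \pi(x))^{1-r}\, \theta(x)^{y} (1 - \theta(x))^{1-y}$$
where $p(x) = 1$, and $\pi(\cdot)$ and $\theta(\cdot)$ satisfy the conditions above. So far, we are considering the sub-model $\mathcal{P}_\pi$ in which $\pi$ is known.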
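The parameter of interest is $\psi = P(Y = 1)$. We can write this as
$$\psi = P(Y = 1) = \int_{[0,1]^d} P(Y = 1 \mid X = x)\, p(x)\, dx = \int_{[0,1]^d} \theta(x)\, dx.$$
Hence, $\psi$ is a function of $\theta$. If we know $\theta(\cdot)$ then we can compute $\psi$.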
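The data-generating mechanism described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the particular outcome function `theta`, the selection function `pi`, the bound `DELTA`, and the small dimension `d=50` are hypothetical choices, not part of the example itself. The sketch draws the covariate uniformly on the unit cube, flips the biased coin, hides the outcome when the coin comes up 0, and approximates the target parameter as the average of `theta` over the cube.

```python
import random

DELTA = 0.1  # hypothetical lower bound on the selection probability

def theta(x):
    # Hypothetical P(Y=1 | X=x), kept inside [DELTA, 1 - DELTA].
    # In the real problem this function is unknown.
    return 0.2 + 0.6 * x[0]

def pi(x):
    # Hypothetical known P(R=1 | X=x), never smaller than DELTA.
    return DELTA + (1.0 - DELTA) * x[1]

def draw(n, d, rng):
    """Draw n iid triples (x, y, r); y is None whenever r == 0."""
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]      # X uniform on [0,1]^d
        r = 1 if rng.random() < pi(x) else 0      # biased coin with known bias
        y = 1 if rng.random() < theta(x) else 0   # Y independent of R given X
        data.append((x, y if r == 1 else None, r))
    return data

rng = random.Random(0)
sample = draw(n=1000, d=50, rng=rng)
observed = sum(r for _, _, r in sample)

# Monte Carlo approximation of the integral of theta over the cube.
# For this hypothetical theta the exact value is 0.5.
psi_mc = sum(theta(x) for x, _, _ in sample) / len(sample)
```

Note that `psi_mc` uses the true `theta`, which an analyst would not have; it is shown only to make concrete that the target is the integral of `theta`.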
2. Frequentist Analysis
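The usual frequentist estimator is the Horvitz-Thompson estimator
$$\hat\psi = \frac{1}{n} \sum_{i=1}^n \frac{Y_i R_i}{\pi(X_i)}.$$
It is easy to verify that $\hat\psi$ is unbiased and consistent. Furthermore, $\hat\psi - \psi = O_P(n^{-1/2})$. In fact, let us define
$$I_n = [\hat\psi - \epsilon_n,\ \hat\psi + \epsilon_n]$$
where
$$\epsilon_n = \sqrt{\frac{1}{2 n \delta^2} \log\left(\frac{2}{\alpha}\right)}.$$
It follows from Hoeffding’s inequality that
$$\sup_{P \in \mathcal{P}_\pi} P(\psi \in I_n) \geq 1 - \alpha.$$
Thus we have a finite sample, $1 - \alpha$ confidence interval with length $O(1/\sqrt{n})$.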
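As a rough numerical check of the estimator and the Hoeffding interval, here is a minimal Python sketch. The functions `theta` and `pi`, the bound `DELTA = 0.1`, the level `alpha = 0.05`, and the sizes `n = 5000`, `d = 10` are all assumptions made for illustration. The half-width uses the fact that each term `R*Y/pi(X)` lies between 0 and `1/DELTA`.

```python
import math
import random

DELTA = 0.1  # assumed known lower bound on the selection probability

def theta(x):
    # Hypothetical P(Y=1 | X=x); its integral over the cube is 0.5.
    return 0.2 + 0.6 * x[0]

def pi(x):
    # Hypothetical known P(R=1 | X=x), always at least DELTA.
    return DELTA + (1.0 - DELTA) * x[1]

def simulate(n, d, rng):
    """Draw n triples (x, y, r); y is None when r == 0."""
    data = []
    for _ in range(n):
        x = [rng.random() for _ in range(d)]
        r = 1 if rng.random() < pi(x) else 0
        y = 1 if rng.random() < theta(x) else 0
        data.append((x, y if r == 1 else None, r))
    return data

def horvitz_thompson(data):
    # Average of R*Y/pi(X); rows with R == 0 contribute zero.
    return sum(y / pi(x) for x, y, r in data if r == 1) / len(data)

def hoeffding_halfwidth(n, delta, alpha):
    # Each term lies in [0, 1/delta], so Hoeffding's inequality gives
    # sqrt(log(2/alpha) / (2 * n * delta^2)).
    return math.sqrt(math.log(2.0 / alpha) / (2.0 * n * delta ** 2))

rng = random.Random(1)
data = simulate(n=5000, d=10, rng=rng)
psi_hat = horvitz_thompson(data)
eps = hoeffding_halfwidth(len(data), DELTA, alpha=0.05)
interval = (psi_hat - eps, psi_hat + eps)
```

The estimator touches the covariates only through `pi(x)`, so its cost and its guarantee do not depend on the dimension `d`. The Hoeffding interval is conservative: its half-width scales like `1/(DELTA * sqrt(n))`.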
Remark: We are mentioning the Horvitz-Thompson estimator because it is simple. In practice, it has three deficiencies:
- It may exceed 1.
- It ignores data on the multivariate vector $X$ except for the one dimensional summary $\pi(X)$.
- It can be very inefficient.
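These problems are remedied by using an improved version of the Horvitz-Thompson estimator. One choice is the so-called locally semiparametric efficient regression estimator (Scharfstein et al., 1999):
$$\hat\psi = \int \mathrm{expit}\left( \sum_{m=1}^{k} \hat\eta_m \phi_m(x) + \frac{\hat\omega}{\pi(x)} \right) dx$$
where $\mathrm{expit}(a) = e^a / (1 + e^a)$, $\phi_m(x)$ are basis functions, and $\hat\eta_1, \ldots, \hat\eta_k, \hat\omega$ are the mle’s (among subjects with $R_i = 1$) in the model
$$\log\left( \frac{P(Y = 1 \mid X = x)}{P(Y = 0 \mid X = x)} \right) = \sum_{m=1}^{k} \eta_m \phi_m(x) + \frac{\omega}{\pi(x)}.$$
Here $k$ can increase slowly with $n$. Recently even more efficient estimators have been derived. Rotnitzky et al. (2012) provide a review. In the rest of this post, when we refer to the Horvitz-Thompson estimator, the reader should think “improved Horvitz-Thompson estimator.” End Remark.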
3. Bayesian Analysis
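To do a Bayesian analysis, we put some prior $W$ on $\Theta$. Next we compute the likelihood function. The likelihood for one observation takes the form $p(x)\, p(r \mid x)\, p(y \mid x)^{r}$. The reason for having $r$ in the exponent is that, if $r = 0$, then $y$ is not observed so the $p(y \mid x)$ gets left out. The likelihood for $n$ observations is
$$\prod_{i=1}^n p(X_i)\, p(R_i \mid X_i)\, p(Y_i \mid X_i)^{R_i} = \prod_{i=1}^n \pi(X_i)^{R_i} (1 - \pi(X_i))^{1 - R_i}\, \theta(X_i)^{Y_i R_i} (1 - \theta(X_i))^{(1 - Y_i) R_i}$$
where we used the fact that $p(x) = 1$. But remember, $p(r \mid x)$ is known. In other words, $\pi(X_i)^{R_i} (1 - \pi(X_i))^{1 - R_i}$ is known. So, the likelihood is
$$\mathcal{L}(\theta) \propto \prod_{i=1}^n \theta(X_i)^{Y_i R_i} (1 - \theta(X_i))^{(1 - Y_i) R_i}.$$
Combining this likelihood with the prior $W$ creates a posterior distribution on $\Theta$ which we will denote by $W_n$. Since the parameter of interest $\psi$ is a function of $\theta$, the posterior $W_n$ for $\theta$ defines a posterior distribution for $\psi$.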
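The claim that the known selection probabilities drop out of the likelihood can be checked numerically. The sketch below is an illustration with made-up ingredients: a four-row data set with a scalar covariate, two hypothetical candidates for the outcome function (`theta_a`, `theta_b`), and two hypothetical selection functions (`pi_1`, `pi_2`). It computes the full log-likelihood, selection terms included, and compares the log-likelihood ratio between the two candidates under each selection function.

```python
import math

def full_log_lik(theta, pi, data):
    """Log-likelihood of (x, y, r) rows, including the selection terms."""
    total = 0.0
    for x, y, r in data:
        if r == 1:
            total += math.log(pi(x))
            total += math.log(theta(x)) if y == 1 else math.log(1.0 - theta(x))
        else:
            total += math.log(1.0 - pi(x))  # y unobserved: no theta term
    return total

# Hypothetical toy data: (x, y, r), with y = None when r == 0.
data = [(0.1, 1, 1), (0.4, None, 0), (0.7, 0, 1), (0.9, 1, 1)]

def theta_a(x):
    return 0.3 + 0.4 * x

def theta_b(x):
    return 0.8 - 0.5 * x

def pi_1(x):
    return 0.5

def pi_2(x):
    return 0.1 + 0.8 * x

# Log-likelihood ratio of theta_a against theta_b under each selection rule.
ratio_under_pi_1 = full_log_lik(theta_a, pi_1, data) - full_log_lik(theta_b, pi_1, data)
ratio_under_pi_2 = full_log_lik(theta_a, pi_2, data) - full_log_lik(theta_b, pi_2, data)
```

The two ratios agree up to floating-point error, so any prior that does not itself involve the selection function yields the same posterior whichever selection function generated the data.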
Now comes the interesting part. The likelihood has essentially no information in it.
To see that the likelihood has no information, consider a simpler case where $\theta(x)$ is a function on $[0,1]$. Now discretize the interval into many small bins. Let $B$ be the number of bins. We can then replace the function $\theta$ with a high-dimensional vector $\theta = (\theta_1, \ldots, \theta_B)$. With $n < B$, most bins are empty. The data contain no information for most of the $\theta_j$’s. (You might wonder about the effect of putting a smoothness assumption on $\theta(\cdot)$. We’ll discuss this in Section 4.)
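The empty-bin argument is easy to see numerically. The following sketch assumes, purely for illustration, 1,000 uniform draws on the unit interval and 100,000 equal-width bins, and reports the fraction of bins that receive no observation at all.

```python
import random

def fraction_empty_bins(n, num_bins, rng):
    """Fraction of equal-width bins on [0,1] hit by none of n uniform draws."""
    occupied = set()
    for _ in range(n):
        # Bin index of a uniform draw; min() guards against rounding up to num_bins.
        occupied.add(min(int(rng.random() * num_bins), num_bins - 1))
    return 1.0 - len(occupied) / num_bins

rng = random.Random(0)
frac_empty = fraction_empty_bins(n=1000, num_bins=100_000, rng=rng)
```

At most 1,000 bins can be occupied, so at least 99% of the components of the discretized vector are never touched by the data, and only about half of the occupied bins (on average, for a selection probability near one half) would even carry an observed outcome.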
We should point out that if $\pi(x) = 1/2$ for all $x$, then Ericson (1969) showed that a certain exchangeable prior gives a posterior that, like the Horvitz-Thompson estimator, converges at rate $O(n^{-1/2})$. However we are interested in the case where $\pi(x)$ is a complex function of $x$; then the posterior will fail to concentrate around the true value of $\psi$. On the other hand, a flexible nonparametric prior will have a posterior essentially equal to the prior and, thus, not concentrate around $\psi$, whenever the prior $W$ does not depend on the known function $\pi(\cdot)$. Indeed, we have the following theorem from Robins and Ritov (1997):
Theorem. (Robins and Ritov 1997). Any estimator that is not a function of $\pi(\cdot)$ cannot be uniformly consistent.
This means that, at no finite sample size, will an estimator $\hat\psi$ that is not a function of $\pi$ be close to $\psi$ for all distributions in $\mathcal{P}$. In fact, the theorem holds for a neighborhood around every pair $(\pi, \theta)$. Uniformity is important because it links asymptotic behavior to finite sample behavior. But when $\pi$ is known and is used in the estimator (as in the Horvitz-Thompson estimator and its improved versions) we can have uniform consistency.
Note that a Bayesian will ignore $\pi$ since the $\pi(X_i)$’s are just constants in the likelihood. There is an exception: the Bayesian can make the posterior be a function of $\pi$ by choosing a prior $W$ that makes $\theta(\cdot)$ depend on $\pi(\cdot)$. But this seems very forced. Indeed, Robins and Ritov showed that, under certain conditions, any true subjective Bayesian prior $W$ must be independent of $\pi(\cdot)$. Specifically, they showed that once a subjective Bayesian queries the randomizer (who selected $\pi$) about the randomizer’s reasoned opinions concerning $\theta(\cdot)$ (but not $\pi(\cdot)$) the Bayesian will have independent priors. We note that a Bayesian can have independent priors even when he believes with probability 1 that $\pi(\cdot)$ and $\theta(\cdot)$ are positively correlated as functions of $x$, i.e. $\int \theta(x) \pi(x)\, dx > \int \theta(x)\, dx \int \pi(x)\, dx$. Having independent priors only means that learning $\pi(\cdot)$ will not change one’s beliefs about $\theta(\cdot)$. So far, so good. As far as we know, Chris agrees with everything up to this point.
4. Some Bayesian Responses
Chris goes on to raise alternative Bayesian approaches.
The first is to define
$$Z_i = \frac{R_i Y_i}{\pi(X_i)}.$$
Note that $Z_i \in \{0\} \cup [1, \infty)$. Now we ignore (throw away) the original data. Chris shows that we can then construct a model for $Z_i$ which results in a posterior for $\psi$ that mimics the Horvitz-Thompson estimator. We’ll comment on this below, but note two strange things. First, it is odd for a Bayesian to throw away data. Second, the new data are a function of $\pi(X_i)$ which forces the posterior to be a function of $\pi$. But as we noted earlier, when $\theta$ and $\pi$ are a priori independent, the $\pi(X_i)$’s do not appear in the posterior since they are known constants that drop out of the likelihood.
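To make the data reduction concrete, here is a small sketch of the transformed variable (observed outcome divided by selection probability, zero when the outcome is missing). The toy rows, the scalar covariate, and the selection function `pi` are hypothetical. After the transformation only the list `z` is retained; the original triples are discarded.

```python
def pi(x):
    # Hypothetical known selection probability, bounded below by 0.1.
    return 0.1 + 0.9 * x

def z_values(data):
    """Z = R * Y / pi(X): zero when the outcome is missing or equal to 0."""
    return [y / pi(x) if r == 1 else 0.0 for x, y, r in data]

# Hypothetical toy rows (x, y, r); y is None when r == 0.
data = [(0.0, 1, 1), (0.5, 0, 1), (0.75, 1, 1), (0.25, None, 0)]

z = z_values(data)
z_mean = sum(z) / len(z)  # sample mean of Z
```

Every value in `z` is either 0 or at least 1, because the selection probability is at most 1, and `z_mean` coincides with the inverse-probability-weighted average of the observed outcomes. Each nonzero entry already encodes the selection probability, which is how the posterior built on `z` ends up depending on it.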
A second approach (not mentioned explicitly by Chris), which is related to the above idea, is to construct a prior $W$ that depends on the known function $\pi$. It can be shown that if the prior is chosen just right then again the posterior for $\psi$ mimics the (improved) Horvitz-Thompson estimator.
Lastly, Chris notes that the posterior contains no information because we have not enforced any smoothness on $\theta(x)$. Without smoothness, knowing $\theta(x)$ does not tell you anything about $\theta(x + \epsilon)$ (assuming the prior $W$ does not depend on $\pi$).
This is true and better inferences would obtain if we used a prior that enforced smoothness. But this argument falls apart when $d$ is large. (In fairness to Chris, he was referring to the version from Wasserman (2004) which did not invoke high dimensions.) When $d$ is large, forcing $\theta(x)$ to be smooth does not help unless you make it very, very, very smooth. The larger $d$ is, the more smoothness you need to get borrowing of information across different values of $\theta(\cdot)$. But this introduces a huge bias which precludes uniform consistency.
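One informal way to see why smoothness buys so little in high dimensions is to look at how far each observation is from its nearest neighbor. The sketch below is an illustration only and is not taken from the papers discussed here; the sample size of 100 and the dimensions 1 and 200 are arbitrary assumptions. It draws uniform points on the unit cube and averages the Euclidean distance to the nearest other point.

```python
import math
import random

def mean_nearest_neighbor_distance(n, d, rng):
    """Average distance from each of n uniform points in [0,1]^d to its nearest neighbor."""
    points = [[rng.random() for _ in range(d)] for _ in range(n)]
    total = 0.0
    for i, p in enumerate(points):
        # Brute-force search over all other points.
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / n

rng = random.Random(0)
nn_low = mean_nearest_neighbor_distance(n=100, d=1, rng=rng)
nn_high = mean_nearest_neighbor_distance(n=100, d=200, rng=rng)
```

In one dimension the nearest neighbor is very close, so a mildly smooth function can share information between neighboring points. In 200 dimensions the nearest neighbor is several units away, so the function would have to be nearly flat across the cube before one observation says anything about another.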
5. Response to the Response
We have seen that response 3 (add smoothness conditions in the prior) doesn’t work. What about response 1 and response 2? We agree that these work, in the sense that the Bayes answer has good frequentist behavior by mimicking the (improved) Horvitz-Thompson estimator.
But this is a Pyrrhic victory. If we manipulate the data to get a posterior that mimics the frequentist answer, is this really a success for Bayesian inference? Is it really Bayesian inference at all? Similarly, if we choose a carefully constructed prior just to mimic a frequentist answer, is it really Bayesian inference?
We call Bayesian inference which is carefully manipulated to force an answer with good frequentist behavior, frequentist pursuit. There is nothing wrong with it, but why bother?
If you want good frequentist properties just use the frequentist estimator. If you want to be a Bayesian, be a Bayesian but accept the fact that, in this example, your posterior will fail to concentrate around the true value.
In summary, we agree with Chris’ analysis. But his fix is just frequentist pursuit; it is Bayesian analysis with unnatural manipulations aimed only at forcing the Bayesian answer to be the frequentist answer. This seems to us to be an admission that Bayes fails in this example.
Basu, D. (1969). Role of the Sufficiency and Likelihood Principles in Sample Survey Theory. Sankhya, 31, 441-454.
Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18, 471-493.
Ericson, W.A. (1969). Subjective Bayesian models in sampling finite populations. Journal of the Royal Statistical Society. Series B, 195-233.
Foster, D.P. and George, E.I. (1996). A simple ancillarity paradox. Scandinavian journal of statistics, 233-242.
Godambe, V. P., and Thompson, M. E. (1976), Philosophy of Survey-Sampling Practice. In Foundations of Probability Theory, Statistical Inference and Statistical Theories of Science, eds. W.L.Harper and A.Hooker, Dordrecht: Reidel.
Robins, J.M. and Ritov, Y. (1997). Toward a Curse of Dimensionality Appropriate (CODA) Asymptotic Theory for Semi-parametric Models. Statistics in Medicine, 16, 285–319.
Robins, J. and Wasserman, L. (2000). Conditioning, likelihood, and coherence: a review of some foundational concepts. Journal of the American Statistical Association, 95, 1340-1346.
Rotnitzky, A., Lei, Q., Sued, M. and Robins, J.M. (2012). Improved double-robust estimation in missing data and causal inference models. Biometrika, 99, 439-456.
Scharfstein, D.O., Rotnitzky, A. and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models. Journal of the American Statistical Association, 1096-1120.
Sims, Christopher. On An Example of Larry Wasserman. Available at: http://www.princeton.edu/~sims/.
Wasserman, L. (2004). All of Statistics: a Concise Course in Statistical Inference. Springer Verlag.