Monthly Archives: September 2012

The Remarkable k-means++

1. The Problem

One of the most popular algorithms for clustering is k-means. We start with {n} data vectors {Y_1,\ldots, Y_n\in \mathbb{R}^d}. We choose {k} vectors — cluster centers — {c_1,\ldots, c_k \in \mathbb{R}^d} to minimize the error

\displaystyle  R(c_1,\ldots,c_k) = \frac{1}{n}\sum_{i=1}^n \min_{1\leq j \leq k} ||Y_i - c_j||^2.

Unfortunately, finding {c_1,\ldots, c_k} to minimize {R(c_1,\ldots,c_k)} is NP-hard. The usual iterative method, Lloyd's algorithm (http://en.wikipedia.org/wiki/Lloyd%27s_algorithm), is easy to implement but it is unlikely to come close to minimizing the objective function. So finding

\displaystyle  \min_{c_1,\ldots, c_k}R(c_1,\ldots,c_k)

isn’t feasible.

To deal with this, many people choose random starting values, run the {k}-means clustering algorithm, then rinse, lather and repeat. In general, this may work poorly and there is no theoretical guarantee of getting close to the minimum. Finding a practical method for approximately minimizing {R} is thus an important problem.

2. The Solution

David Arthur and Sergei Vassilvitskii came up with a wonderful solution in 2007 known as k-means++.

The algorithm is simple and comes with a precise theoretical guarantee.

The first step is to choose a data point at random. Call this point {s_1}. Next, compute the squared distances

\displaystyle  D_i^2 = ||Y_i - s_1||^2.

Next, choose a second point {s_2} from the data, where the probability of choosing {Y_i} is {D_i^2/\sum_j D_j^2}. Now recompute the squared distances as

\displaystyle  D_i^2 = \min\Bigl\{ ||Y_i - s_1||^2, ||Y_i - s_2||^2 \Bigr\}.

Now choose a third point {s_3} from the data where the probability of choosing {Y_i} is {D_i^2/\sum_j D_j^2}. We continue until we have {k} points {s_1,\ldots,s_k}. Finally, we run {k}-means clustering using {s_1,\ldots,s_k} as starting values. Call the resulting centers {\hat c_1,\ldots, \hat c_k}.
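Here is a minimal sketch of the seeding step in Python (the function name and interface are mine, not from the paper); the chosen points would then be handed to an ordinary {k}-means routine as starting values. For what it is worth, scikit-learn's KMeans already uses this seeding by default (init='k-means++').

    import numpy as np

    def kmeans_pp_seeds(Y, k, seed=None):
        """Pick k starting centers from the rows of Y by k-means++ seeding."""
        rng = np.random.default_rng(seed)
        n = Y.shape[0]
        centers = [Y[rng.integers(n)]]                      # first center: uniform draw
        D2 = np.sum((Y - centers[0]) ** 2, axis=1)          # squared distance to nearest chosen center
        for _ in range(1, k):
            idx = rng.choice(n, p=D2 / D2.sum())            # P(choose Y_i) = D_i^2 / sum_j D_j^2
            centers.append(Y[idx])
            D2 = np.minimum(D2, np.sum((Y - Y[idx]) ** 2, axis=1))
        return np.array(centers)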

Arthur and Vassilvitskii prove that

\displaystyle  \mathbb{E}[R(\hat c_1,\ldots,\hat c_k)] \leq 8 (\log k +2) \min_{c_1,\ldots, c_k}R(c_1,\ldots,c_k).

The expected value is over the randomness in the algorithm.

There are various improvements to the algorithm, both in terms of computation and in terms of getting a sharper performance bound.

This is quite remarkable. One simple fix, and an intractable problem has become tractable. And the method comes armed with a theorem.

3. Questions

  1. Is there an R implementation? It is easy enough to code the algorithm but it really should be part of the basic k-means function in R.
  2. Is there a version for mixture models? If not, it seems like a paper waiting to be written.
  3. Are there other intractable statistical problems that can be solved using simple randomized algorithms with provable guarantees? (MCMC doesn’t count because there is no finite sample guarantee.)

4. Reference

Arthur, D. and Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 1027–1035.


Machine Learning In New York

I wanted to let you know that, on October 19, the New York Academy of Sciences is hosting its
7th Annual Machine Learning Symposium.

For details, see here:

link to symposium

Screening: Everything Old Is New Again

Screening is one of the oldest methods for variable selection. It refers to doing a bunch of marginal (single covariate) regressions instead of one multiple regression. When I was in school, we were taught that it was a bad thing to do.

Now, screening is back in fashion. It’s a whole industry. And before I throw stones, let me admit my own guilt: see Wasserman and Roeder (2009).

1. What Is it?

Suppose that the data are {(X_1,Y_1),\ldots, (X_n,Y_n)} with

\displaystyle  Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_d X_{id} + \epsilon_i.

To simplify matters, assume that {\beta_0=0}, {\mathbb{E}(X_{ij})=0} and {{\rm Var}(X_{ij})=1}. Let us assume that we are in the high dimensional case where {n < d}. To perform variable selection, we might use something like the lasso.

But if we use screening, we instead do the following. We regress {Y} on {X_1}, then we regress {Y} on {X_2}, then we regress {Y} on {X_3}, and so on. In other words, we do {d} one-dimensional regressions. Denote the regression coefficients by {\hat\alpha_1,\hat\alpha_2,\ldots}. We keep the covariates associated with the largest values of {|\hat\alpha_j|}. We might then do a second step, such as running the lasso on the covariates that we kept.
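Here is a minimal sketch of the screening step (the function name and the cutoff {m} are mine, and the columns of X are assumed standardized); a second-stage lasso would then be run on the kept columns.

    import numpy as np

    def screen(X, Y, m):
        """Marginal screening: keep the m covariates with the largest |alpha_hat_j|."""
        # d one-dimensional regressions: slope of Y on each column of X separately.
        alpha_hat = X.T @ Y / np.sum(X ** 2, axis=0)
        keep = np.argsort(-np.abs(alpha_hat))[:m]   # indices of the m largest |alpha_hat_j|
        return keep, alpha_hat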

What are we actually estimating when we regress {Y} on the {j^{\rm th}} covariate? It is easy to see that

\displaystyle  \mathbb{E}(\hat\alpha_j) = \alpha_j

where

\displaystyle  \alpha_j = \beta_j + \sum_{s\neq j} \beta_s \rho_{sj}

and {\rho_{sj}} is the correlation between {X_j} and {X_s}.
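To spell out the step: since {\mathbb{E}(X_{ij})=0}, {{\rm Var}(X_{ij})=1} and the errors are uncorrelated with the covariates, the population slope from regressing {Y} on {X_j} alone is

\displaystyle  \alpha_j = \frac{{\rm Cov}(Y,X_j)}{{\rm Var}(X_j)} = {\rm Cov}\Bigl(\sum_s \beta_s X_{s} + \epsilon,\ X_j\Bigr) = \beta_j + \sum_{s\neq j} \beta_s \rho_{sj}.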

2. Arguments in Favor of Screening

If you miss an important variable during the screening phase you are in trouble. This will happen if {|\beta_j|} is big but {|\alpha_j|} is small. Can this happen?

Sure. You can certainly find values of the {\beta_j}’s and the {\rho_{sj}}’s to make {\beta_j} big and make {\alpha_j} small. In fact, you can make {|\beta_j|} huge while making {\alpha_j=0}. This is sometimes called unfaithfulness in the literature on graphical models.
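For a concrete (made up) example, take {d=2}, {\beta_1 = 10}, {\beta_2 = -20} and {\rho_{12}=1/2}. Then

\displaystyle  \alpha_1 = \beta_1 + \beta_2 \rho_{12} = 10 + (-20)(1/2) = 0,

so the marginal regression of {Y} on {X_1} sees nothing, even though {\beta_1} is large.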

However, the set of {\beta} vectors that are unfaithful has Lebesgue measure 0. Thus, in some sense, unfaithfulness is “unlikely” and so screening is safe.

3. Arguments Against Screening

Not so fast. In order to screw up, it is not necessary to have exact unfaithfulness. All we need is approximate unfaithfulness. And the set of approximately unfaithful {\beta}’s is a non-trivial subset of {\mathbb{R}^d}: it has positive Lebesgue measure.

But it’s worse than that. Cautious statisticians want procedures that have properties that hold uniformly over the parameter space. Screening cannot be successful in any uniform sense because of the unfaithful (and nearly unfaithful) distributions.

And if we admit that the linear model is surely wrong, then things get even worse.

4. Conclusion

Screening is appealing because it is fast, easy and scalable. But it makes a strong (and unverifiable) assumption that you are not unlucky and have not encountered a case where {\alpha_j} is small but {\beta_j} is big.

Sometimes I find the arguments in favor of screening to be appealing but when I’m in a more skeptical (sane?) frame of mind, I find screening to be quite unreasonable.

What do you think?

Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37, 2178.

High Dimensional Undirected Graphical Models
Larry Wasserman

Graphical models have now become a common tool in applied statistics. Here is a graph representing stock data:

Here is one for proteins: (Maslov and Sneppen, 2002).

And here is one for the voting records of senators: (Banerjee, El Ghaoui, and d’Aspremont, 2008).

Like all statistical methods, estimating graphs is challenging in high dimensional problems.

1. Undirected Graphical Models

Let {X=(X_1,\ldots,X_d)} be a random vector with distribution {P}. A graph {G} for {P} has {d} nodes, or vertices, one for each variable. Some pairs of vertices are connected by edges. We can represent the edges as a set {E} of unordered pairs: {(j,k)\in E} if and only if there is an edge between {X_j} and {X_k}.

We omit an edge between {X_i} and {X_j} to mean that {X_i} and {X_j} are independent given the other variables, which we write as

\displaystyle  X_i \amalg X_j | rest

where “rest” denotes the rest of the variables. For example, in this graph

\displaystyle  X_1 \ \ \rule{2in}{1mm} \ \ X_2 \ \ \rule{2in}{1mm}\ \ X_3

we see that {X_1\amalg X_3 | X_2} since there is no edge between {X_1} and {X_3}.

Now suppose we observe {n} random vectors

\displaystyle  X^{(1)},\ldots, X^{(n)} \sim P.

The goal is to estimate {G} from the data. As an example, imagine that {X^{(i)}} is a vector of gene expression levels for subject {i}. The graph gives us an idea of how the genes are related.

2. The Gaussian Case

Things are easiest if we assume that {P} is a multivariate Gaussian. Suppose {X\sim N(\mu,\Sigma)}. In this case, there is no edge between {X_j} and {X_k} iff {\Omega_{jk}=0} where {\Omega = \Sigma^{-1}}. If the dimension {d} of {X} is smaller than the sample size {n}, then estimating the graph is straightforward. We can use the sample covariance matrix

\displaystyle  S = \frac{1}{n}\sum_{i=1}^n (X^{(i)}-\overline{X})(X^{(i)}-\overline{X})^T

to estimate {\Sigma} and we use {S^{-1}} to estimate {\Omega}. It is then easy to test {H_0: \Omega_{jk}=0}. We put an edge between {j} and {k} if the test rejects. The R package SIN will do this for you.
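Here is a rough sketch of this low dimensional recipe in Python (this is not the SIN package's exact procedure; the Fisher-transform test and the cutoff alpha are one standard choice):

    import numpy as np
    from scipy import stats

    def gaussian_graph(X, alpha=0.05):
        """Estimate the graph when d < n by testing H0: Omega_jk = 0 for each pair."""
        n, d = X.shape
        Omega = np.linalg.inv(np.cov(X, rowvar=False))
        # Partial correlation of X_j and X_k given the rest: -Omega_jk / sqrt(Omega_jj Omega_kk).
        pcorr = -Omega / np.sqrt(np.outer(np.diag(Omega), np.diag(Omega)))
        edges = []
        for j in range(d):
            for k in range(j + 1, d):
                z = 0.5 * np.log((1 + pcorr[j, k]) / (1 - pcorr[j, k]))   # Fisher transform
                stat = np.sqrt(n - d - 1) * abs(z)                        # d - 2 conditioning variables
                if 2 * (1 - stats.norm.cdf(stat)) < alpha:                # two-sided test rejects
                    edges.append((j, k))
        return edges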

When {d>n}, the simple approach above won’t work. Perhaps the easiest approach is the method due to Meinshausen and Bühlmann (2006) which John Lafferty has aptly named parallel regression. The idea is this: you regress {X_1} on all the other variables, then you regress {X_2} on all the other variables, etc. For each regression, you use the lasso which yields sparse estimators. When you regress {X_j} on the others, there will be a regression coefficient for {X_k}. Call this {\hat\beta_{jk}}. Similarly, when you regress {X_k} on the others, there will be a regression coefficient for {X_j}. Call this {\hat\beta_{kj}}. If {\hat\beta_{jk}\neq 0} and {\hat\beta_{kj}\neq 0} then you put an edge between {X_j} and {X_k}.
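Here is a minimal sketch of parallel regression using scikit-learn's Lasso (the penalty lam is a tuning parameter the user must choose; the "and" rule below matches the description above):

    import numpy as np
    from sklearn.linear_model import Lasso

    def parallel_regression_graph(X, lam=0.1):
        """Meinshausen-Buhlmann style neighborhood selection (rough sketch)."""
        n, d = X.shape
        B = np.zeros((d, d))
        for j in range(d):
            others = [k for k in range(d) if k != j]
            fit = Lasso(alpha=lam).fit(X[:, others], X[:, j])   # lasso of X_j on the rest
            B[j, others] = fit.coef_
        # Edge (j, k) if beta_hat_{jk} != 0 and beta_hat_{kj} != 0.
        return [(j, k) for j in range(d) for k in range(j + 1, d)
                if B[j, k] != 0 and B[k, j] != 0]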

An alternative — the glasso (graphical lasso) — is to maximize the Gaussian log-likelihood {\ell(\mu,\Omega)} with a penalty:

\displaystyle  \ell(\mu,\Omega) + \lambda \sum_{j\neq k} |\Omega_{jk}|.

The resulting estimator {\hat\Omega} is sparse: many of its elements are 0. The non-zeroes denote edges. A fast implementation in R due to Han Liu can be found here.
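As an illustration (not the R implementation mentioned above), scikit-learn also ships a graphical lasso; a minimal sketch:

    import numpy as np
    from sklearn.covariance import GraphicalLasso

    def glasso_graph(X, lam=0.1):
        """Graphical lasso: edges are the nonzero off-diagonal entries of Omega_hat."""
        Omega_hat = GraphicalLasso(alpha=lam).fit(X).precision_   # sparse estimate of Sigma^{-1}
        d = Omega_hat.shape[0]
        return [(j, k) for j in range(d) for k in range(j + 1, d)
                if abs(Omega_hat[j, k]) > 1e-10]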

3. Relaxing Gaussianity

The biggest drawback of the glasso is the assumption of Gaussianity. With my colleagues John Lafferty and Han Liu and others, I have done some work on more nonparametric approaches.

Let me briefly describe the approach in Liu, Xu, Gu, Gupta, Lafferty and Wasserman (2011). The idea is to restrict the graph to be a forest, which is a graph with no cycles.

When {G} is a forest, the density {p} can be written as

\displaystyle  p(y) = \prod_{(j,k)\in E} \frac{p(y_j,y_k)}{p(y_j)p(y_k)}\prod_{s=1}^d p(y_s)

where {E} is the set of edges. We can estimate the density by inserting estimates of the univariate and bivariate marginals:

\displaystyle  \hat{p}(y) = \prod_{(j,k)\in E} \frac{\hat p(y_j,y_k)}{\hat p(y_j)\hat p(y_k)}\prod_{s=1}^d \hat p(y_s).
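For example, for the chain {X_1 - X_2 - X_3}, with {E=\{(1,2),(2,3)\}}, the factorization reduces to

\displaystyle  p(y) = \frac{p(y_1,y_2)\,p(y_2,y_3)}{p(y_2)},

the usual Markov chain factorization.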

But to find the graph we need to estimate the edge set {E}.

Any forest {F} defines a density {p_F(y)} by the above equation. If the true density {p} were known, we could choose {F} to minimize the Kullback-Leibler distance

\displaystyle  D(p,p_F) = \int p(y) \log\left(\frac{p(y)}{p_F(y)} \right) dy.

Minimizing {D(p,p_F)} over forests turns out to be equivalent to maximizing the sum of the mutual informations across the edges of {F}, and this maximization can be done by Kruskal’s algorithm (Kruskal 1956), also known in this context as the Chow-Liu algorithm (Chow and Liu 1968). It works like this.

For each pair {(X_j,X_k)} let

\displaystyle  I(X_j,X_k) = \int p(y_j,y_k) \log \frac{p(y_j,y_k)}{p(y_j)p(y_k)} dy_j\, dy_k

be the mutual information between {X_j} and {X_k}. We start with an empty graph and add edges greedily according to the value of {I(X_j,X_k)}: first we connect the pair of variables with the largest {I(X_j,X_k)}, then the pair with the second largest, and so on. However, we do not add an edge if it would form a cycle.

Of course, we don’t know {p} so, instead, we use the data to estimate the mutual information {I(X_j,X_k)}. If we simply run the Chow-Liu algorithm to completion, we get a spanning tree, that is, a connected forest. But we also use a hold-out sample to decide when to stop adding edges. And that’s it.
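Here is a rough sketch of the greedy step (the histogram plug-in estimate of mutual information and the fixed edge budget are crude stand-ins for the estimates and the hold-out stopping rule used in the paper):

    import numpy as np

    def plugin_mi(x, y, bins=10):
        """Crude plug-in estimate of the mutual information from a 2-d histogram."""
        pxy, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = pxy / pxy.sum()
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

    def chow_liu_forest(X, max_edges=None):
        """Add edges in decreasing order of estimated MI, skipping any that form a cycle."""
        n, d = X.shape
        max_edges = d - 1 if max_edges is None else max_edges    # d - 1 gives the full tree
        scores = sorted(((plugin_mi(X[:, j], X[:, k]), j, k)
                         for j in range(d) for k in range(j + 1, d)), reverse=True)
        parent = list(range(d))                                  # union-find to detect cycles
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        edges = []
        for mi, j, k in scores:
            rj, rk = find(j), find(k)
            if rj != rk:                                         # adding (j, k) creates no cycle
                parent[rj] = rk
                edges.append((j, k))
                if len(edges) == max_edges:
                    break
        return edges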

There are other ways to relax the Gaussian assumption. For example, here is an approach that uses rank statistics and copulas.

4. The Future?

Not too long ago, estimating a high dimensional graph was unthinkable. Now such graphs are estimated routinely. The biggest thing missing is a good measure of uncertainty. One can do some obvious things, such as bootstrapping, but it is not clear that the output is meaningful.

5. References

Banerjee, O. and El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485-516.

Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462-467.

Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7, 48-50.

Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest Density Estimation. Journal of Machine Learning Research, 12, 907-951.

Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. Science, 296, 910.

Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

Hunting For Manifolds
Larry Wasserman

In this post I’ll describe one aspect of the problem of estimating a manifold, or manifold learning, as it is called.

Suppose we have data {Y_1,\ldots, Y_n \in \mathbb{R}^D} and suppose that the data lie on, or close to, a manifold {M} of dimension {d< D}. There are two different goals we might have in mind.

The first goal is to use the fact that the data lie near the manifold as a way of doing dimension reduction. Popular methods for solving this problem include isomap, locally linear embedding, diffusion maps, and Laplacian eigenmaps, among others.

The second goal is to actually estimate the manifold {M}.

I have had the good fortune to work with two groups of people on this problem. With the first group (Chris Genovese, Marco Perone-Pacifico, Isa Verdinelli and me) we have focused mostly on certain geometric aspects of the problem. With the second group (Sivaraman Balakrishnan, Don Sheehy, Aarti Singh, Alessandro Rinaldo and me) we have focused on topology.

It is this geometric problem (my work with the first group) that I will discuss here. In particular, I will focus on these two papers: here and here.

1. Statistical Model

To make some progress, we introduce a statistical model for the data. Here are two models.

Model I (Clutter Noise): With probability {\pi} we draw {Y_i} from a distribution {G} supported on {M}. With probability {1-\pi} we draw {Y_i} from a uniform distribution on some compact set {K}. To be concrete, let’s take {K = [0,1]^D}. The distribution {P} of {Y_i} is

\displaystyle  P = (1-\pi) U + \pi G

where {U} is uniform on {[0,1]^D}.

Model II (Additive Noise): We draw {X_i} from a distribution {G} supported on {M}. Then we let {Y_i = X_i + \epsilon_i} where {\epsilon_i} is {D}-dimensional Gaussian noise. In this case, {P} has density {p} given by

\displaystyle  p(y) = \int_M \phi(y-z) dG(z)

where {\phi} is the Gaussian density. We can also write this as

\displaystyle  P = G \star \Phi

where {\star} denotes convolution.

There are some interesting complications. First, note that {G} is a singular distribution. That is, there are sets with positive probability but 0 Lebesgue measure. For example, suppose that {M} is a circle in {\mathbb{R}^2} and that {G} is uniform on the circle. Then the circle has Lebesgue measure 0 but {G(M)=1}. Model II has the complication that it is a convolution and convolutions are not easy to deal with.
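For intuition, here is a small simulation sketch of the two models with {M} the unit circle in {\mathbb{R}^2} and {G} uniform on the circle (I take the clutter to be uniform on {[-1,1]^2} rather than {[0,1]^D} so that it covers the circle; the function name and the defaults are mine):

    import numpy as np

    def sample_circle_models(n, pi=0.8, sigma=0.1, seed=None):
        """Draw n points from Model I (clutter noise) and Model II (additive noise)."""
        rng = np.random.default_rng(seed)
        t = rng.uniform(0, 2 * np.pi, n)
        on_M = np.column_stack([np.cos(t), np.sin(t)])          # draws from G, supported on M
        # Model I: with probability pi keep the manifold point, otherwise uniform clutter.
        keep = rng.uniform(size=n) < pi
        clutter = rng.uniform(-1, 1, size=(n, 2))
        Y_clutter = np.where(keep[:, None], on_M, clutter)
        # Model II: add D-dimensional Gaussian noise, so P = G convolved with Phi.
        Y_additive = on_M + sigma * rng.standard_normal((n, 2))
        return Y_clutter, Y_additive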

Our goal is to construct an estimate {\hat M}.

2. Hausdorff Distance

We will compare {\hat M} to {M} using the Hausdorff distance. If {A} and {B} are sets, the Hausdorff distance between {A} and {B} is

\displaystyle  H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B\oplus \epsilon,\ {\rm and}\ B \subset A\oplus \epsilon\Bigr\}

where

\displaystyle  A\oplus \epsilon = \bigcup_{x\in A} B_D(x,\epsilon)

and {B_D(x,\epsilon)} is a ball of radius {\epsilon} centered at {x}.

Let’s decode what this means. Imagine we grow the set {A} by placing a ball of radius {\epsilon} around each point in {A}. We keep growing {A} until it engulfs {B}; let {\epsilon_1} be the size of {\epsilon} needed to do this. Now grow {B} until it engulfs {A}, and let {\epsilon_2} be the size of {\epsilon} needed to do this. Then {H(A,B) = \max\{\epsilon_1,\epsilon_2\}}.

There is another way to think of the Hausdorff distance. If {A} is a set and {x} is a point, let {d_A(x)} be the distance from {x} to {A}:

\displaystyle  d_A(x) = \inf_{y\in A}||y-x||.

We call {d_A} the distance function. Then

\displaystyle  H(A,B) = \sup_x | d_A(x) - d_B(x)|.
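For finite point sets, the distance-function formulation is easy to compute directly; a minimal sketch:

    import numpy as np
    from scipy.spatial.distance import cdist

    def hausdorff(A, B):
        """Hausdorff distance between two finite point sets (the rows of A and B)."""
        D = cdist(A, B)                    # all pairwise distances ||a - b||
        eps1 = D.min(axis=1).max()         # sup over a in A of d_B(a): grow B until it engulfs A
        eps2 = D.min(axis=0).max()         # sup over b in B of d_A(b): grow A until it engulfs B
        return max(eps1, eps2)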

Using {H} as our loss function, our goal is to find the minimax risk:

\displaystyle  R_n = \inf_{\hat M}\sup_{P\in {\cal P}} \mathbb{E}_P [ H(\hat M,M)]

where the infimum is over all estimators {\hat M} and the supremum is over a set of distributions {{\cal P}}. Actually, we will consider a large set of manifolds {{\cal M}} and we will associate a distribution {P} with each manifold {M}. This will define the set {{\cal P}}.

3. Reach

We will consider all {d}-dimensional manifolds {M} with positive reach. The reach is a property of a manifold that appears in many places in manifold estimation theory. I believe it first appeared in Federer (1959).

The reach {\kappa(M)} is the largest number {\epsilon} such that each point in {M\oplus \epsilon} has a unique projection onto {M}. A set of non-zero reach has two nice properties: first, it is smooth and, second, it is self-avoiding.

Consider a circle {M} in the plane with radius {R}. Every point in the plane has a unique projection onto {M} with one exception: the center {c} of the circle is distance {R} from each point on {M}. There is no unique closest point on {M} to {c}. So the reach is {R}.

A plane has infinite reach. A line with a corner has reach 0. Here is an example for a curve {M} in two dimensions. Suppose you put a short line, perpendicular to {M}, at each point of {M}. If the length of the lines is less than reach({M}), the lines won’t cross. If the length of the lines is larger than reach({M}), some of the lines will cross.


Requiring {M} to have positive reach is a strong condition. There are ways to weaken the condition based on generalizations of reach, but we won’t go into those here.

The set of manifolds we are interested in is the set of {d}-manifolds contained in {[0,1]^D} with reach at least {\kappa >0}. Here, {\kappa} is some arbitrary, fixed positive number.

4. The Results

Recall that the minimax risk is

\displaystyle  R_n = \inf_{\hat M}\sup_{P\in {\cal P}} \mathbb{E}_P [ H(\hat M,M)].

In the case of clutter noise it turns out that

\displaystyle  R_n \approx \left(\frac{1}{n}\right)^{\frac{2}{d}}.

Note that the rate depends on {d} but not on the ambient dimension {D}. But in the case of additive Gaussian noise,

\displaystyle  R_n \approx \left(\frac{1}{\log n}\right).

The latter result says that {R_n} approaches 0 very slowly. For all practical purposes, it is not possible to estimate {M}. This might be surprising. But, in fact, the result is not unexpected.

We can explain this by analogy with another problem. Suppose you want to estimate a density {p} from {n} observations {Y_1,\ldots, Y_n \sim p}. If {p} satisfies standard smoothness assumptions, then the minimax risk has the form

\displaystyle  \left(\frac{1}{n}\right)^{\frac{4}{4+d}}

which is a nice, healthy, polynomial rate. But if we add some Gaussian noise to each observation, the minimax rate of convergence for estimating {p} is logarithmic (Fan 1991). The same thing happens if we add Gaussian noise to the covariates in a regression problem. A little bit of Gaussian noise makes these problems essentially impossible.

Is there a way out of this? Yes. Instead of estimating {M} we estimate a set {M'} that is close to {M}. In other words, we have to live with some bias. We can estimate {M'}, which we call a surrogate for {M}, with a polynomial rate.

We are finishing a paper on this idea and I’ll write a post about it when we are done.

5. Conclusion

Estimating a manifold is a difficult statistical problem. With additive Gaussian noise, it is nearly impossible. And notice that I assumed that the dimension {d} of the manifold was known. Things are even harder when {d} is not known.

However, none of these problems are insurmountable as long as we change the goal from “estimating {M}” to “estimating an approximation of {M}” which, in practice, is usually quite reasonable.

6. References

Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist. 19 1257-1272.

Federer, H. (1959). Curvature measures. Trans. Amer. Math. Soc. 93 418-491.

Genovese, C., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2012). Minimax Manifold Estimation. Journal of Machine Learning Research, 13, 1263–1291.

Genovese, C., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2012). Manifold estimation and singular deconvolution under Hausdorff loss. The Annals of Statistics, 40, 941-963.

Robins and Wasserman Respond to a Nobel Prize Winner Continued: A Counterexample to Bayesian Inference?

This is a response to Chris Sims’ comments on our previous blog post. Because of the length of our response, we are making this a new post rather than putting it in the comments of the last post.

Recall that we observe {n} iid observations {O=\left( X,R,RY\right)}, where {Y} and {R} are Bernoulli and independent given {X}. Define {\theta \left( X\right) \equiv E \left[ Y|X\right]} and {\pi \left( X\right) \equiv E\left[ R|X\right]}. We assume that {\pi \left( \cdot \right)} is a known function. Also, the marginal density {p\left( x\right)} of {X=\left( X_{1},...,X_{d}\right)} (with {d=100,000}) is known and uniform on the unit cube in {\mathbb{R}^{d}}. Our goal is estimation of

\displaystyle   \psi \equiv E\left[ Y\right] =E\left\{ E\left[ Y|X\right] \right\} =  E\left\{E\left[ Y|X,R=1\right] \right\} =\int_{[0,1]^d} \theta \left( x\right) dx.

The likelihood

\displaystyle   \prod_{i=1}^{n}p(X_{i})p(R_{i}|X_{i})p(Y_{i}|X_{i})^{R_{i}}=\left\{  \prod_{i}\pi (X_{i})^{R_{i}}(1-\pi (X_{i}))^{1-R_{i}}\right\} \left\{  \prod_{i}\,\theta (X_{i})^{Y_{i}R_{i}}(1-\theta  (X_{i}))^{(1-Y_{i})R_{i}}\right\} \

factors into two parts – the first depending on {\pi \left( \cdot \right)} and the second on {\theta \left( \cdot \right)}.

1. Selection Bias

This is a point of agreement. There is selection bias if and only if

\displaystyle   Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}\neq 0.

Note that

\displaystyle   E\left[ Y\right] =E\left[ Y|R=1\right] -  \frac{Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} }{E[R]}.
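(To verify the identity, use the fact that {Y} and {R} are independent given {X}:

\displaystyle   E\left[ Y|R=1\right] = \frac{E\left[ YR\right]}{E\left[ R\right]} = \frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} + E\left\{ \theta \left( X\right)\right\} E\left\{ \pi \left( X\right)\right\}}{E\left\{ \pi \left( X\right)\right\}} = E\left[ Y\right] + \frac{Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\}}{E[R]},

since {E\left[ YR\right]=E\left\{ \theta \left( X\right)\pi \left( X\right)\right\}}, {E\left[ R\right]=E\left\{ \pi \left( X\right)\right\}} and {E\left[ Y\right]=E\left\{ \theta \left( X\right)\right\}}.)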

Hence, if {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0} then the sample average of {Y} over the subjects whose {Y} is observed ({R=1}) is unbiased for

\displaystyle   \psi \equiv E\left[ Y\right] =\int_{[0,1]^d}\ \theta \left( x\right) dx.

In this case, inference is easy for the Bayesian and the frequentist and there is no issue. So we all agree that the interesting case is where there is selection bias,
that is, where {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\} \neq 0.}

2. Posterior Dependence on {\pi}

If the prior {W} on the functions {\pi(\cdot)} and {\theta(\cdot)} is such that {W(\pi,\theta)= W(\pi)W(\theta)} then the posterior does not depend on {\pi} and the posterior for {\psi} will not concentrate around the true value of {\psi}. Again, we believe we all agree on this point.

We note that no one, Bayesian or frequentist, has ever proposed using an estimator that does not depend on {\pi \left( \cdot \right)} in the selection bias case, i.e., when {Cov\left\{ \theta \left( X\right) ,\pi\left( X\right) \right\}} is non-zero. (See the addendum for more on this point.)

3. Prior Independence Versus Selection Bias

Reading Chris’ comments, the reader might get the impression that prior independence rules out selection bias, that is, that

\displaystyle W(\pi,\theta) =W(\pi)W(\theta)\ \ \ \ {\rm implies \ that\ }\ \ \   Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.

Therefore, one might conclude that if we want to discuss the interesting case where there is selection bias, then we cannot have {W(\pi,\theta) =W(\pi)W(\theta)}.

But this is incorrect. {W(\pi,\theta) =W(\pi)W(\theta)} does not imply that {Cov\left\{ \theta \left(X\right) ,\pi \left( X\right) \right\}=0.} To see this, consider the following example.

Suppose that {X} is one dimensional and a Bayesian’s prior {W} for {\left( \theta \left( \cdot \right) ,\pi \left(\cdot \right) \right)} depends only on the two parameters {\left( \alpha_{\theta },\alpha _{\pi }\right) } as follows:

\displaystyle   \theta \left( x\right) =\alpha _{\theta }x,\ \ \   \pi \left( x\right) =\alpha _{\pi}x+1/10 \ \ \   \text{with}\ \alpha _{\theta }\text{ and }\alpha _{\pi }\text{ a\ priori\ independent, }

where {\alpha _{\theta }} is uniform on {\left( 0,1\right)} and {\alpha_{\pi }} is uniform on {\left( 0,9/10\right)}. Then, clearly, {\theta\left( \cdot \right)} and {\pi \left( \cdot \right)} are independent under {W}. However, recalling that {X} is uniform so that {p\left( x\right) \equiv 1}, we have that for any fixed {\left( \alpha _{\theta },\alpha _{\pi }\right)},

\displaystyle   Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\}  =\int_{0}^{1}\theta \left( x\right) \pi \left( x\right) dx-\int_{0}^{1}\ \pi  \left( x\right) dx\int_{0}^{1}\theta \left( x\right) dx \\  =\alpha _{\theta }\alpha _{\pi }\left( \int_{0}^{1}x^{2}dx-\left\{  \int_{0}^{1}xdx\right\} ^{2}\right) =\alpha _{\theta }\alpha _{\pi }/12 .

Hence

\displaystyle   W({\rm there\ exists\ selection\ bias})=W\Biggl( Cov\left\{ \theta \left( X\right) ,\pi \left( X\right) \right\} >0 \Biggr) =1

since {\alpha _{\theta }} and {\alpha_{\pi }} are both positive with {W-}probability 1.

4. Other Justifications For Prior Dependence?

Since prior independence of {\pi} and {\theta} does not imply “no selection bias,” one might instead argue that it is practically unrealistic to have {W(\theta,\pi)=W(\theta)W(\pi)}. But we now show that it is realistic.

Suppose a new HMO needs to estimate the fraction {\psi} of its patient population that will have an MI {(Y)} in the next year, so as to determine the number of cardiac unit beds needed. Each HMO member has had 300 potential risk factors {X=(X_{1},...,X_{300})} measured: age, weight, height, blood pressure, multiple tests of liver, renal, pulmonary, and cardiac function, good and bad cholesterol, packs per day smoked, years smoked, etc. (We will get to 100,000 once routine genomic testing becomes feasible.) A general epidemiologist had earlier studied risk factors for MI by following 5000 of the 50,000 HMO members for a year. Because MI is a rare event, he oversampled subjects whose {X}, in his opinion, indicated a smaller probability {\theta \left( X\right)} of an MI ({Y=1}). Hence the sampling fraction {\pi \left( X\right) =P\left( R=1|X\right)} was a known, but complex, function chosen so as to try to make {\theta \left( X\right)} and {\pi \left( X\right)} negatively correlated.

The world’s leading heart expert, our Bayesian, was hired to estimate {\psi =\int \theta \left( x\right) p\left( x\right) dx} based on the distribution {p\left( x\right)} of {X} in HMO members and the data {\left(X_{i},R_{i},R_{i}Y_{i}\right), i=1,\ldots,5000} from the study. As the world’s expert, his beliefs about the risk function {\theta \left( \cdot \right)} would not change upon learning {\pi \left( \cdot \right)}, as {\pi \left( \cdot \right)} only reflects a nonexpert’s beliefs. Hence {\theta \left( \cdot \right)} and {\pi \left( \cdot \right)} are a priori independent. Nonetheless, knowing that the epidemiologist had carefully read the expert literature on risk factors for MI, he also believes with high probability that the epidemiologist succeeded in making the random variables {\theta \left( X\right)} and {\pi \left( X\right)} negatively correlated.

What’s more, Robins and Ritov (1997) showed that, if before seeing the data, any Bayesian, cardiac expert or not, thoroughly queries the epidemiologist (who selected {\pi \left( \cdot \right)}) about the epidemiologist’s reasoned opinions concerning {\theta (\cdot )} (but not about {\pi (\cdot )}), the Bayesian will then have independent priors. The idea is that once you are satisfied that you have learned from the epidemiologist all he knows about {\theta (\cdot )} that you did not, you will have an updated prior for {\theta \left( \cdot \right)}. Your (now updated) prior for {\theta \left( \cdot \right)} cannot then change if you are subsequently told {\pi \left( \cdot \right)}. Hence, we could take as many Bayesians as you please and arrange it so that all had {\theta \left( \cdot \right)} and {\pi \left( \cdot \right)} a priori independent. This last argument is quite general, applying to many settings.

5. Alternative Interpretation

An alternative reading of Chris’s third response and his subsequent post is that, rather than placing a joint prior {W} over the functions
{\theta\left( \cdot \right) =\left\{ \theta \left( x\right) ;x\in \left[ 0,1\right]^{d}\right\} } and {\pi \left( \cdot \right) =\left\{ \pi \left( x\right);x\in \left[ 0,1\right] ^{d}\right\}} as above,
his prior is placed over the joint distribution of the random variables {\theta \left( X\right) } and {\pi \left( X\right)}.
If so, he is then correct that making {\theta \left(X\right) } and {\pi \left( X\right) } independent with prior probability one
also implies {Cov\left\{ \theta \left( X\ \right) ,\pi \left( X\ \right)\right\} =0} and thus no selection bias.

However, it appears that from this, he concludes that selection bias, in itself, licenses the dependence of his posterior on {\pi \left( \cdot \right)}.
This is incorrect. As noted above, it is prior dependence of {\theta \left( \cdot \right) } and {\pi\left( \cdot \right) } that licenses posterior dependence on {\pi \left(\cdot \right)} – not prior dependence of {\theta \left( X\right) } and
{\pi\left( X\right)}. Were he correct, our Bayesian cardiac expert’s prior on {\theta \left( \cdot \right)} could have changed upon learning the epidemiologist’s {\pi \left( \cdot \right)}.

6. What If We Do Use a Prior That Depends on {\pi}?

In the above scenario, {W(\theta)} should not depend on {\pi}. But suppose, for whatever reason, one insists on letting {W(\theta)} depend on {\pi}.

That still does not mean the posterior will concentrate. Having an estimator that depends on {\pi} is necessary, but not sufficient, to get consistency and fast rates. It is not enough to use a prior {W(\theta)} that is a function of {\pi}. The prior still has to be carefully engineered to ensure that the posterior for {\psi} will concentrate around the truth.

Chris hints that he can construct such a prior but provides neither an explicit algorithm nor an argument as to why the estimator would be expected to be locally semiparametric efficient. However, it is simple to construct an {n^{1/2}}-consistent, locally semiparametric efficient Bayes estimator {\hat{\psi}_{Bayes}} as follows.

We tentatively model {\theta(x) =P(Y=1|X=x)} as a finite dimensional parametric function {b\left( x;\eta_{1},\ldots ,\eta_{k},\omega \right) }
with either a smooth or noninformative prior on the parameters {\left( \eta_{1},\ldots ,\eta_{k},\omega \right)}, where we take

\displaystyle  b\left( x;\eta_{1},\ldots ,\eta _{k},\omega \right) = \mathrm{expit}\left( \sum_{m=1}^{k}\eta_{m}\phi _{m}(x)+\frac{\omega }{\pi (x)}\right) ,

{\mathrm{expit}(a)=e^{a}/(1+e^{a})}, and the {\phi _{m}\left( x\right)} are basis functions. Then the posterior mean
{\hat{\psi}_{\rm Bayes}} of {\psi =\int \theta \left( x\right) dx} will have the same asymptotic distribution as the locally semiparametric efficient regression estimator of Scharfstein et al. (1999) described in our original post. Note that the estimator is {n^{1/2}} consistent, even if the model {\theta(x) =P(Y=1|X=x)=b\left( x;\eta_{1},\ldots ,\eta_{k},\omega \right)} is wrong.

Of course, this estimator is a clear case of frequentist pursuit Bayes.

7. Conclusion

Here are the main points:

  1. If {W(\theta,\pi) = W(\theta)W(\pi)} then the posterior will not concentrate.
    Thus, if a Bayesian wants the posterior for {\psi} to concentrate around the true value,
    he must justify having a prior {W(\theta)} that is a function of {\pi}.

  2. {W(\theta,\pi) = W(\theta)W(\pi)} does not imply an absence of selection bias.
    Therefore, an argument of the form: “we want selection bias so we cannot have prior independence” fails.

  3. One can try to argue that prior independence is unrealistic. But as we have shown, this is not the case.
  4. But, if after all this, we do insist on letting {W(\theta)} depend on {\pi},
    it is still not enough. Dependence on {\pi} is necessary but not sufficient.

We conclude Bayes fails in our example unless one uses a special prior designed just to mimic the frequentist estimator.

8. Addendum: What Happens If the Estimator Does Not Depend on {\pi}?

The theorem of Robins and Ritov, quoted in our initial post, says that no uniformly consistent estimator that does not depend on {\pi  \left( \cdot \right) } can exist in the model {\mathcal{P}} which contains all measurable {\pi \left( x\right) } and {\theta \left( x\right) } subject
to {\pi \left( X\right) >\delta >0} with probability 1. Take {\delta =1/8} for
concreteness. In fact, even when we assume {\theta \left( \cdot \right) }
and {\pi \left( \cdot \right) } are quite smooth, there will be little
improvement in performance.

Given that {X} has 100,000 dimensions, we can ask how many derivatives {\beta_{\theta}} and {\beta_{\pi}} the functions {\theta \left( \cdot \right)} and {\pi \left( \cdot \right)} must have so that it is possible to construct an estimator of {\psi}, not depending on {\pi \left( \cdot \right)}, that converges at rate {n^{-\frac{1}{2}}} uniformly to {\psi} over a submodel {\mathcal{P}_{smooth}}. Robins et al. (2008) show that it is necessary and sufficient that {\beta_{\theta}+\beta_{\pi}\geq 50,000} and provide an explicit estimator. More generally, if {\beta_{\theta}+\beta_{\pi}=s} with {0<s<50,000}, the optimal rate is {n^{-\frac{s/50,000}{1+s/50,000}}}, which is approximately {n^{-s/50,000}} when {s} is small compared to {50,000}. An explicit estimator is constructed in Robins et al. (2008); Robins et al. (2009) prove that the rate cannot be improved on. Given these asymptotic results, we doubt any reader can exhibit an estimator, not depending on {\pi \left(\cdot \right)}, that will have reasonable finite sample performance under model {\mathcal{P}}, or even under {\mathcal{P}_{smooth}} with, say, {s=25,000} and a sample size of 5,000. By reasonable finite sample performance, we mean an interval estimator that covers the true {\psi} at least 95% of the time and has average length less than or equal to that of interval estimators centered on the {n^{1/2}}-consistent improved Horvitz-Thompson estimators. Nonetheless, we await any candidate estimators, accompanied by at least some simulation evidence backing up such a claim.

9. References

  1. Robins, J.M., Tchetgen, E., Li, L. and van der Vaart, A. (2009). Semiparametric minimax rates. Electronic Journal of Statistics, 3, 1305-1321.

  2. Robins, J.M., Li, L., Tchetgen, E. and van der Vaart, A. (2008). Higher order influence functions and minimax estimation of nonlinear functionals. Probability and Statistics: Essays in Honor of David A. Freedman, 2, 335-421.

  3. Robins, J.M. and Ritov, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine, 16, 285-319.