## Screening: Everything Old is New Again

Screening is one of the oldest methods for variable selection. It refers to doing a bunch of marginal (single covariate) regressions instead of one multiple regression. When I was in school, we were taught that it was a bad thing to do.

Now, screening is back in fashion. It’s a whole industry. And before I throw stones, let me admit my own guilt: see Wasserman and Roeder (2009).

1. What Is It?

Suppose that the data are ${(X_1,Y_1),\ldots, (X_n,Y_n)}$ with

$\displaystyle Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_d X_{id} + \epsilon_i.$

To simplify matters, assume that ${\beta_0=0}$, ${\mathbb{E}(X_{ij})=0}$ and ${{\rm Var}(X_{ij})=1}$. Let us assume that we are in the high dimensional case where ${n < d}$. To perform variable selection, we might use something like the lasso.

But if we use screening, we instead do the following. We regress ${Y}$ on ${X_1}$, then we regress ${Y}$ on ${X_2}$, then we regress ${Y}$ on ${X_3}$. In other words, we do ${d}$ one-dimensional regressions. Denote the regression coefficients by ${\hat\alpha_1,\hat\alpha_2,\ldots}$. We keep the covariates associated with the largest values of ${|\hat\alpha_j|}$. We then might do a second step such as running the lasso on the covariates that we kept.
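As a rough sketch of the first step, the ${d}$ marginal regressions can be computed in one pass. The function name `marginal_screen`, the cutoff `k`, and the simulated data are all illustrative assumptions, not anything prescribed by a particular paper; the second step (e.g. the lasso on the kept columns) is left out.

```python
import numpy as np

def marginal_screen(X, y, k):
    """Return the indices of the k covariates with the largest |alpha_hat_j|,
    together with all d marginal regression slopes (hypothetical helper)."""
    # Center the data so each one-dimensional regression needs no intercept.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Slope of the regression of y on column j alone: <X_j, y> / <X_j, X_j>.
    alpha_hat = (Xc.T @ yc) / np.sum(Xc**2, axis=0)
    # Keep the k covariates with the largest absolute marginal slope.
    keep = np.argsort(-np.abs(alpha_hat))[:k]
    return keep, alpha_hat

# Simulated high dimensional example (n < d) with three nonzero coefficients.
rng = np.random.default_rng(0)
n, d = 50, 200
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[:3] = [5.0, -4.0, 3.0]
y = X @ beta + rng.standard_normal(n)

keep, alpha_hat = marginal_screen(X, y, k=20)
print(sorted(keep))
```

The choice of `k` is exactly the kind of tuning decision that screening papers argue about; here it is just a placeholder.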

What are we actually estimating when we regress ${Y}$ on the ${j^{\rm th}}$ covariate? It is easy to see that

$\displaystyle \mathbb{E}(\hat\alpha_j) = \alpha_j$

where

$\displaystyle \alpha_j = \beta_j + \sum_{s\neq j} \beta_s \rho_{sj}$

and ${\rho_{sj}}$ is the correlation between ${X_j}$ and ${X_s}$.
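To spell out the step, work at the population level and assume in addition that the noise has mean zero and is uncorrelated with the covariates, ${\mathbb{E}(X_{ij}\epsilon_i)=0}$ (an assumption not stated explicitly above). Since ${X_{ij}}$ has mean 0 and variance 1, the population slope of the regression of ${Y}$ on the ${j^{\rm th}}$ covariate is

$\displaystyle \alpha_j = \frac{{\rm Cov}(X_{ij},Y_i)}{{\rm Var}(X_{ij})} = \mathbb{E}(X_{ij}Y_i) = \sum_{s=1}^d \beta_s\, \mathbb{E}(X_{ij}X_{is}) = \beta_j + \sum_{s\neq j} \beta_s \rho_{sj}.$

In matrix form this says ${\alpha = R\beta}$, where ${R}$ denotes the correlation matrix of the covariates (the symbol ${R}$ is introduced here only for this remark).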

2. Arguments in Favor of Screening

If you miss an important variable during the screening phase you are in trouble. This will happen if ${|\beta_j|}$ is big but ${|\alpha_j|}$ is small. Can this happen?

Sure. You can certainly find values of the ${\beta_j}$’s and the ${\rho_{sj}}$’s to make ${|\beta_j|}$ big and make ${|\alpha_j|}$ small. In fact, you can make ${|\beta_j|}$ huge while making ${\alpha_j=0}$. This is sometimes called unfaithfulness in the literature on graphical models.
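Here is a minimal numerical illustration of such a cancellation with ${d=2}$, using the formula for ${\alpha_j}$ given earlier. The values ${\rho=0.5}$ and ${\beta_1=10}$, and the helper name `marginal_coefficients`, are arbitrary choices made for this sketch.

```python
import numpy as np

def marginal_coefficients(beta, corr):
    """Population marginal coefficients for unit-variance covariates:
    alpha_j = beta_j + sum_{s != j} beta_s * rho_sj, i.e. alpha = corr @ beta."""
    return corr @ beta

rho = 0.5
corr = np.array([[1.0, rho],
                 [rho, 1.0]])  # correlation matrix of (X_1, X_2)

# Choose beta_2 = -beta_1 / rho so that the two terms in alpha_1 cancel.
beta = np.array([10.0, -10.0 / rho])

alpha = marginal_coefficients(beta, corr)
print("beta: ", beta)
print("alpha:", alpha)
```

Here ${\beta_1 = 10}$ but ${\alpha_1 = 10 + 0.5\cdot(-20) = 0}$, so a screening step based on ${|\alpha_1|}$ would discard the first covariate no matter how large the sample.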

However, the set of ${\beta}$ vectors that are unfaithful has Lebesgue measure 0. Thus, in some sense, unfaithfulness is “unlikely” and so screening is safe.

3. Arguments Against Screening

Not so fast. In order to screw up, it is not necessary to have exact unfaithfulness. All we need is approximate unfaithfulness. And the set of approximately unfaithful ${\beta}$’s is a non-trivial subset of ${\mathbb{R}^d}$.
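One crude way to see that this set has positive volume is a small Monte Carlo check with ${d=2}$. Everything specific here is an assumption made for illustration: ${\rho = 0.9}$, ${\beta}$ drawn uniformly from ${[-1,1]^2}$, and the thresholds that define "big ${|\beta_1|}$" and "small ${|\alpha_1|}$".

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.9
corr = np.array([[1.0, rho],
                 [rho, 1.0]])

# Draw beta uniformly from the square [-1, 1]^2 (an arbitrary choice).
betas = rng.uniform(-1.0, 1.0, size=(100_000, 2))
# Row i of alphas is corr @ betas[i]; this works because corr is symmetric.
alphas = betas @ corr

# "Nearly unfaithful" for covariate 1: sizeable beta_1 but tiny alpha_1.
# Both thresholds are illustrative, not canonical.
nearly_unfaithful = (np.abs(betas[:, 0]) >= 0.5) & (np.abs(alphas[:, 0]) <= 0.1)
fraction = nearly_unfaithful.mean()
print(f"fraction of nearly unfaithful draws: {fraction:.3f}")
```

Under these particular choices the fraction is a few percent rather than zero, which is the point: exact cancellation has measure 0, but a band around it does not.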

But it’s worse than that. Cautious statisticians want procedures that have properties that hold uniformly over the parameter space. Screening cannot be successful in any uniform sense because of the unfaithful (and nearly unfaithful) distributions.

And if we admit that the linear model is surely wrong, then things get even worse.

4. Conclusion

Screening is appealing because it is fast, easy and scalable. But it makes a strong (and unverifiable) assumption that you are not unlucky and have not encountered a case where ${\alpha_j}$ is small but ${\beta_j}$ is big.

Sometimes I find the arguments in favor of screening to be appealing but when I’m in a more skeptical (sane?) frame of mind, I find screening to be quite unreasonable.

What do you think?

Wasserman, L. and Roeder, K. (2009). High dimensional variable selection. Annals of Statistics, 37, 2178.

1. Christian Hennig
Posted September 22, 2012 at 7:49 pm | Permalink

Screening is probably good if all or most of the correlations are about zero. On one hand that’s an assumption that one would be happy to live without. On the other hand, for the typical d large n small setups discussed so much these days, there is no way around making untestable assumptions like this to replace the data that one should have but hasn’t, so…
It’s basically a sparsity assumption, like all (?) the others.

2. Ricardo Silva
Posted September 23, 2012 at 7:45 am | Permalink

I’m not sure I understood what exactly the role of faithfulness is here. One can certainly have a regression model Y ~ X1 where X1 drops out because Y and X1 are marginally independent, and a model Y ~ {X1, X2} where both variables are kept because X1 and Y are dependent given X2. Probably I’m missing some assumptions there that are used on top of faithfulness (i.e., faithful to what, to begin with?).

Sometimes I found some sparse regression problem setups ill-defined, because it is not clear to me what a “relevant” variable is, since it might depend on what it is that we are conditioning on. Mathematically one can make it well-defined of course, but reading some applied papers I have the impression that practitioners don’t know exactly what they want.

• Posted September 23, 2012 at 9:12 am | Permalink

I should probably say “it is like unfaithfulness.”
I am calling a cancellation of the parameters
that gives zero marginal correlation, an “unfaithfulness.”

LW

3. Anon
Posted September 24, 2012 at 12:08 pm | Permalink

Since there seems to be no clear way to find a good $|\hat{\alpha}|$, how about a two stage screening: Stage 1: examine correlations between predictors – throw out highly correlated variables (just one variable per pair of course) and keep track of the correlations; Stage 2: Do the regular screening as described above. Obviously the complexity will go way up with Stage 1, but $d$ should be much smaller for stage two and you can ensure that the choice of $|\hat{\alpha}|$ is reasonable with respect to the distribution of $\rho_{sj}$.

4. Anon
Posted September 26, 2012 at 5:01 am | Permalink

Screening with marginal correlations is fine as long as the correlations among the predictors are small. If not, then some decorrelation might help. See for example Allen and Tibshirani 2012 ( http://dx.doi.org/10.1111/j.1467-9868.2011.01027.x ) and Zuber and Strimmer 2011 (http://dx.doi.org/10.2202/1544-6115.1730 ).

5. Keith O'Rourke
Posted September 27, 2012 at 11:25 am | Permalink

I complained about it with grad students even when they did it as a warm-up exercise prior to doing some multivariate analysis: what does it do for you, given the risk of being misled (Simpsoned or Pearsoned, if causality is of even intermediate interest)?

Though, admittedly it does initialize some learning about the multivariate distribution of Y and X.