Screening: Everything Old Is New Again

Screening is one of the oldest methods for variable selection. It refers to doing a bunch of marginal (single covariate) regressions instead of one multiple regression. When I was in school, we were taught that it was a bad thing to do.

Now, screening is back in fashion. It’s a whole industry. And before I throw stones, let me admit my own guilt: see Wasserman and Roeder (2009).

**1. What Is It?**

Suppose that the data are $(X_1, Y_1), \ldots, (X_n, Y_n)$ with

$$
Y_i = \beta_0 + \sum_{j=1}^d \beta_j X_i(j) + \epsilon_i.
$$

To simplify matters, assume that $\beta_0 = 0$, $\mathbb{E}(X_i(j)) = 0$ and $\mathrm{Var}(X_i(j)) = 1$. Let us assume that we are in the high-dimensional case where $d > n$. To perform variable selection, we might use something like the lasso.

But if we use screening, we instead do the following. We regress $Y$ on $X(1)$, then we regress $Y$ on $X(2)$, and so on, until we regress $Y$ on $X(d)$. In other words, we do $d$ one-dimensional regressions. Denote the regression coefficients by $\hat{\alpha}_1, \ldots, \hat{\alpha}_d$. We keep the covariates associated with the $k$ largest values of $|\hat{\alpha}_j|$. We then might do a second step such as running the lasso on the covariates that we kept.
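
In code, the two-step procedure looks something like this (a minimal numpy sketch; the sizes $n$, $d$, $k$ are my own choices, and I use plain least squares for the second stage where one would typically use the lasso):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 500, 10             # high-dimensional: d > n
X = rng.standard_normal((n, d))    # mean 0, variance 1, as assumed above
beta = np.zeros(d)
beta[:5] = 3.0                     # five truly relevant covariates
y = X @ beta + rng.standard_normal(n)

# Step 1: d one-dimensional regressions of Y on each X(j).
alpha_hat = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)

# Keep the covariates with the k largest values of |alpha_hat_j|.
keep = np.argsort(np.abs(alpha_hat))[-k:]

# Step 2: a multiple regression on the survivors (a lasso would go here;
# plain least squares keeps the sketch dependency-free).
beta_hat, *_ = np.linalg.lstsq(X[:, keep], y, rcond=None)
```

Note that Step 1 costs only $O(nd)$, which is the whole appeal: it scales to values of $d$ where a full multiple regression is awkward or impossible.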

What are we actually estimating when we regress $Y$ on the $j$-th covariate? It is easy to see that $\hat{\alpha}_j$ estimates

$$
\alpha_j = \beta_j + \sum_{s \neq j} \beta_s \rho_{sj}
$$

where $\rho_{sj}$ is the correlation between $X(s)$ and $X(j)$.
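
The claim that the marginal coefficient estimates $\alpha_j = \beta_j + \sum_{s \neq j} \beta_s \rho_{sj}$ is easy to check by simulation (the correlation matrix and coefficients below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
# An arbitrary (positive-definite) correlation matrix for three covariates.
R = np.array([[1.0, 0.6, 0.2],
              [0.6, 1.0, 0.3],
              [0.2, 0.3, 1.0]])
beta = np.array([1.0, -2.0, 0.5])

X = rng.multivariate_normal(np.zeros(3), R, size=n)
y = X @ beta + rng.standard_normal(n)

# Marginal (one-covariate) regression coefficients.
alpha_hat = (X * y[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)

# Population values: alpha_j = beta_j + sum_{s != j} beta_s * rho_sj,
# which is just R @ beta when the covariates have unit variance.
alpha = R @ beta
```

With $n$ this large, `alpha_hat` sits on top of `R @ beta`, not on top of `beta`.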

**2. Arguments in Favor of Screening**

If you miss an important variable during the screening phase, you are in trouble. This will happen if $|\beta_j|$ is big but $|\alpha_j|$ is small. Can this happen?

Sure. You can certainly find values of the $\beta_s$'s and the $\rho_{sj}$'s to make $\beta_j$ big and make $\alpha_j$ small. In fact, you can make $\beta_j$ huge while making $\alpha_j = 0$. This is sometimes called *unfaithfulness* in the literature on graphical models.
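
Here is a concrete cancellation of that kind (the numbers are mine): with two covariates correlated at $\rho = 0.5$ and $\beta = (10, -20)$, we get $\beta_1 + \rho \beta_2 = 0$, so $X(1)$ has a huge coefficient yet zero marginal correlation with $Y$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, rho = 1_000_000, 0.5
beta = np.array([10.0, -20.0])          # beta_1 is huge ...
assert beta[0] + rho * beta[1] == 0.0   # ... yet alpha_1 = 0 exactly

# Two unit-variance covariates with correlation rho.
cov = np.array([[1.0, rho], [rho, 1.0]])
X = rng.multivariate_normal(np.zeros(2), cov, size=n)
y = X @ beta + rng.standard_normal(n)

# Marginal regression of Y on X(1): the cancellation drives it to ~0,
# so screening would discard X(1) despite beta_1 = 10.
alpha1_hat = (X[:, 0] @ y) / (X[:, 0] @ X[:, 0])
```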

However, the set of parameter vectors $(\beta, \rho)$ that are unfaithful has Lebesgue measure 0. Thus, in some sense, unfaithfulness is “unlikely” and so screening is safe.

**3. Arguments Against Screening**

Not so fast. In order to screw up, it is not necessary to have exact unfaithfulness. All we need is approximate unfaithfulness. And the set of approximately unfaithful $\beta$'s is a non-trivial subset of $\mathbb{R}^d$.

But it’s worse than that. Cautious statisticians want procedures that have properties that hold uniformly over the parameter space. Screening cannot be successful in any uniform sense because of the unfaithful (and nearly unfaithful) distributions.

And if we admit that the linear model is surely wrong, then things get even worse.

**4. Conclusion**

Screening is appealing because it is fast, easy and scalable. But it makes a strong (and unverifiable) assumption that you are not unlucky and have not encountered a case where $|\alpha_j|$ is small but $|\beta_j|$ is big.

Sometimes I find the arguments in favor of screening to be appealing but when I’m in a more skeptical (sane?) frame of mind, I find screening to be quite unreasonable.

What do you think?

Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. *Annals of Statistics*, 37, 2178–2201.

## 6 Comments

Screening is probably good if all or most of the correlations are about zero. On one hand, that’s an assumption that one would be happy to live without. On the other hand, for the typical $d$-large, $n$-small setups discussed so much these days, there is no way around making untestable assumptions like this to replace the data that one should have but hasn’t, so…

It’s basically a sparsity assumption, like all (?) the others.

I’m not sure I understand exactly what the role of faithfulness is here. One can certainly have a regression model Y ~ X1 where X1 drops out because Y and X1 are marginally independent, and a model Y ~ {X1, X2} where both variables are kept because X1 and Y are dependent given X2. Probably I’m missing some assumptions there that are used on top of faithfulness (i.e., faithful to what, to begin with?).

Sometimes I find some sparse regression problem setups ill-defined, because it is not clear to me what a “relevant” variable is, since it might depend on what it is that we are conditioning on. Mathematically one can make it well-defined of course, but reading some applied papers I have the impression that practitioners don’t know exactly what they want.

I should probably say “it is like unfaithfulness.”

I am calling a cancellation of the parameters,

$$
\beta_j + \sum_{s \neq j} \beta_s \rho_{sj} = 0,
$$

that gives zero marginal correlation, an “unfaithfulness.”

LW

Since there seems to be no clear way to find a good $|\hat{\alpha}|$, how about a two-stage screening: Stage 1: examine correlations between predictors – throw out highly correlated variables (just one variable per pair of course) and keep track of the correlations; Stage 2: do the regular screening as described above. Obviously the complexity will go way up with Stage 1, but $d$ should be much smaller for Stage 2 and you can ensure that the choice of $|\hat{\alpha}|$ is reasonable with respect to the distribution of $\rho_{sj}$.
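
If I understand the proposal, it might be sketched like this (the function name, the 0.9 cutoff, and the greedy drop-one-per-pair rule are all my guesses at what is intended):

```python
import numpy as np

def two_stage_screen(X, y, k, corr_cutoff=0.9):
    """Stage 1: greedily drop one variable from each highly correlated
    pair. Stage 2: ordinary marginal screening on the survivors."""
    d = X.shape[1]
    R = np.corrcoef(X, rowvar=False)
    keep = np.ones(d, dtype=bool)
    for i in range(d):
        if not keep[i]:
            continue
        for j in range(i + 1, d):
            if keep[j] and abs(R[i, j]) > corr_cutoff:
                keep[j] = False                  # one variable per pair
    survivors = np.flatnonzero(keep)
    Xs = X[:, survivors]
    alpha_hat = (Xs * y[:, None]).sum(axis=0) / (Xs ** 2).sum(axis=0)
    return survivors[np.argsort(np.abs(alpha_hat))[-k:]]

# A tiny demo: column 1 duplicates column 0, so Stage 1 drops one of them.
rng = np.random.default_rng(3)
X = rng.standard_normal((500, 20))
X[:, 1] = X[:, 0]
y = 3.0 * X[:, 0] + rng.standard_normal(500)
top = two_stage_screen(X, y, k=3)
```

Note this does not address unfaithfulness itself: a cancellation involving modestly correlated predictors would survive Stage 1 untouched.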

Screening with marginal correlations is fine as long as the correlations among the predictors are small. If not, then some decorrelation might help. See for example Allen and Tibshirani 2012 ( http://dx.doi.org/10.1111/j.1467-9868.2011.01027.x ) and Zuber and Strimmer 2011 (http://dx.doi.org/10.2202/1544-6115.1730 ).

I even complained about it to grad students when they did it as a warm-up exercise prior to doing some multivariate analysis – what does it do for you, given the risk of being misled (Simpsoned or Pearsoned, if causality is of even intermediate interest)?

Though, admittedly, it does initialize some learning about the multivariate distribution of $Y$ and $X$.