Restricted Isometry Property, Rest In Peace
The restricted isometry property RIP is a critical property in the area of compresses sensing. (See Candes and Tao 2005, Candes, Romberg, and Tao 2006, Donoho 2006). But the RIP assumption has leaked into regression. In this post, I’ll argue that RIP is a theoretical distraction for research in regression.
1. The Birth of RIP
The typical linear regression problem goes like this. We observe where
and has mean 0. Let be the dimension of the covariate .
First suppose that . Recall that the least squares estimator minimizes
Equivalently, it minimizes
where and is the design matrix where each row consists of one .
If (and if some conditions hold), then is invertible, and the least squares estimator is is a consistent estimator of . It is also minimax. And if we appropriately threshold , it is sparsistent (a term invented by Pradeep Ravikumar) which means that
as , where .
When , is not invertible. In this case, the now standard estimator is the lasso estimator, (Tibshirani 1996) which is the value that minimizes
Is consistent or sparsistent? Yes, if we assume the restricted isometry property. Let and let denote the number of nonzero coordinates of . We say that satisfies the RIP, if
for every vector for which .
There are various other conditions besides RIP. Van De Geer and Buhlmann (2009) discuss the relationships between these conditions. But they are all very similar and they all roughly say the same thing: the covariates are nearly independent. Zhao and Yu (2007) and Meinshausen and Yu (2009) show that such conditions are necessary to estimate . This is not surprising. If two covariates are highly correlated, you will never be able to distinguish them.
The important point is, RIP is a very strong assumption.
2. Compressed Sensing Versus Regression
But is RIP a reasonable assumption? It depends on the context. In regression problems, no. In compressed sensing, yes.
In a regression problem, nature hands you random data . For example, is a vector of gene expression levels on subject and is some disease outcome. As anyone who analyzes real data can attest, RIP will not come close to holding. Highly correlated covariates (collinearity) are the rule, not the exception. It makes no sense to assume something like RIP. For that matter, we should not assume that the linear model is correct. That would be crazy. So does not even have a meaning anyway.
Compressed sensing is different. In these applications, the regression refers to data coming from some device. We actually design and build the device in such a way that the linear assumption is true. Moreover, part if the device design is that we get to make up the ‘s (not nature). In particular, we can choose the design matrix. Thus, we can force the RIP to hold. In this engineering applications, linearity and RIP are not just convenient mathematical assumptions. They are design features. They are real.
3. The Lasso Without RIP
We can, however, still make sense of the lasso thanks to a result by Juditsky and Nemirovski (2000) which was later extended by Greenshtein and Ritov (2004). To avoid technical complications, let us assume all the random variables are bounded. Thus, and . We do not assume that is linear. Nonetheless, we can still define the best, sparse linear predictor as follows. For any , define the predictive risk
where is a new pair.
Let minimize subject to . The lasso estimator minimizes subject to where
Then the lasso succeeds, in the sense that
That is, with high probability, the lasso predicts almost as well as the best sparse, linear predictor.
Digression: Note that this is an oracle inequality (or a regret bound). That is, we restrict the set of procedures (sparse linear predictors) but put no restriction on the distribution. A minimax bound here would be useless. The class of all distributions is too large. This is a common situation. You either restrict the class of procedures and seek a regret bound or restrict the class of distributions and get a minimax bound. End Digression.
We get all of this without assuming RIP. We don’t even assume the true regression function is linear. In fact, the result is distribution free. We are basically assuming nothing.
My concern is that there are many statistics and machine learning papers discussing regression (not compressed sensing) that make the RIP assumptions. (For the record, I am guilty.) Similar assumptions pop in other problems such as estimating undirected graphs.
I feel that we (as a field) have gone off in a bad theoretical direction. In regression research we should let the RIP rest in peace.
E. J. Candes and T. Tao. (2005). Decoding by Linear Programming, IEEE Trans. Inf. Th., 51, 4203-4215.
Candes, E.,Romberg, J. and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59, 1207-1223.
Donoho, D. L. (2006). Compressed Sensing. IEEE Transactions on Information Theory, 52, 1289-1306.
Greenshtein, E. and Ritov, Y.A. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.
Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist., 28, 681-712.
Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., 37, 246-270.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58, 267-288.
Van De Geer, S.A. and Buhlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3, 1360–1392.
Zhao, P. and Yu, B. (2007). On model selection consistency of Lasso. Journal of Machine Learning Research, 7 2541.