RIP RIP (Restricted Isometry Property, Rest In Peace)

Larry Wasserman

The restricted isometry property (RIP) is a critical property in the area of compressed sensing (see Candes and Tao 2005; Candes, Romberg and Tao 2006; Donoho 2006). But the RIP assumption has leaked into regression. In this post, I’ll argue that RIP is a theoretical distraction for research in regression.

1. The Birth of RIP

The typical linear regression problem goes like this. We observe {(X_1,Y_1),\ldots, (X_n,Y_n)} where

\displaystyle  Y_i = \beta^T X_i + \epsilon_i

and {\epsilon_i} has mean 0. Let {d} be the dimension of the covariate {X_i=(X_{i1},\ldots, X_{id})}.

First suppose that {d<n}. Recall that the least squares estimator {\hat\beta} minimizes

\displaystyle  \sum_i (Y_i - \beta^T X_i)^2.

Equivalently, it minimizes

\displaystyle  ||\mathbb{Y} - \mathbb{X}\beta||^2

where {\mathbb{Y} =(Y_1,\ldots,Y_n)} and {\mathbb{X}} is the {n\times d} design matrix where each row consists of one {X_i}.

If {d < n} (and if some conditions hold), then {\mathbb{X}^T\mathbb{X}} is invertible and the least squares estimator {\hat\beta = (\mathbb{X}^T\mathbb{X})^{-1}\mathbb{X}^T\mathbb{Y}} is a consistent estimator of {\beta}. It is also minimax. And if we appropriately threshold {\hat\beta}, it is sparsistent (a term invented by Pradeep Ravikumar), which means that

\displaystyle  P({\rm support}(\hat\beta)= {\rm support}(\beta)) \rightarrow 1

as {n\rightarrow \infty}, where {{\rm support}(\beta) = \{j:\ \beta_j \neq 0\}}.
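
To make this concrete, here is a minimal numpy sketch (mine, purely illustrative; the dimensions and the threshold 0.5 are arbitrary choices, not recommendations) of least squares followed by hard thresholding of the small coefficients to estimate the support:

import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 10
beta = np.zeros(d)
beta[:3] = [2.0, -1.5, 1.0]          # true support = {0, 1, 2}

X = rng.normal(size=(n, d))          # design matrix, one row per X_i
Y = X @ beta + rng.normal(size=n)    # Y_i = beta^T X_i + eps_i

# Least squares: beta_hat = (X^T X)^{-1} X^T Y (via a solver, for stability)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Hard-threshold small coefficients to get an estimated support
tau = 0.5                            # illustrative threshold, not a principled choice
support_hat = np.flatnonzero(np.abs(beta_hat) > tau)

print("beta_hat:", np.round(beta_hat, 2))
print("estimated support:", support_hat)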

When {d>n}, {\mathbb{X}^T\mathbb{X}} is not invertible. In this case, the now standard estimator is the lasso estimator (Tibshirani 1996), which is the value {\hat\beta} that minimizes

\displaystyle  \sum_i (Y_i - \beta^T X_i)^2 + \lambda ||\beta||_1 = ||\mathbb{Y} - \mathbb{X}\beta||^2 + \lambda ||\beta||_1

where {||\beta||_1 = \sum_{j=1}^d |\beta_j|}.

Is {\hat\beta} consistent or sparsistent? Yes, if we assume the restricted isometry property. Let {s < d} and let {||\beta||_0} denote the number of nonzero coordinates of {\beta}. We say that {\mathbb{X}} satisfies the {\delta_s} RIP if

\displaystyle  (1-\delta_s) ||\beta||^2 \leq ||\mathbb{X}\beta ||^2 \leq (1+\delta_s) ||\beta||^2

for every vector {\beta} for which {||\beta||_0 \leq s}.
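
Computing {\delta_s} exactly means maximizing over all {\binom{d}{s}} supports, which is infeasible, but a crude Monte Carlo sketch (mine, not from any of the papers cited) gives the flavor: sample random {s}-sparse vectors and record the worst observed distortion. This only lower-bounds {\delta_s}, and the particular matrices below are arbitrary illustrations: an i.i.d. Gaussian matrix scaled so that {||\mathbb{X}\beta||^2 \approx ||\beta||^2} on average, versus a matrix whose columns all load on a common factor.

import numpy as np

rng = np.random.default_rng(0)

def delta_s_lower_bound(X, s, n_trials=5000):
    """Monte Carlo lower bound on the RIP constant delta_s of X:
    the worst observed | ||X b||^2 / ||b||^2 - 1 | over random s-sparse b."""
    n, d = X.shape
    worst = 0.0
    for _ in range(n_trials):
        support = rng.choice(d, size=s, replace=False)
        coeffs = rng.normal(size=s)
        ratio = np.sum((X[:, support] @ coeffs) ** 2) / np.sum(coeffs ** 2)
        worst = max(worst, abs(ratio - 1.0))
    return worst

n, d, s = 100, 300, 5

# "Designed" matrix: i.i.d. N(0, 1/n) entries, so E||X b||^2 = ||b||^2.
X_designed = rng.normal(scale=1.0 / np.sqrt(n), size=(n, d))

# "Regression-style" matrix: every column loads on a latent factor Z,
# so the columns are highly correlated; columns rescaled to unit norm.
Z = rng.normal(size=(n, 1))
X_corr = 0.9 * Z + 0.1 * rng.normal(size=(n, d))
X_corr /= np.linalg.norm(X_corr, axis=0, keepdims=True)

print("designed matrix:   delta_5 >=", round(delta_s_lower_bound(X_designed, s), 2))
print("correlated matrix: delta_5 >=", round(delta_s_lower_bound(X_corr, s), 2))

In runs of this kind the correlated design typically shows a far larger distortion than the designed one; that contrast is the point of Section 2 below.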

There are various other conditions besides RIP. Van De Geer and Buhlmann (2009) discuss the relationships between these conditions. But they are all very similar and they all roughly say the same thing: the covariates are nearly independent. Zhao and Yu (2007) and Meinshausen and Yu (2009) show that such conditions are necessary to estimate {\beta}. This is not surprising. If two covariates are highly correlated, you will never be able to distinguish them.
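
A tiny illustration of that last point (mine, not from the papers cited): if two columns of the design matrix are identical, different coefficient vectors produce exactly the same fitted values, so no amount of data can tell them apart.

import numpy as np

rng = np.random.default_rng(1)

n = 50
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])   # columns 0 and 1 are identical

beta_a = np.array([2.0, 0.0, 1.0])
beta_b = np.array([0.0, 2.0, 1.0])                # mass moved onto the duplicate column

print(np.allclose(X @ beta_a, X @ beta_b))        # True: identical predictions
print(np.linalg.matrix_rank(X.T @ X))             # 2, not 3: X^T X is singular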

The important point is, RIP is a very strong assumption.

2. Compressed Sensing Versus Regression

But is RIP a reasonable assumption? It depends on the context. In regression problems, no. In compressed sensing, yes.

In a regression problem, nature hands you random data {(X_1,Y_1),\ldots, (X_n,Y_n)}. For example, {X_i} is a vector of gene expression levels on subject {i} and {Y_i} is some disease outcome. As anyone who analyzes real data can attest, RIP will not come close to holding. Highly correlated covariates (collinearity) are the rule, not the exception. It makes no sense to assume something like RIP. For that matter, we should not assume that the linear model {Y_i = \beta^T X_i + \epsilon_i} is correct. That would be crazy. So {\beta} does not even have a well-defined meaning.

Compressed sensing is different. In these applications, the regression {Y_i = \beta^T X_i + \epsilon_i} refers to data coming from some device. We actually design and build the device in such a way that the linear assumption is true. Moreover, part of the device design is that we get to make up the {X_i}'s (not nature). In particular, we can choose the design matrix. Thus, we can force the RIP to hold. In these engineering applications, linearity and RIP are not just convenient mathematical assumptions. They are design features. They are real.

3. The Lasso Without RIP

We can, however, still make sense of the lasso, thanks to a result by Juditsky and Nemirovski (2000) which was later extended by Greenshtein and Ritov (2004). To avoid technical complications, let us assume all the random variables are bounded. Thus, {|Y_i| \leq B} and {|X_{ij}| \leq B}. We do not assume that {\mathbb{E}(Y|X=x)} is linear. Nonetheless, we can still define the best sparse linear predictor {\beta_*} as follows. For any {\beta}, define the predictive risk

\displaystyle  R(\beta) = \mathbb{E}(Y - \beta^T X)^2

where {(X,Y)} is a new pair.

Let {\beta_*} minimize {R(\beta)} subject to {||\beta||_1 \leq L}. The lasso estimator {\hat\beta} minimizes {\hat R(\beta)} subject to {||\beta||_1 \leq L} where

\displaystyle  \hat R(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2.

Then the lasso succeeds, in the sense that

\displaystyle  P(R(\hat\beta) > R(\beta_*) + \epsilon) \leq \exp\left( -\left[ \frac{n\epsilon^2}{2B^4(1+L)^4} - \log 2 - 2 \log d\right]\right).

That is, with high probability, the lasso predicts almost as well as the best sparse linear predictor.
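
Here is a rough sketch of this phenomenon (mine; the scikit-learn call, the penalty level, and the data-generating choices are all arbitrary illustrations, and I use the penalized lasso as a computational stand-in for the {||\beta||_1 \leq L} constrained problem). The true regression function is nonlinear and the covariates are highly correlated, so neither linearity nor anything like RIP holds, yet we can still compare the lasso's held-out predictive risk with that of a sparse linear benchmark:

import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)

n, d = 1000, 200

# Correlated covariates (shared latent factor): nothing like RIP holds here.
Z = rng.normal(size=(n, 1))
X = 0.7 * Z + 0.7 * rng.normal(size=(n, d))

# The true regression function is NOT linear in X.
f = np.sin(2 * X[:, 0]) + np.maximum(X[:, 1], 0.0)
Y = f + 0.5 * rng.normal(size=n)

X_tr, X_te = X[:700], X[700:]
Y_tr, Y_te = Y[:700], Y[700:]

# Penalized lasso as a proxy for L1-ball-constrained least squares
# (alpha = 0.05 is an arbitrary illustrative choice).
lasso = Lasso(alpha=0.05).fit(X_tr, Y_tr)

# Stand-in for the "best sparse linear predictor": least squares on the
# two covariates that actually drive Y (known here only by construction).
bench = LinearRegression().fit(X_tr[:, :2], Y_tr)

def risk(pred, y):
    """Empirical predictive risk: mean squared prediction error."""
    return np.mean((y - pred) ** 2)

print("held-out risk, lasso:     ", round(risk(lasso.predict(X_te), Y_te), 3))
print("held-out risk, benchmark: ", round(risk(bench.predict(X_te[:, :2]), Y_te), 3))
print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))

The comparison is only qualitative, but it mirrors the oracle inequality: what is controlled is predictive risk relative to the best sparse linear predictor, not recovery of any "true" {\beta}.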

Digression: Note that this is an oracle inequality (or a regret bound). That is, we restrict the set of procedures (sparse linear predictors) but put no restriction on the distribution. A minimax bound here would be useless. The class of all distributions is too large. This is a common situation. You either restrict the class of procedures and seek a regret bound or restrict the class of distributions and get a minimax bound. End Digression.

We get all of this without assuming RIP. We don’t even assume the true regression function is linear. In fact, the result is distribution free. We are basically assuming nothing.

4. Conclusion

My concern is that there are many statistics and machine learning papers discussing regression (not compressed sensing) that make the RIP assumption. (For the record, I am guilty.) Similar assumptions pop up in other problems, such as estimating undirected graphs.

I feel that we (as a field) have gone off in a bad theoretical direction. In regression research we should let the RIP rest in peace.

5. References

Candes, E.J. and Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51, 4203-4215.

Candes, E., Romberg, J. and Tao, T. (2006). Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math., 59, 1207-1223.

Donoho, D. L. (2006). Compressed Sensing. IEEE Transactions on Information Theory, 52, 1289-1306.

Greenshtein, E. and Ritov, Y.A. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Juditsky, A. and Nemirovski, A. (2000). Functional aggregation for nonparametric regression. Ann. Statist., 28, 681-712.

Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. Ann. Statist., 37, 246-270.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, 58, 267-288.

Van De Geer, S.A. and Buhlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3, 1360–1392.

Zhao, P. and Yu, B. (2007). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541-2563.


5 Comments

  1. Posted August 7, 2012 at 10:57 am

    Consider the minimax bound that can be obtained from the bound in the second part when $\beta$ is in the unit box, $[-1,1]^d$. We get that w.p. $1-\delta$, the leading term in the bound on the excess risk is $B^2(1+d)^2\sqrt{\log(1/\delta)/n}$. My understanding was that the excess risk, under identical conditions *and RIP*, can be bounded by a similar quantity except that the factor involving $d$ can be replaced by a polylogarithmic one.

    • Posted August 7, 2012 at 11:02 am

      You might be right.
      I wrote this in a hotel room
      and derived it on some scrap paper.
      What I wrote might be a bit off
      —LW

  2. Posted August 7, 2012 at 1:11 pm

    Larry,

    Within Compressed Sensing, RIP is not a condition for sparse recovery that is in fashion anymore. In particular, since about 2008 and the work of Donoho and Tanner [1], we have empirical proof that in general this condition is too stringent. The generic Donoho-Tanner phase transition generally shows that l_1 recovery of sparse vectors admits less sparse vectors than what RIP would require.

    In a certain sense, the “coup de grace” was given to that argument recently with a phase transition even beyond the traditional Donoho-Tanner PT (see [2]) and featured in the work of Krzakala et al [3]. All this can be easily pictured in this beautiful hand crafted drawing I made for the occasion: http://goo.gl/H3m4C :)
    The work of Krzakala et al [3] shows that by choosing the right design matrix, one can attain very interesting limits. Right now, within the community, there is an on-going discussion as to whether we ought to put some constraint on the design matrix or push further the constraint on the vector to be recovered. On the one hand, it means we need to design different hardware; on the other, it means we need to dig further into the structured sparsity world that the ML folks are investigating. An example of that discussion can be found in [4] and [5].

    But let me play the devil's advocate for RIP, at least within compressive sensing. RIP was for a long while the only mathematically rigorous reasoning at hand for people to show that their empirical findings held, even if in most of these papers the authors acknowledged that their design matrices did not fit the RIP condition. To a certain extent, I can bet you could not even get a paper out unless the first few paragraphs talked about the solid reason as to why sparse recovery was possible (thanks to RIP). We have also seen many authors coming up with different conditions paralleling RIP, so it was a useful tool for a while.

    There is also another aspect as to why RIP is interesting: it is a similar tool that is currently used to investigate low-rank recovery algorithms. Here again, I am sure that most phase transitions discovered there will show that RIP is too stringent, but again it may be the only theoretical tool around grounded in some theory.

    Cheers,

    Igor.

    [1] Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing by Jared Tanner and David L. Donoho http://www.maths.ed.ac.uk/~tanner/DoTa_Universality.pdf
    [2] Compressive Sensing: The Big Picture, https://sites.google.com/site/igorcarron2/cs#measurement
    [3] Statistical physics-based reconstruction in compressed sensing by Florent Krzakala, Marc Mézard, François Sausset, Yifan Sun, Lenka Zdeborová, http://arxiv.org/abs/1109.4424
    [4] http://nuit-blanche.blogspot.com/2012/08/intra-block-correlation-and-sparse.html
    [5] http://nuit-blanche.blogspot.com/2012/08/more-intra-block-correlation-and-sparse.html

    • Posted August 7, 2012 at 1:22 pm

      Thanks for the comment Igor.

      Perhaps I focused too much on RIP.
      My real point was that ANY conditions needed for recovery
      are unrealistic for vanilla regression problems.
      To put it another way, recovery is (usually) not an interesting problem
      in ordinary regression.
      (As a pedantic example, take two covariates to be perfectly
      correlated. You can’t separate the two but you don’t need to
      if you focus on predictive accuracy.)
      In other words, regression and compressed sensing are very
      different problems (despite the obvious overlap).
      But the difference often gets ignored in some statistics papers.

      Thanks for the discussion and references

      Larry

  3. Mario Souto
    Posted March 6, 2014 at 4:27 pm

    Larry,

    I agree with you when it comes to forecasting the output y. In [1], Candès points out that coherence can be ignored if the goal is to recover Xb and not b.
    However, in many cases regression is used to explain the output from the exogenous variables. In that case, I believe X should obey the RIP (or an equivalent condition) in order to preserve the interpretability of the variable selection.

    [1] – Candes, Emmanuel J., et al. “Compressed sensing with coherent and redundant dictionaries.” Applied and Computational Harmonic Analysis 31.1 (2011): 59-73.
