## Consistency, Sparsistency and Presistency

There are many ways to discuss the quality of estimators in statistics. Today I want to review three common notions: presistency, consistency and sparsistency. I will discuss them in the context of linear regression. (Yes, that’s presistency, not persistency.)

Suppose the data are ${(X_1,Y_1),\ldots, (X_n,Y_n)}$ where $\displaystyle Y_i = \beta^T X_i + \epsilon_i,$ ${Y_i\in\mathbb{R}}$, ${X_i\in\mathbb{R}^d}$ and ${\beta\in\mathbb{R}^d}$. Let ${\hat\beta=(\hat\beta_1,\ldots,\hat\beta_d)}$ be an estimator of ${\beta=(\beta_1,\ldots,\beta_d)}$.

Probably the most familiar notion is consistency. We say that ${\hat\beta}$ is consistent if $\displaystyle ||\hat\beta - \beta|| \stackrel{P}{\rightarrow} 0$

as ${n \rightarrow \infty}$.
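To make the definition concrete, here is a minimal simulation sketch. It assumes things the post does not specify: ${d=2}$, standard normal covariates and noise, ${\beta=(2,-1)}$, and ordinary least squares as the estimator. The helper names `simulate`, `ols_2d` and `l2_error` are made up for illustration.

```python
import math
import random

def simulate(n, beta, rng, noise_sd=1.0):
    """Draw n pairs (x, y) with x standard normal in R^2 and y = beta^T x + eps."""
    data = []
    for _ in range(n):
        x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
        y = beta[0] * x[0] + beta[1] * x[1] + rng.gauss(0.0, noise_sd)
        data.append((x, y))
    return data

def ols_2d(data):
    """Least squares for d = 2, solving the 2x2 normal equations in closed form."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for (x1, x2), y in data:
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        t1 += x1 * y
        t2 += x2 * y
    det = s11 * s22 - s12 * s12
    return ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)

def l2_error(beta_hat, beta):
    """Euclidean distance ||beta_hat - beta||."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(beta_hat, beta)))

beta = (2.0, -1.0)          # assumed true coefficients
rng = random.Random(0)      # fixed seed so the run is reproducible
errors = {n: l2_error(ols_2d(simulate(n, beta, rng)), beta)
          for n in (100, 10_000, 100_000)}
for n, err in errors.items():
    print(n, round(err, 4))
```

In this well-specified setup the printed error should shrink roughly like ${1/\sqrt{n}}$, which is what consistency looks like in a single run.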

In recent years, people have become interested in sparsistency (a term invented by Pradeep Ravikumar). Define the support of ${\beta}$ to be the location of the nonzero elements: $\displaystyle {\rm supp}(\beta) = \{j:\ \beta_j \neq 0\}.$

Then ${\hat\beta}$ is sparsistent if $\displaystyle \mathbb{P}({\rm supp}(\hat\beta) = {\rm supp}(\beta) ) \rightarrow 1$

as ${n\rightarrow\infty}$.
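Here is a small sketch of what checking support recovery looks like. It is not the lasso: it uses least squares followed by hard thresholding at ${n^{-1/4}}$, a rule I am assuming purely for illustration, with ${d=2}$, Gaussian data and ${\beta=(2,0)}$. The function names are hypothetical.

```python
import random

def support(beta, tol=0.0):
    """Indices j with |beta_j| > tol; tol = 0 gives the exact support."""
    return {j for j, b in enumerate(beta) if abs(b) > tol}

def thresholded_ols(n, beta, rng):
    """Simulate n points, fit 2-d least squares, then zero out small coefficients."""
    s11 = s12 = s22 = t1 = t2 = 0.0
    for _ in range(n):
        x1, x2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y = beta[0] * x1 + beta[1] * x2 + rng.gauss(0.0, 1.0)
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        t1 += x1 * y
        t2 += x2 * y
    det = s11 * s22 - s12 * s12
    ols = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)
    cutoff = n ** -0.25  # assumed threshold: shrinks, but more slowly than the noise
    return tuple(b if abs(b) > cutoff else 0.0 for b in ols)

def recovery_rate(n, beta, trials, rng):
    """Monte Carlo estimate of P(supp(beta_hat) = supp(beta))."""
    hits = sum(support(thresholded_ols(n, beta, rng)) == support(beta)
               for _ in range(trials))
    return hits / trials

beta = (2.0, 0.0)           # assumed sparse truth: support is {0}
rng = random.Random(1)
rates = {n: recovery_rate(n, beta, trials=50, rng=rng) for n in (10, 100, 5000)}
for n, rate in rates.items():
    print(n, rate)
```

The estimated recovery probability should climb toward 1 as ${n}$ grows. Note that the event being checked is exact equality of the two supports, not closeness of the coefficients.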

The last one is what I like to call presistence. I just invented this word. Some people call it risk consistency or predictive consistency. Greenshtein and Ritov (2004) call it persistency but this creates confusion for those of us who work with persistent homology. Of course, presistence comes from shortening “predictive consistency.”

Let ${(X,Y)}$ be a new pair. The predictive risk of ${\beta}$ is $\displaystyle R(\beta) = \mathbb{E}(Y-X^T \beta)^2.$

Let ${{\cal B}_n}$ be some set of ${\beta}$’s and let ${\beta_n^*}$ be the best ${\beta}$ in ${{\cal B}_n}$. That is, ${\beta_n^*}$ minimizes ${R(\beta)}$ subject to ${\beta \in {\cal B}_n}$. Then ${\hat\beta}$ is presistent if $\displaystyle R(\hat\beta) - R(\beta_n^*) \stackrel{P}{\rightarrow} 0.$
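It may help to expand the risk. The symbol ${\Sigma}$ below is my own notation, not something defined in the post, and the expansion only assumes that the second moments of ${(X,Y)}$ exist:

```latex
\begin{align*}
R(\beta) &= \mathbb{E}(Y - X^T\beta)^2 \\
         &= \mathbb{E}(Y^2) - 2\,\beta^T\,\mathbb{E}(XY) + \beta^T \Sigma\, \beta,
\qquad \Sigma = \mathbb{E}(X X^T).
\end{align*}
```

So ${R(\beta)}$ depends on the distribution of ${(X,Y)}$ only through these second moments. Neither the true ${\beta}$ nor ${\epsilon}$ appears anywhere in it, so the definition of presistence makes sense whether or not the linear model holds.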

This means that ${\hat\beta}$ predicts nearly as well as the best choice of ${\beta}$. As an example, consider the set of sparse vectors $\displaystyle {\cal B}_n = \Bigl\{ \beta:\ \sum_{j=1}^d |\beta_j| \leq L\Bigr\}.$

(The dimension ${d}$ is allowed to depend on ${n}$, which is why we have a subscript on ${{\cal B}_n}$.) In this case, ${\beta_n^*}$ can be interpreted as the best sparse linear predictor. The corresponding sample estimator ${\hat\beta}$, which minimizes the sum of squares subject to being in ${{\cal B}_n}$, is the lasso estimator. Greenshtein and Ritov (2004) proved that the lasso is presistent under essentially no conditions.

This is the main message of this post: To establish consistency or sparsistency, we have to make lots of assumptions. In particular, we need to assume that the linear model is correct. But we can prove presistence with virtually no assumptions. In particular, we do not have to assume that the linear model is correct.
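Here is a toy sketch of that last point, under assumptions of my own choosing: ${d=1}$, ${X\sim N(0,1)}$, ${Y = X + X^3 + \epsilon}$ with ${\epsilon\sim N(0,1)}$ (so the linear model is wrong), and ${L=5}$. In one dimension the constraint ${|\beta|\leq L}$ is an interval, so the constrained least squares fit is just the unconstrained fit clipped to ${[-L,L]}$. The function names are hypothetical.

```python
import random

L = 5.0  # radius of the one-dimensional l1 ball; an illustrative choice

def risk(b):
    """Exact predictive risk R(b) = E(Y - X b)^2 for the assumed toy model.

    With X ~ N(0,1): E[X^2] = 1, E[X^4] = 3, E[X^6] = 15, and Var(eps) = 1,
    so E[Y^2] = 1 + 2*3 + 15 + 1 = 23 and E[XY] = 1 + 3 = 4.
    """
    return 23.0 - 8.0 * b + b * b

def clip(b, radius):
    """Project b onto the interval [-radius, radius]."""
    return max(-radius, min(radius, b))

# Best predictor in the class: the minimizer of the convex quadratic risk(b)
# over |b| <= L is the unconstrained minimizer (b = 4) clipped to the interval.
beta_star = clip(4.0, L)

def constrained_ls(n, rng):
    """Least squares through the origin on n simulated points, clipped to |b| <= L."""
    sxx = sxy = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + x ** 3 + rng.gauss(0.0, 1.0)  # deliberately not linear in x
        sxx += x * x
        sxy += x * y
    return clip(sxy / sxx, L)

rng = random.Random(2)
excess = {n: risk(constrained_ls(n, rng)) - risk(beta_star)
          for n in (100, 10_000, 100_000)}
for n, gap in excess.items():
    print(n, gap)
```

There is no true ${\beta}$ to be consistent for here, yet the excess risk ${R(\hat\beta) - R(\beta_n^*)}$ should still shrink toward zero as ${n}$ grows.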

Presistence seems to get less attention than consistency or sparsistency but I think it is the most important of the three.

Bottom line: presistence deserves more attention. And, if you have never read Greenshtein and Ritov (2004), I highly recommend that you read it.

Reference:

Greenshtein, Eitan and Ritov, Ya’acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

1. Corey
Posted September 11, 2013 at 10:52 pm

From what I’ve seen, predictive consistency in various senses and settings is always easier to achieve than parameter consistency. (There was some funky thing about this in Grünwald’s MDL book that I’ve been meaning to go back and review…)

• Jilles
Posted September 12, 2013 at 9:14 am

Indeed. Also, if only at first glance, ‘presistency’ seems to have connections to Universal Codes, and the ‘close to the best’ property sounds rather familiar from the Shtarkov/Normalized Maximum Likelihood in particular.

2. anon
Posted September 12, 2013 at 4:36 am

To prove consistency you need assumptions about the data. To prove “presistency” you need assumptions on your comparison class of estimators, B_n. The “no assumptions about the data” is usually used as a selling point for expert advice approaches. It’s ok, as long as you remember the other part: assumptions on the comparison class. But please don’t make the next step of calling the setting “adversarial.”

3. Phil Koop
Posted September 12, 2013 at 12:42 pm

Dumb question (so, suitable apologies): the convergence mode of sparsistency is not specified; I presume it is in the a.s. sense. The other two measures are required to converge only in probability. Why is it important that the convergence of sparsistency be stronger?

• normaldeviate
Posted September 12, 2013 at 12:44 pm

it is in prob
P(support(estimate) = support(truth)) –> 1

4. Geoff Gordon
Posted September 12, 2013 at 1:44 pm

I think one of the reasons presistency receives less attention (despite its importance) is that people inevitably want to interpret the regression coefficient vector beta. Unfortunately, at least for the typical case, such interpretation depends on pretty strong assumptions — the same sort of assumptions as for sparsistency, such as that your model class contains Nature’s true model.

5. José E. Chacón
Posted September 29, 2013 at 3:29 am

Recall that you do not always need to make lots of assumptions to show consistency, as highlighted in Stone, C. J. (1977) Consistent nonparametric regression. Ann. Statist., 5, 595-620, where consistency is shown with no conditions on the joint distribution of (X,Y).