Monthly Archives: July 2013

The Steep Price of Sparsity

The Steep Price of Sparsity

We all love sparse estimators these days. I am referring to things like the lasso and related variable selection methods.

But there is a dark side to sparsity. It’s what Hannes Leeb and Benedikt Potscher call the “return of the Hodges’ estimator.” Basically, any estimator that is capable of producing sparse estimators has infinitely bad minimax risk.

1. Hodges Estimator

Let’s start by recalling Hodges famous example.

Suppose that ${X_1,\ldots,X_n \sim N(\theta,1)}$. Define ${\hat\theta}$ as follows:

${\hat\theta =\overline{X}}$ if ${|\overline{X}| \geq n^{-1/4}}$

${\hat\theta=0}$ if ${|\overline{X}| < n^{-1/4}}$.

If ${\theta\neq 0}$, then ${\hat\theta}$ will equal ${\overline{X}}$ for all large ${n}$. But if ${\theta=0}$, then eventually ${\hat\theta =0}$. The estimator discovers that ${\theta}$ is 0.

This seems like a good thing. This is what we want whenever we do model selection. We want to discover that some coefficients are 0. That’s the nature of using sparse methods.

But there is a price to pay for sparsity. The Hodges estimator has the unfortunate property that the maximum risk is terrible. Indeed,

$\displaystyle \sup_\theta \mathbb{E}_\theta [n (\hat\theta-\theta)^2] \rightarrow \infty.$

Contrast this will the sample mean:

$\displaystyle \sup_\theta \mathbb{E}_\theta [n (\overline{X}-\theta)^2] =1.$

The reason for the poor behavior of the Hodges estimator — or any sparse estimator — is that there is a zone near 0 where the estimator will be very unstable. The estimator has to decide when to switch from ${\overline{X}}$ to 0 creating a zone of instability.

I plotted the risk function of the Hodges estimator here. The risk of the mle is flat. The large peaks in the risk function of the Hodges estimator are very clear (and very disturbing).

2. The Steep Price of Sparsity

Leeb and Potscher (2008) proved that this poor behavior holds for all sparse estimators and all loss functions. Suppose that

$\displaystyle Y_i = x_i^T \beta + \epsilon_i,\ \ \ i=1,\ldots, n.$

here, ${\beta\in \mathbb{R}^k}$. Let ${s(\beta)}$ be the support of ${\beta}$: ${s_j(\beta)=1}$ of ${\beta_j \neq 0}$ and ${s_j(\beta)=0}$ of ${\beta_j = 0}$. Say that ${\hat\beta}$ is sparsistent (a term invented by Pradeep Ravikumar) if, for each ${\beta}$,

$\displaystyle P^n_\beta(s(\hat\beta)=s(\beta)) \rightarrow 1$

as ${n\rightarrow \infty}$.

Leeb and Potscher (2008) showed that if ${\hat\beta}$ is sparsistent, then

$\displaystyle \sup_\beta E[ n\, ||\hat\beta-\beta||^2]\rightarrow \infty$

as ${n\rightarrow \infty}$. More generally, for any non-negative loss function ${\ell(\hat\beta-\beta)}$, we have

$\displaystyle \sup_\beta E[ \ell(n^{1/2}(\hat\beta -\beta))]\rightarrow \sup_\beta \ell(\beta).$

3. How Should We Interpret This Result?

One might object that the maximum risk is too conservative and includes extreme cases. But in this case, that is not true. The high values of the risk occur in a small neighborhood of 0. (Recall the picture of the risk of the Hodges estimator.) This is far from pathological.

Another objection is that the proof assumes ${n}$ grows while ${k}$ stays fixed. But in many applications that we are interested in, ${k}$ grows and is even possibly larger than ${n}$. This is a valid objection. On the other hand, if there is unfavorable behavior in the ideal case of fixed ${k}$, we should not be sanguine about the high-dimensional case.

I am not suggesting we should give up variable selection. I use variable selection all the time. But we should keep in mind that there is a steep price to pay for sparsistency.

References

Leeb, Hannes and Potscher, Benedikt M. (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. Journal of Econometrics, 142, 201-211.

Leeb, Hannes and Potscher, Benedikt M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21-59.

THE FIVE: Jeff Leek’s Challenge

Jeff Leek, over at Simply Statistics asks an interesting question: What are the 5 most influential statistics papers of 2000-2010?

I found this to be incredibly difficult to answer. Eventually, I came up with this list:

Donoho, David (2006). Compressed sensing. IEEE Transactions on Information Theory. 52, 1289-1306.

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Meinshausen, Nicolai and Buhlmann, Peter. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

Efron, Bradley and Hastie, Trevor and Johnstone, Iain and Tibshirani, Robert. (2004). Least angle regression. The Annals of statistics, 32, 407-499.

Hofmann, Thomas and Scholkopf, Bernhard and Smola, Alexander J. (2008). Kernel methods in machine learning. The Annals of Statistics. 1171–1220.

These are all very good papers. These papers had a big impact on me. More precisely, they are representative of ideas that had an impact on me. It’s more like there are clusters of papers and these are prototypes from those clusters. I am not really happy with my list. I feel like I must be forgetting some really important papers. Perhaps I am just getting old and forgetful. Or maybe our field is not driven by specific papers.

What five would you select? (Please post them at Jeff’s blog too.)

LOST CAUSES IN STATISTICS II: Noninformative Priors

LOST CAUSES IN STATISTICS II: Noninformative Priors

I thought I would post at a higher frequency in the summer. But I have been working hard to finish some papers which has kept me quite busy. So, apologies for the paucity of posts.

Today I’ll discuss another lost cause: noninformative priors.

I like to say that noninformative priors are the perpetual motion machines of statistics. Everyone wants one but they don’t exist.

By definition, a prior represents information. So it should come as no surprise that a prior cannot represent lack of information.

The first “noninformative prior” was of course the flat prior. The major flaw with this prior is lack of invariance: if it is flat in one parameterization it will not be flat in most other parameterizations. Flat prior have lots of other problems too. See my earlier post here.

The most famous noninformative prior (I’ll stop putting quotes around the phrase from now on) is Jeffreys prior which is proportional to the square root of the determinant of the Fisher information matrix. While this prior is invariant, it can still have undesirable properties. In particular, while it may seem noninformative for a parameter ${\theta}$ it can end up being highly informative for functions of ${\theta}$. For example, suppose that ${Y}$ is multivariate Normal with mean vector ${\theta}$ and identity covariance. The Jeffreys prior is the flat prior ${\pi(\theta) \propto 1}$. Now suppose that we want to infer ${\psi = \sum_j \theta_j^2}$. The resulting posterior for ${\psi}$ is a disaster. The coverage of the Bayesian ${1-\alpha}$ posterior interval can be close to 0.

This is a general problem with noninformative priors. If ${\pi(\theta)}$ is somehow noninformative for ${\theta}$, it may still be highly informative for sub-parameters, that is for functions ${\psi = g(\theta)}$ where ${\theta\in \mathbb{R}^d}$ and ${\psi: \mathbb{R}^d \rightarrow \mathbb{R}}$.

Jim Berger and Jose Bernardo wrote a series of interesting papers about priors that were targeted to be noninformative for particular functions of ${\theta}$. These are often called reference priors. But what if you are interested in many functions of ${\theta}$. Should you use a different prior for each function of interest?

A more fundamental question is: what does it mean for a prior to be noninformative? Of course, people have argued about this for many, many years. One definition, which has the virtue of being somewhat precise, is that a prior is noninformative if the ${1-\alpha}$ posterior regions have frequentist coverage equal (approximately) to ${1-\alpha}$. These are sometimes called “matching priors.”

In general, it is hard to construct matching priors especially in high-dimensional complex models. But matching priors raise a fundamental question: if your goal is to match frequentist coverage, why bother with Bayes at all? Just use a frequentist confidence interval.

These days I think that most people agree that the virtue of Bayesian methods is that it gives you a way to include prior information in a systematic way. There is no reason to formulate a “noninformative prior.”

On the other hand, in practice, we often deal with very complex, high-dimensional models. Can we really formulate a meaningful informative prior in such problems? And if we do, will anyone care about our inferences?

In 1996, I wrote a review paper with Rob Kass on noninformative priors (Kass and Wasserman 1996). We emphasized that a better term might be “default prior” since that seems more honest and promises less. One of our conclusions was:

“We conclude that the problems raised by the research on priors chosen by formal rules are serious and may not be dismissed lightly: When sample sizes are small (relative the number of parameters being estimated), it is dangerous to put faith in any default solution; but when asymptotics take over, Jeffreys’s rules and their variants remain reasonable choices.”

Looking at this almost twenty years later, the one thing that has changed is the “the number of parameters being estimated” which these days is often very, very large.

My conclusion: noninformative priors are a lost cause.

Reference

Kass, Robert E and Wasserman, Larry. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343-1370.