## Monthly Archives: September 2013

### Estimating Undirected Graphs Under Weak Assumptions

Mladen Kolar, Alessandro Rinaldo and I have uploaded a paper to arXiv entitled “Estimating Undirected Graphs Under Weak Assumptions.”

As the name implies, the goal is to estimate an undirected graph ${G}$ from random vectors ${Y_1,\ldots, Y_n \sim P}$. Here, each ${Y_i = (Y_i(1),\ldots, Y_i(D))\in\mathbb{R}^D}$ is a vector with ${D}$ coordinates, or features.

The graph ${G}$ has ${D}$ nodes, one for each feature. We put an edge between nodes ${j}$ and ${k}$ if the partial correlation ${\theta_{jk}\neq 0}$. The partial correlation ${\theta_{jk}}$ is $\displaystyle \theta_{jk} = - \frac{\Omega_{jk}}{\sqrt{\Omega_{jj}\Omega_{kk}}}$

where ${\Omega = \Sigma^{-1}}$ and ${\Sigma}$ is the ${D\times D}$ covariance matrix for ${Y_i}$.

At first sight, the problem is easy. We estimate ${\Sigma}$ with the sample covariance matrix $\displaystyle S = \frac{1}{n}\sum_{i=1}^n (Y_i - \overline{Y})(Y_i - \overline{Y})^T.$

Then we estimate ${\Omega}$ with ${\hat\Omega = S^{-1}}$. We can then use the bootstrap to get confidence intervals for each ${\theta_{jk}}$ and then we put an edge between nodes ${j}$ and ${k}$ if the confidence interval excludes 0.
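The low-dimensional recipe just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the procedure analyzed in the paper: the function names `partial_correlations` and `bootstrap_graph`, the number of resamples `B`, and the use of simple percentile intervals with no correction for testing many edges at once are all my own choices for the sketch. It also assumes ${D}$ is much smaller than ${n}$, so that every resampled covariance matrix is invertible.

```python
import numpy as np


def partial_correlations(Y):
    """Sample partial correlations theta_jk = -Omega_jk / sqrt(Omega_jj * Omega_kk).

    Y is an (n, D) array with D < n. The diagonal is set to 1 by convention.
    """
    S = np.cov(Y, rowvar=False, bias=True)  # 1/n normalisation, as in the text
    Omega = np.linalg.inv(S)                # plug-in estimate of the precision matrix
    d = np.sqrt(np.diag(Omega))
    theta = -Omega / np.outer(d, d)
    np.fill_diagonal(theta, 1.0)
    return theta


def bootstrap_graph(Y, B=500, alpha=0.05, seed=0):
    """Percentile-bootstrap intervals for each theta_jk (an assumed, simple choice).

    Returns a boolean adjacency matrix plus the lower and upper interval limits.
    An edge j-k is drawn when the interval for theta_jk excludes 0.
    """
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    boot = np.empty((B, D, D))
    for b in range(B):
        idx = rng.integers(0, n, size=n)    # resample rows with replacement
        boot[b] = partial_correlations(Y[idx])
    lo = np.quantile(boot, alpha / 2, axis=0)
    hi = np.quantile(boot, 1 - alpha / 2, axis=0)
    edges = (lo > 0) | (hi < 0)
    np.fill_diagonal(edges, False)          # no self-loops
    return edges, lo, hi
```

Calling `bootstrap_graph(Y)` on an `(n, D)` data matrix returns the estimated adjacency matrix. The intervals here are marginal, one per pair of nodes, so they say nothing about simultaneous coverage over all edges.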

But how close is the bootstrap distribution ${\hat F}$ to the true distribution ${F}$ of ${\hat\theta_{jk}}$? Our paper provides a finite sample bound on ${\sup_t | \hat F(t) - F(t)|}$. Not surprisingly, the bounds are reasonable when ${D < n}$.

What happens when ${D>n}$? In that case, estimating the distribution of ${\hat\theta_{jk}}$ is not feasible unless one imposes strong assumptions. With these extra assumptions, one can use lasso-style technology. The problem is that the validity of the inferences then depends heavily on strong assumptions such as sparsity and eigenvalue restrictions, which are not testable if ${D > n}$. Instead, we take an atavistic approach: we first perform some sort of dimension reduction and then apply the bootstrap. We basically give up on the original graph and instead estimate the graph for a dimension-reduced version of the problem.

If we were in a pure prediction framework I would be happy to use lasso-style technology. But, since we are engaged in inference, we take this more cautious approach.

One of the interesting parts of our analysis is that it leverages recent work on high-dimensional Berry-Esseen theorems, namely the results by Victor Chernozhukov, Denis Chetverikov and Kengo Kato, which can be found here.

The whole issue of what assumptions are reasonable in high-dimensional inference is quite interesting. I’ll have more to say about the role of assumptions in high dimensional inference shortly. Stay tuned. In the meantime, if I have managed to spark your interest, please have a look at our paper.

### Two Announcements

A couple of announcements:

First: A message from Jeff Leek:

“We are hosting an “unconference” on Google Hangouts. We got some really amazing speakers to talk about the future of statistics. I wonder if you could help advertise the unconference on your blogs. Here is our post:

http://simplystatistics.org/2013/09/17/announcing-the-simply-statistics-unconference-on-the-future-of-statistics-futureofstats/

We also hope people will tweet their own ideas with the hashtag #futureofstatistics on Twitter.”

Second:

There is an interesting discussion at Deborah Mayo’s blog:

It is a guest post by Owhadi, Scovel, and Sullivan on their paper “When Bayesian Inference Shatters.”

That’s all

### Consistency, Sparsistency and Presistency

There are many ways to discuss the quality of estimators in statistics. Today I want to review three common notions: presistency, consistency and sparsistency. I will discuss them in the context of linear regression. (Yes, that’s presistency, not persistency.)

Suppose the data are ${(X_1,Y_1),\ldots, (X_n,Y_n)}$ where $\displaystyle Y_i = \beta^T X_i + \epsilon_i,$ ${Y_i\in\mathbb{R}}$, ${X_i\in\mathbb{R}^d}$ and ${\beta\in\mathbb{R}^d}$. Let ${\hat\beta=(\hat\beta_1,\ldots,\hat\beta_d)}$ be an estimator of ${\beta=(\beta_1,\ldots,\beta_d)}$.

Probably the most familiar notion is consistency. We say that ${\hat\beta}$ is consistent if $\displaystyle ||\hat\beta - \beta|| \stackrel{P}{\rightarrow} 0$

as ${n \rightarrow \infty}$.

In recent years, people have become interested in sparsistency (a term invented by Pradeep Ravikumar). Define the support of ${\beta}$ to be the location of the nonzero elements: $\displaystyle {\rm supp}(\beta) = \{j:\ \beta_j \neq 0\}.$

Then ${\hat\beta}$ is sparsistent if $\displaystyle \mathbb{P}({\rm supp}(\hat\beta) = {\rm supp}(\beta) ) \rightarrow 1$

as ${n\rightarrow\infty}$.
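The two definitions measure different things, and a small numerical illustration may help. The helper names `support`, `l2_error` and `support_recovered`, and the tolerance `tol` (useful for estimators that return tiny nonzero values instead of exact zeros), are assumptions of this sketch and not standard library functions.

```python
import numpy as np


def support(beta, tol=0.0):
    """Indices j with |beta_j| > tol; tol=0 matches the definition in the text."""
    beta = np.asarray(beta, dtype=float)
    return set(np.flatnonzero(np.abs(beta) > tol).tolist())


def l2_error(beta_hat, beta):
    """Euclidean distance ||beta_hat - beta||, the quantity in the consistency definition."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta, dtype=float)
    return float(np.linalg.norm(diff))


def support_recovered(beta_hat, beta, tol=0.0):
    """True when the estimated support equals the true support."""
    return support(beta_hat, tol) == support(beta)


beta = [2.0, 0.0, 0.0, -1.0]
beta_hat = [1.9, 0.01, 0.0, -1.1]   # close in norm, but one spurious nonzero entry

print(l2_error(beta_hat, beta))            # small
print(support_recovered(beta_hat, beta))   # False: the supports differ
```

An estimate can be close to ${\beta}$ in norm and still get the support wrong, which is why sparsistency is a separate requirement.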

The last one is what I like to call presistence. I just invented this word. Some people call it risk consistency or predictive consistency. Greenshtein and Ritov (2004) call it persistency but this creates confusion for those of us who work with persistent homology. Of course, presistence comes from shortening “predictive consistency.”

Let ${(X,Y)}$ be a new pair. The predictive risk of ${\beta}$ is $\displaystyle R(\beta) = \mathbb{E}(Y-X^T \beta)^2.$

Let ${{\cal B}_n}$ be some set of ${\beta}$’s and let ${\beta_n^*}$ be the best ${\beta}$ in ${{\cal B}_n}$. That is, ${\beta_n^*}$ minimizes ${R(\beta)}$ subject to ${\beta \in {\cal B}_n}$. Then ${\hat\beta}$ is presistent if $\displaystyle R(\hat\beta) - R(\beta_n^*) \stackrel{P}{\rightarrow} 0.$

This means that ${\hat\beta}$ predicts nearly as well as the best choice of ${\beta}$. As an example, consider the set of sparse vectors $\displaystyle {\cal B}_n = \Bigl\{ \beta:\ \sum_{j=1}^d |\beta_j| \leq L\Bigr\}.$

(The dimension ${d}$ is allowed to depend on ${n}$ which is why we have a subscript on ${{\cal B}_n}$.) In this case, ${\beta_n^*}$ can be interpreted as the best sparse linear predictor. The corresponding sample estimator ${\hat\beta}$, which minimizes the sum of squares subject to being in ${{\cal B}_n}$, is the lasso estimator. Greenshtein and Ritov (2004) proved that the lasso is presistent under essentially no conditions.
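To make the constrained form concrete, here is a minimal sketch that minimizes the average squared error over the ${\ell_1}$ ball of radius ${L}$ by projected gradient descent. The solver is my choice for illustration (it is not taken from Greenshtein and Ritov), and the names `project_l1_ball` and `constrained_lasso`, the iteration count `n_iter`, and the fixed step size are assumptions of the sketch. The projection uses the usual sort-and-threshold method and assumes ${L > 0}$.

```python
import numpy as np


def project_l1_ball(v, L):
    """Euclidean projection of v onto {b : sum_j |b_j| <= L}, assuming L > 0."""
    if np.abs(v).sum() <= L:
        return v.copy()                       # already inside the ball
    u = np.sort(np.abs(v))[::-1]              # magnitudes in decreasing order
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * k > css - L)[0][-1]  # last index that stays nonzero
    tau = (css[rho] - L) / (rho + 1.0)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def constrained_lasso(X, y, L, n_iter=2000):
    """Minimize (1/n) * ||y - X b||^2 subject to ||b||_1 <= L."""
    n, d = X.shape
    # Step size 1/Lipschitz, where the gradient is Lipschitz with constant (2/n) * sigma_max(X)^2.
    step = n / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y) / n
        beta = project_l1_ball(beta - step * grad, L)
    return beta
```

Nothing in this code refers to a true ${\beta}$ or to the linear model being correct. It only needs the pairs ${(X_i, Y_i)}$ and the radius ${L}$, which is the spirit of the presistence result.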

This is the main message of this post: To establish consistency or sparsistency, we have to make lots of assumptions. In particular, we need to assume that the linear model is correct. But we can prove presistence with virtually no assumptions. In particular, we do not have to assume that the linear model is correct.
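A short calculation shows why so little is needed. What follows is a sketch of the standard argument in notation I am introducing here for illustration, not a quotation from the paper: let ${Z = (Y, X)}$, ${\gamma = (-1, \beta)}$, ${\Sigma = \mathbb{E}(ZZ^T)}$, let ${\hat\Sigma}$ be its sample version, let ${\hat R}$ be the empirical risk, and let ${\Delta_n = \max_{j,k}|\hat\Sigma_{jk} - \Sigma_{jk}|}$.

```latex
\begin{align*}
R(\beta) &= \mathbb{E}(Y - X^T\beta)^2 = \gamma^T \Sigma \gamma,
\qquad
\hat R(\beta) = \frac{1}{n}\sum_{i=1}^n (Y_i - X_i^T\beta)^2 = \gamma^T \hat\Sigma \gamma, \\[4pt]
\sup_{\beta \in {\cal B}_n} |\hat R(\beta) - R(\beta)|
  &= \sup_{\beta \in {\cal B}_n} |\gamma^T (\hat\Sigma - \Sigma) \gamma|
  \leq \sup_{\beta \in {\cal B}_n} ||\gamma||_1^2 \, \Delta_n
  \leq (1+L)^2 \Delta_n, \\[4pt]
R(\hat\beta) - R(\beta_n^*)
  &= \bigl[R(\hat\beta) - \hat R(\hat\beta)\bigr]
   + \bigl[\hat R(\hat\beta) - \hat R(\beta_n^*)\bigr]
   + \bigl[\hat R(\beta_n^*) - R(\beta_n^*)\bigr]
  \leq 2 (1+L)^2 \Delta_n .
\end{align*}
```

The middle bracket is at most 0 because ${\hat\beta}$ minimizes ${\hat R}$ over ${{\cal B}_n}$. So presistence only requires ${\Delta_n \stackrel{P}{\rightarrow} 0}$, a statement about sample averages converging to their means. Under tail conditions on ${Z}$ (bounded coordinates, for example), ${\Delta_n}$ is of order ${\sqrt{\log d / n}}$, which allows ${d}$ to grow with ${n}$. The linear model never enters.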

Presistence seems to get less attention than consistency or sparsistency but I think it is the most important of the three.

Bottom line: presistence deserves more attention. And, if you have never read Greenshtein and Ritov (2004), I highly recommend that you read it.

Reference:

Greenshtein, Eitan and Ritov, Ya’acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

### Is Bayesian Inference a Religion?

Time for a provocative post.

There is a nice YouTube video with Tony O’Hagan interviewing Dennis Lindley. Of course, Dennis is a legend and his impact on the field of statistics is huge.

At one point, Tony points out that some people liken Bayesian inference to a religion. Dennis claims this is false. Bayesian inference, he correctly points out, starts with some basic axioms and then the rest follows by deduction. This is logic, not religion.

I agree that the mathematics of Bayesian inference is based on sound logic. But, with all due respect, I think Dennis misunderstood the question. When people say that “Bayesian inference is like a religion,” they are not referring to the logic of Bayesian inference. They are referring to how adherents of Bayesian inference behave.

(As an aside, detractors of Bayesian inference do not deny the correctness of the logic. They just don’t think the axioms are relevant for data analysis. For example, no one doubts the axioms of Peano arithmetic. But that doesn’t imply that arithmetic is the foundation of statistical inference. But I digress.)

The vast majority of Bayesians are pragmatic, reasonable people. But there is a sub-group of die-hard Bayesians who do treat Bayesian inference like a religion. By this I mean:

1. They are very cliquish.
2. They have a strong emotional attachment to Bayesian inference.
3. They are overly sensitive to criticism.
4. They are unwilling to entertain the idea that Bayesian inference might have flaws.
5. When someone criticizes Bayes, they think that critic just “doesn’t get it.”
6. They mock people with differing opinions.

To repeat: I am not referring to most Bayesians. I am referring to a small subgroup. And, yes, this subgroup does treat it like a religion. I speak from experience because I went to all the Bayesian conferences for many years, and I watched witty people at the end-of-conference Cabaret perform plays and songs that merrily mocked frequentists. It was all in fun and I enjoyed it. But you won’t see that at a non-Bayesian conference.

No evidence you can provide would ever make the die-hards doubt their ideas. They do not take Sir David Cox, Brad Efron and other giants in our field who have doubts about Bayesian inference seriously, because those people “just don’t get it.”

It is my belief that Dennis agrees with me on this. (If you are reading this Dennis, and you disagree, please let me know.) Here is my evidence. Many years ago, there was a debate about whether to start a Bayesian journal. Dennis argued strongly against it precisely because he feared it would make Bayesian inference appear like a sect. Instead, he argued, Bayesians should just think of themselves as statisticians and send their papers to JASA, the Annals, etc. I think Dennis was 100 percent correct. Dennis lost this fight, and we ended up with the journal Bayesian Analysis.

So is Bayesian inference a religion? For most Bayesians: no. But for the thin-skinned, inflexible die-hards who have attached themselves so strongly to their approach to inference that they make fun of, or get mad at, critics: yes, it is a religion.