## High Dimensional Undirected Graphical Models

High Dimensional Undirected Graphical Models
Larry Wasserman

Graphical models have now become a common tool in applied statistics. Here is a graph representing stock data: Here is one for proteins: (Maslov and Sneppen, 2002). And here is one for the voting records of senators: (Banerjee, El Ghaoui, and d’Aspremont ,2008). Like all statistical methods, estimating graphs is challenging in high dimensional problems.

1. Undirected Graphical Models

Let ${X=(X_1,\ldots,X_d)}$ be a random vector with distribution ${P}$. A graph ${G}$ for ${P}$ has ${d}$ nodes, or vertices, one for each variable. Some pairs of vertices are connected by edges. We can represent the edges as a set ${E}$ of unordered pairs: ${(j,k)\in E}$ if and only if there is an edge between ${X_j}$ and ${X_k}$.

We omit an edge between ${X_i}$ and ${X_j}$ to mean that ${X_i}$ and ${X_j}$ are independent, given the other variables which we write as $\displaystyle X_i \amalg X_j | rest$

where “rest” denotes the rest of the variables. For example, in this graph $\displaystyle X_1 \ \ \rule{2in}{1mm} \ \ X_2 \ \ \rule{2in}{1mm}\ \ X_3$

we see that ${X_1\amalg X_3 | X_2}$ since there is not edge between ${X_1}$ and ${X_3}$.

Now suppose we observe ${n}$ random vectors $\displaystyle X^{(1)},\ldots, X^{(n)} \sim P.$

The goal is to estimate ${G}$ from the data. As an example, imagine that ${X^{(i)}}$ is a vector of gene expression levels for subject ${i}$. The graph gives us an idea of how the genes are related.

2. The Gaussian Case

Things are easiest if we assume that ${P}$ is a multivariate Gaussian. Suppose ${X\sim N(\mu,\Sigma)}$. In this case, there is no edge between ${X_j}$ and ${X_k}$ iff ${\Omega_{jk}=0}$ where ${\Omega = \Sigma^{-1}}$. If the dimension ${d}$ of ${X}$ is smaller than the sample size ${n}$, then estimating the graph is straightforward. We can use the sample covariance matrix $\displaystyle S = \frac{1}{n}\sum_{i=1}^n (X^{(i)}-\overline{X})(X^{(i)}-\overline{X})^T$

to estimate ${\Sigma}$ and we use ${S^{-1}}$ to estimate ${\Omega}$. It is then easy to test ${H_0: \Omega_{jk}=0}$. We put an edge between ${j}$ and ${k}$ if the test rejects. The R package SIN will do this for you.

When ${d>n}$, the simple approach above won’t work. Perhaps the easiest approach is the method due to Meinshausen and B{ü}hlmann (2006) which John Lafferty has aptly named parallel regression. The idea is this: you regress ${X_1}$ on all the other variables, then you regress ${X_2}$ on all the other variables, etc. For each regression, you use the lasso which yields sparse estimators. When you regress ${X_j}$ on the others, there will be a regression coefficient for ${X_k}$. Call this ${\hat\beta_{jk}}$. Similarly, when you regress ${X_k}$ on the others, there will be a regression coefficient for ${X_j}$. Call this ${\hat\beta_{kj}}$. If ${\hat\beta_{jk}\neq 0}$ and ${\hat\beta_{kj}\neq 0}$ then you put an edge between ${X_j}$ and ${X_k}$.

An alternative — the glasso (graphical lasso) — is to maximize the Gaussian log-likelihood ${\ell(\mu,\Omega)}$ with a penalty: $\displaystyle \ell(\mu,\Omega) + \lambda \sum_{j\neq k} |\Omega_{jk}|.$

The resulting estimator ${\hat\Omega}$ is sparse: many of its elements are 0. The non-zeroes denote edges. A fast implementation in R due to Han Liu can be found here.

3. Relaxing Gaussianity

The biggest drawback of the glasso is the assumption of Gaussianity. With my colleagues John Lafferty and Han Liu and others, I have done some work on more nonparametric approaches.

Let me briefly describe the approach in Liu, Xu, Gu, Gupta, Lafferty and Wasserman (2011). The idea is to restrict the graph to be a forest, which is a graph with no cycles.

When ${G}$ is a forest, the density ${p}$ can be written as $\displaystyle p(y) = \prod_{(j,k)\in E} \frac{p(y_j,y_k)}{p(y_j)p(y_k)}\prod_{s=1}^d p(y_s)$

where ${E}$ is the set of edges. We can estimate the density by inserting estimates of the univariate and bivariate marginals: $\displaystyle \hat{p}(y) = \prod_{(j,k)\in E} \frac{\hat p(y_j,y_k)}{\hat p(y_j)\hat p(y_k)}\prod_{s=1}^d \hat p(y_s).$

But to find the graph we need to estimate the edge set ${E}$.

Any forest ${F}$ defines a density ${p_F(y)}$ by the above equation. If the true density ${p}$ were known, we could choose ${F}$ to minimize the Kullback-Leibler distance $\displaystyle D(p,p_F) = \int p(y) \log\left(\frac{p(y)}{p_F(y)} \right) dy.$

This maximization can be done by Kruskal’s algorithm (Kruskal 1956) also known, in this context, as the Chow-Liu algorithm (Chow and Liu 1968). It works like this.

For each pair ${(X_j,X_k)}$ let $\displaystyle I(Y_j,Y_k) = \int p(y_j,y_k) \log \frac{p(y_j,y_k)}{p(y_j)p(y_k)} dy_j\, dy_k$

be the mutual information between ${Y_j}$ and ${Y_k}$. We start with an empty tree and add edges greedily according to the value of ${I(Y_j,Y_k)}$. First we connect the pair of variables with the largest ${I(Y_j,Y_k)}$ and then the second largest and so on. However, we do not add an edge if it forms a cycle.

Of course, we don’t know ${p}$ so, instead, we use the data to estimate the mutual information ${I(Y_j,Y_k)}$. If we simply run the Chow-Liu algorithm, we get a tree, that is, a fully connected forest. But we also use a hold-out sample to decide on when to stop adding edges. And that’s it.

There are other ways to relax the Gaussian assumption. For example, here is an approach that uses rank statistics and copulas.

4. The Future?

Not too long ago, estimating a high dimensional graph was unthinkable. Now they are used routinely. The biggest thing missing is a good measure of uncertainty. One can do some obvious things, such as bootstrapping, but it is not clear that the output is meaningful.

5. References

Banerjee, O. and El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485-516.

Chow, C. and Liu, C. (1968). IEEE Transactions on Information Theory, 14, 462-467.

Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical society, 7, 48-50.

Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest Density Estimation. Journal of Machine Learning Research, 12, 907-951.

Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. Science, 296, 910.

Meinshausen, N. and B{ü}hlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

1. Christian Hennig
Posted September 19, 2012 at 7:57 am | Permalink

I like your way of running this blog. There is always something to learn in it for which much more time would be needed to pick it up elsewhere.
The thing I’m interested in here is that given that we may not believe in precise independence and correlations being exactly zero. Testing and estimation are very different in this respect, because it is a very different statement to say that “the observed data are consistent with a model with zero correlation” than “we estimate the correlation to be precisely zero” (i.e., that’s our best guess).
In fact one wouldn’t be interested in estimating the non-edges where we have precise independence (for large enough n that should give a complete graph) but rather those where dependence is weak (in a well defined way such as absolut value of correlation smaller than epsilon), right?
Do you think that this should be taken into account somehow?

• normaldeviate
Posted September 19, 2012 at 8:30 am | Permalink

Yes indeed. I would prefer to have a confidence interval for the amount of conditional
dependence between each pair. Then I would show a sequence of graphs with different degrees
of dependence.
Larry

2. ndronen
Posted September 20, 2012 at 4:21 pm | Permalink

The stock data figure isn’t accompanied by a reference, so I don’t have any context for it. Was it originally pairwise correlations of asset prices?