High Dimensional Undirected Graphical Models

Larry Wasserman

Graphical models have now become a common tool in applied statistics. Here is a graph representing stock data:

Here is one for proteins: (Maslov and Sneppen, 2002).

And here is one for the voting records of senators: (Banerjee, El Ghaoui, and d’Aspremont ,2008).

Like all statistical methods, estimating graphs is challenging in high dimensional problems.

**1. Undirected Graphical Models **

Let be a random vector with distribution . A graph for has nodes, or vertices, one for each variable. Some pairs of vertices are connected by edges. We can represent the edges as a set of unordered pairs: if and only if there is an edge between and .

We omit an edge between and to mean that and are independent, given the other variables which we write as

where “rest” denotes the rest of the variables. For example, in this graph

we see that since there is not edge between and .

Now suppose we observe random vectors

The goal is to estimate from the data. As an example, imagine that is a vector of gene expression levels for subject . The graph gives us an idea of how the genes are related.

**2. The Gaussian Case **

Things are easiest if we assume that is a multivariate Gaussian. Suppose . In this case, there is no edge between and iff where . If the dimension of is smaller than the sample size , then estimating the graph is straightforward. We can use the sample covariance matrix

to estimate and we use to estimate . It is then easy to test . We put an edge between and if the test rejects. The R package SIN will do this for you.

When , the simple approach above won’t work. Perhaps the easiest approach is the method due to Meinshausen and B{ü}hlmann (2006) which John Lafferty has aptly named *parallel regression*. The idea is this: you regress on all the other variables, then you regress on all the other variables, etc. For each regression, you use the lasso which yields sparse estimators. When you regress on the others, there will be a regression coefficient for . Call this . Similarly, when you regress on the others, there will be a regression coefficient for . Call this . If and then you put an edge between and .

An alternative — the glasso (graphical lasso) — is to maximize the Gaussian log-likelihood with a penalty:

The resulting estimator is sparse: many of its elements are 0. The non-zeroes denote edges. A fast implementation in R due to Han Liu can be found here.

**3. Relaxing Gaussianity **

The biggest drawback of the glasso is the assumption of Gaussianity. With my colleagues John Lafferty and Han Liu and others, I have done some work on more nonparametric approaches.

Let me briefly describe the approach in Liu, Xu, Gu, Gupta, Lafferty and Wasserman (2011). The idea is to restrict the graph to be a forest, which is a graph with no cycles.

When is a forest, the density can be written as

where is the set of edges. We can estimate the density by inserting estimates of the univariate and bivariate marginals:

But to find the graph we need to estimate the edge set .

Any forest defines a density by the above equation. If the true density were known, we could choose to minimize the Kullback-Leibler distance

This maximization can be done by Kruskal’s algorithm (Kruskal 1956) also known, in this context, as the Chow-Liu algorithm (Chow and Liu 1968). It works like this.

For each pair let

be the mutual information between and . We start with an empty tree and add edges greedily according to the value of . First we connect the pair of variables with the largest and then the second largest and so on. However, we do not add an edge if it forms a cycle.

Of course, we don’t know so, instead, we use the data to estimate the mutual information . If we simply run the Chow-Liu algorithm, we get a tree, that is, a fully connected forest. But we also use a hold-out sample to decide on when to stop adding edges. And that’s it.

There are other ways to relax the Gaussian assumption. For example, here is an approach that uses rank statistics and copulas.

**4. The Future? **

Not too long ago, estimating a high dimensional graph was unthinkable. Now they are used routinely. The biggest thing missing is a good measure of uncertainty. One can do some obvious things, such as bootstrapping, but it is not clear that the output is meaningful.

**5. References **

Banerjee, O. and El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. *The Journal of Machine Learning Research*, 9, 485-516.

Chow, C. and Liu, C. (1968). *IEEE Transactions on Information Theory*, 14, 462-467.

Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. *Proceedings of the American Mathematical society*, 7, 48-50.

Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest Density Estimation. *Journal of Machine Learning Research*, 12, 907-951.

Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. *Science*, 296, 910.

Meinshausen, N. and B{ü}hlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. *The Annals of Statistics*, 34, 1436-1462.

## 4 Comments

I like your way of running this blog. There is always something to learn in it for which much more time would be needed to pick it up elsewhere.

The thing I’m interested in here is that given that we may not believe in precise independence and correlations being exactly zero. Testing and estimation are very different in this respect, because it is a very different statement to say that “the observed data are consistent with a model with zero correlation” than “we estimate the correlation to be precisely zero” (i.e., that’s our best guess).

In fact one wouldn’t be interested in estimating the non-edges where we have precise independence (for large enough n that should give a complete graph) but rather those where dependence is weak (in a well defined way such as absolut value of correlation smaller than epsilon), right?

Do you think that this should be taken into account somehow?

Yes indeed. I would prefer to have a confidence interval for the amount of conditional

dependence between each pair. Then I would show a sequence of graphs with different degrees

of dependence.

Larry

The stock data figure isn’t accompanied by a reference, so I don’t have any context for it. Was it originally pairwise correlations of asset prices?

It is log differences between successive days for S and P 500 stock.

See:

http://www.e-publications.org/ims/submission/index.php/STS/user/submissionFile/10637?confirm=67dc1223

LW

## One Trackback

[…] High Dimensional Undirected Graphical Models by Larry Wasserman. […]