High Dimensional Undirected Graphical Models
Larry Wasserman
Graphical models have now become a common tool in applied statistics. Here is a graph representing stock data:
Here is one for proteins: (Maslov and Sneppen, 2002).
And here is one for the voting records of senators: (Banerjee, El Ghaoui, and d'Aspremont, 2008).
Like all statistical methods, estimating graphs is challenging in high dimensional problems.
1. Undirected Graphical Models
Let $X = (X_1, \ldots, X_d)$ be a random vector with distribution $P$. A graph $G$ for $X$ has $d$ nodes, or vertices, one for each variable. Some pairs of vertices are connected by edges. We can represent the edges as a set $E$ of unordered pairs: $(j,k) \in E$ if and only if there is an edge between $X_j$ and $X_k$.
We omit an edge between $X_j$ and $X_k$ to mean that $X_j$ and $X_k$ are independent, given the other variables, which we write as $X_j \perp X_k \mid \text{rest}$, where "rest" denotes the rest of the variables. For example, in a graph with no edge between $X_1$ and $X_3$, we see that $X_1 \perp X_3 \mid \text{rest}$.
Now suppose we observe $n$ random vectors $X^{(1)}, \ldots, X^{(n)} \sim P$. The goal is to estimate $G$ from the data. As an example, imagine that $X^{(i)}$ is a vector of gene expression levels for subject $i$. The graph gives us an idea of how the genes are related.
2. The Gaussian Case
Things are easiest if we assume that $X$ is multivariate Gaussian. Suppose $X \sim N(\mu, \Sigma)$. In this case, there is no edge between $X_j$ and $X_k$ iff $\Omega_{jk} = 0$, where $\Omega = \Sigma^{-1}$. If the dimension $d$ of $X$ is smaller than the sample size $n$, then estimating the graph is straightforward. We can use the sample covariance matrix $S$ to estimate $\Sigma$ and we use $S^{-1}$ to estimate $\Omega$. It is then easy to test $H_0: \Omega_{jk} = 0$. We put an edge between $X_j$ and $X_k$ if the test rejects. The R package SIN will do this for you.
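As a rough illustration (a hand-rolled sketch, not the SIN package itself), one can test each partial correlation with Fisher's z-transform; the significance level and the standard error formula below are standard choices, not taken from the post:

```r
# Minimal sketch: estimate the graph when d < n by testing whether each
# partial correlation is zero, using Fisher's z-transform.
estimate_gaussian_graph <- function(X, alpha = 0.05) {
  n <- nrow(X); d <- ncol(X)
  Omega <- solve(cov(X))                     # estimated precision matrix
  # partial correlation between j and k given the rest:
  P <- -Omega / sqrt(outer(diag(Omega), diag(Omega)))
  E <- matrix(FALSE, d, d)
  for (j in 1:(d - 1)) for (k in (j + 1):d) {
    z  <- 0.5 * log((1 + P[j, k]) / (1 - P[j, k]))  # Fisher z-transform
    se <- 1 / sqrt(n - d - 1)                # conditioning on d - 2 variables
    if (abs(z) / se > qnorm(1 - alpha / 2)) E[j, k] <- E[k, j] <- TRUE
  }
  E                                          # adjacency matrix of the graph
}
```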
When $d > n$, the simple approach above won't work. Perhaps the easiest approach is the method due to Meinshausen and Bühlmann (2006), which John Lafferty has aptly named parallel regression. The idea is this: you regress $X_1$ on all the other variables, then you regress $X_2$ on all the other variables, and so on. For each regression, you use the lasso, which yields sparse estimators. When you regress $X_j$ on the others, there will be a regression coefficient for $X_k$. Call this $\hat\beta_{jk}$. Similarly, when you regress $X_k$ on the others, there will be a regression coefficient for $X_j$. Call this $\hat\beta_{kj}$. If $\hat\beta_{jk} \neq 0$ and $\hat\beta_{kj} \neq 0$, then you put an edge between $X_j$ and $X_k$.
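Here is a minimal sketch of parallel regression using the glmnet package. Picking each lambda by cross-validation is an assumption of this sketch (the post doesn't specify how to choose it); the "and" rule matches the description above:

```r
# Sketch of Meinshausen-Bühlmann parallel regression with glmnet:
# run d lasso regressions, then put an edge where both coefficients
# are nonzero.
library(glmnet)

mb_graph <- function(X) {
  d <- ncol(X)
  B <- matrix(0, d, d)                 # B[j, k] = coefficient of X_k when
  for (j in 1:d) {                     # regressing X_j on the rest
    fit <- cv.glmnet(X[, -j], X[, j])  # lasso with cross-validated lambda
    beta <- as.numeric(as.matrix(coef(fit, s = "lambda.min")))[-1]
    B[j, -j] <- beta                   # drop the intercept, keep the rest
  }
  (B != 0) & (t(B) != 0)               # edge iff both directions are nonzero
}
```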
An alternative, the glasso (graphical lasso), is to maximize the Gaussian log-likelihood with a penalty:

$$\hat\Omega = \operatorname{argmax}_{\Omega \succ 0} \left\{ \log\det(\Omega) - \operatorname{tr}(S\Omega) - \lambda \sum_{j,k} |\Omega_{jk}| \right\}.$$

The resulting estimator $\hat\Omega$ is sparse: many of its elements are 0. The non-zeroes denote edges. A fast implementation in R, due to Han Liu, can be found in the huge package.
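In code, a minimal sketch using the glasso package (the data and the penalty value below are placeholders; huge::huge(X, method = "glasso") is a faster alternative):

```r
library(glasso)

set.seed(1)
X <- matrix(rnorm(100 * 20), 100, 20)  # placeholder data: n = 100, d = 20
S <- cov(X)                            # sample covariance matrix
fit <- glasso(S, rho = 0.1)            # rho is the penalty parameter lambda
Omega_hat <- fit$wi                    # estimated (sparse) precision matrix
adj <- (Omega_hat != 0) & !diag(20)    # off-diagonal non-zeroes are the edges
```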
3. Relaxing Gaussianity
The biggest drawback of the glasso is the assumption of Gaussianity. With my colleagues John Lafferty and Han Liu and others, I have done some work on more nonparametric approaches.
Let me briefly describe the approach in Liu, Xu, Gu, Gupta, Lafferty and Wasserman (2011). The idea is to restrict the graph to be a forest, which is a graph with no cycles.
When $G$ is a forest, the density $p(x)$ can be written as

$$p(x) = \prod_{(j,k) \in E} \frac{p(x_j, x_k)}{p(x_j)\, p(x_k)} \prod_{j=1}^d p(x_j)$$

where $E$ is the set of edges. We can estimate the density by inserting estimates of the univariate and bivariate marginals:

$$\hat p(x) = \prod_{(j,k) \in E} \frac{\hat p(x_j, x_k)}{\hat p(x_j)\, \hat p(x_k)} \prod_{j=1}^d \hat p(x_j).$$

But to find the graph we need to estimate the edge set $E$.

Any forest $F$ defines a density $p_F$ by the above equation. If the true density $p$ were known, we could choose $F$ to minimize the Kullback-Leibler distance

$$D(p \,\|\, p_F) = \int p(x) \log \frac{p(x)}{p_F(x)}\, dx.$$
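To see why this turns into a mutual information problem, note that, using the true marginals in $p_F$,

$$D(p \,\|\, p_F) = \sum_{j=1}^d H(X_j) - H(X) - \sum_{(j,k) \in E_F} I(X_j; X_k),$$

where $H$ denotes entropy and $I$ mutual information. Only the last sum depends on the forest, so minimizing the divergence amounts to maximizing the total mutual information $\sum_{(j,k) \in E_F} I(X_j; X_k)$ over forests.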
This maximization can be done by Kruskal's algorithm (Kruskal 1956), also known, in this context, as the Chow-Liu algorithm (Chow and Liu 1968). It works like this. For each pair $(j,k)$, let $I(X_j; X_k)$ be the mutual information between $X_j$ and $X_k$. We start with an empty graph and add edges greedily according to the value of $I(X_j; X_k)$. First we connect the pair of variables with the largest $I(X_j; X_k)$, then the second largest, and so on. However, we do not add an edge if it forms a cycle.

Of course, we don't know $p$ so, instead, we use the data to estimate the mutual information $\hat I(X_j; X_k)$. If we simply run the Chow-Liu algorithm, we get a tree, that is, a fully connected forest. But we also use a hold-out sample to decide when to stop adding edges. And that's it.
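Here is a sketch of the greedy step. For brevity it estimates mutual information under Gaussianity, $\hat I = -\tfrac{1}{2}\log(1 - \hat r^2)$, whereas the paper uses nonparametric kernel estimates; the hold-out truncation is also left out (the n_edges argument stands in for it):

```r
# Sketch of the Chow-Liu / Kruskal greedy step: rank all pairs by estimated
# mutual information and add edges in that order, skipping any edge that
# would create a cycle (detected with union-find).
chow_liu <- function(X, n_edges = ncol(X) - 1) {
  d <- ncol(X)
  r <- cor(X)
  pairs <- which(upper.tri(r), arr.ind = TRUE)   # all pairs (j, k), j < k
  mi <- -0.5 * log(1 - r[pairs]^2)               # Gaussian mutual information
  ord <- order(mi, decreasing = TRUE)            # largest first
  parent <- 1:d                                  # union-find parent pointers
  find <- function(x) { while (parent[x] != x) x <- parent[x]; x }
  E <- NULL
  for (i in ord) {
    j <- pairs[i, 1]; k <- pairs[i, 2]
    rj <- find(j); rk <- find(k)
    if (rj != rk) {                              # (j, k) creates no cycle
      parent[rj] <- rk
      E <- rbind(E, c(j, k))
      if (nrow(E) == n_edges) break              # d - 1 edges gives a tree
    }
  }
  E                                              # one row per edge
}
```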
There are other ways to relax the Gaussian assumption. For example, one approach uses rank statistics and copulas; a rough sketch is below.
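One concrete rank-based variant (a known Gaussian-copula trick, not necessarily the exact method originally linked) replaces the sample correlation matrix with a transform of Spearman's rho and then runs the glasso:

```r
# Rank-based sketch: build a correlation matrix from Spearman's rho via the
# Gaussian-copula identity r = 2 sin(pi * rho_s / 6), then apply the glasso.
library(glasso)

rank_glasso <- function(X, rho = 0.1) {
  R <- 2 * sin(pi * cor(X, method = "spearman") / 6)  # copula correlation
  diag(R) <- 1                                        # exact unit diagonal
  glasso(R, rho = rho)$wi                             # sparse precision estimate
}
```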
4. The Future?
Not too long ago, estimating a high dimensional graph was unthinkable. Now such graphs are estimated routinely. The biggest thing missing is a good measure of uncertainty. One can do some obvious things, such as bootstrapping, but it is not clear that the output is meaningful.
5. References
Banerjee, O. and El Ghaoui, L. and d’Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research, 9, 485-516.
Chow, C. and Liu, C. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14, 462-467.
Kruskal, J.B. (1956). On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7, 48-50.
Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J. and Wasserman, L. (2011). Forest Density Estimation. Journal of Machine Learning Research, 12, 907-951.
Maslov, S. and Sneppen, K. (2002). Specificity and stability in topology of protein networks. Science, 296, 910.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.
4 Comments
I like the way you run this blog. There is always something to learn here that would take much more time to pick up elsewhere.
The thing I'm interested in here is that we may not believe in precise independence, with correlations being exactly zero. Testing and estimation are very different in this respect, because it is a very different statement to say "the observed data are consistent with a model with zero correlation" than to say "we estimate the correlation to be precisely zero" (i.e., that's our best guess).

In fact one wouldn't be interested in estimating the non-edges where we have precise independence (for large enough n that should give a complete graph) but rather those where dependence is weak (in a well-defined way, such as the absolute value of the correlation being smaller than epsilon), right?
Do you think that this should be taken into account somehow?
Yes indeed. I would prefer to have a confidence interval for the amount of conditional dependence between each pair. Then I would show a sequence of graphs with different degrees of dependence.
Larry
The stock data figure isn’t accompanied by a reference, so I don’t have any context for it. Was it originally pairwise correlations of asset prices?
It is log differences between successive days for S&P 500 stocks.
See:
http://www.e-publications.org/ims/submission/index.php/STS/user/submissionFile/10637?confirm=67dc1223
LW