Mixture Models: The Twilight Zone of Statistics
Mixture models come up all the time and they are obviously very useful. Yet they are strange beasts.
1. The Gaussian Mixture
One of the simplest mixture models is a finite mixture of Gaussians:

$$p(x) = \sum_{j=1}^{k} w_j\, \phi(x; \mu_j, \Sigma_j).$$

Here, $\phi(x; \mu_j, \Sigma_j)$ denotes a Gaussian density with mean vector $\mu_j$ and covariance matrix $\Sigma_j$. The weights $w_1, \ldots, w_k$ are non-negative and sum to 1. The entire list of parameters is

$$\theta = (w_1, \ldots, w_k, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k).$$
One can also consider $k$, the number of components, to be another parameter.
2. The Weird Things That Happen With Mixtures
Now let's consider some of the weird things that can happen.
Infinite Likelihood. The likelihood function (for the Gaussian mixture) is infinite at some points in the parameter space. This is not necessarily deadly; the infinities are at the boundary and you can use the largest (finite) maximum in the interior as an estimator. But the infinities can cause numerical problems.
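To see the infinities concretely, fix one component's mean at a data point and let that component's standard deviation shrink: the likelihood climbs without bound. A minimal sketch in plain Python (the helper names and the toy data are mine):

```python
import math

def phi(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def log_likelihood(data, w, mu1, s1, mu2, s2):
    """Log-likelihood of a two-component Gaussian mixture."""
    return sum(math.log(w * phi(x, mu1, s1) + (1 - w) * phi(x, mu2, s2))
               for x in data)

data = [-1.2, 0.4, 0.9, 2.1, 3.3]

# Put one component's mean exactly on the first data point and shrink its
# standard deviation: the log-likelihood diverges to +infinity, even though
# the "fit" is statistically meaningless.
for s in (1.0, 0.1, 0.01, 0.001):
    print(s, log_likelihood(data, 0.5, data[0], s, 0.0, 1.0))
```

The divergence is driven entirely by the single term at the chosen data point; the other points are kept afloat by the second component, which is exactly why the infinities sit on the boundary of the parameter space rather than in the interior.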
Multimodality of the Likelihood. In fact, the likelihood has many modes. Finding the global (but not infinite) mode is a nightmare. The EM algorithm only finds local modes. In a sense, the MLE is not really a well-defined estimator because we can't really find it. In machine learning, there have been a number of papers trying to find estimators for mixture models that can be computed in polynomial time. For example, see Kalai, Moitra and Valiant (2012).
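The fact that EM only finds local stationary points is easy to exhibit in a toy example. Below is a bare-bones EM iteration for a 50/50 mixture of two unit-variance Gaussians where only the means are estimated (a sketch; the function names, the fixed weights and variances, and the simulated data are all my choices). Starting both means at the same value is a fixed point of the iteration: the responsibilities are all 1/2 and both means collapse onto the overall sample mean, while a reasonable start recovers the two clusters.

```python
import math, random

def phi(x, mu):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em(data, mu1, mu2, iters=200):
    """EM for (1/2) N(mu1, 1) + (1/2) N(mu2, 1); only the means are updated."""
    for _ in range(iters):
        # E-step: responsibility of component 1 for each point
        r = [phi(x, mu1) / (phi(x, mu1) + phi(x, mu2)) for x in data]
        # M-step: responsibility-weighted means
        mu1 = sum(ri * x for ri, x in zip(r, data)) / sum(r)
        mu2 = sum((1 - ri) * x for ri, x in zip(r, data)) / sum(1 - ri for ri in r)
    return mu1, mu2

random.seed(0)
# Simulated data from a mixture with true means -3 and 3
data = ([random.gauss(-3, 1) for _ in range(100)] +
        [random.gauss(3, 1) for _ in range(100)])

print(em(data, -1.0, 1.0))  # a sensible start: recovers the two clusters
print(em(data, 0.0, 0.0))   # a degenerate start: both means stick at the sample mean
```

Running from many random starts and keeping the best likelihood is the usual workaround, but nothing guarantees that the best local mode found is the global one.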
Multimodality of the Density. You might think that a mixture of $k$ Gaussians would have $k$ modes. But, in fact, it can have fewer than $k$ or more than $k$. See Carreira-Perpinan and Williams (2003) and Edelsbrunner, Fasy and Rote (2012).
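The "fewer than $k$" case is easy to verify numerically in one dimension: two equal-weight, unit-variance components merge into a single mode once their means are close enough (within $2\sigma$ of each other). A grid-based mode count, as a sketch (the helper names and grid settings are mine):

```python
import math

def mixture(x, mus, sigma=1.0):
    """Equal-weight mixture of Gaussians with common scale sigma, centered at mus."""
    return sum(math.exp(-0.5 * ((x - m) / sigma) ** 2) for m in mus) \
        / (len(mus) * sigma * math.sqrt(2 * math.pi))

def count_modes(mus, lo=-10.0, hi=10.0, n=4001):
    """Count strict local maxima of the mixture density on a fine grid."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [mixture(x, mus) for x in xs]
    return sum(1 for i in range(1, n - 1) if ys[i - 1] < ys[i] > ys[i + 1])

print(count_modes([-3.0, 3.0]))  # well separated: two modes
print(count_modes([-0.5, 0.5]))  # close together: one mode, fewer than k
```

The "more than $k$" behavior cannot be produced this way: in one dimension a mixture of $k$ Gaussians has at most $k$ modes, and the surplus-mode examples of Edelsbrunner, Fasy and Rote (2012) live in higher dimensions.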
Nonidentifiability. A model $\{p(x; \theta):\ \theta \in \Theta\}$ is identifiable if

$$\theta_1 \neq \theta_2 \ \implies\ P_{\theta_1} \neq P_{\theta_2},$$

where $P_\theta$ is the distribution corresponding to the density $p(\cdot\,; \theta)$.
Mixture models are nonidentifiable in two different ways. First, there is nonidentifiability due to permutation of the labels. This is a nuisance but not a big deal. A bigger issue is local nonidentifiability. Suppose that

$$p(x; \eta, \mu) = (1 - \eta)\,\phi(x; 0, 1) + \eta\,\phi(x; \mu, 1).$$

When $\mu = 0$, we have that $p(x; \eta, \mu) = \phi(x; 0, 1)$. The parameter $\eta$ has disappeared. Similarly, when $\eta = 0$, the parameter $\mu$ disappears. This means that there are subspaces of the parameter space where the family is not identifiable. The result is that all the usual theory about the distribution of the MLE, the distribution of the likelihood ratio statistic, the properties of BIC etc. becomes very, very complicated.
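The local nonidentifiability is trivial to check numerically: for the two-component family $(1-\eta)\,\phi(x;0,1) + \eta\,\phi(x;\mu,1)$, the density does not depend on $\eta$ when $\mu = 0$, and does not depend on $\mu$ when $\eta = 0$. A small sketch in plain Python (helper names are mine):

```python
import math

def phi(x, mu):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def density(x, eta, mu):
    """p(x; eta, mu) = (1 - eta) phi(x; 0, 1) + eta phi(x; mu, 1)."""
    return (1 - eta) * phi(x, 0.0) + eta * phi(x, mu)

x = 1.7
# When mu = 0, the density is the same for every eta: eta has disappeared.
print([density(x, eta, 0.0) for eta in (0.0, 0.3, 0.9)])
# When eta = 0, the density is the same for every mu: mu has disappeared.
print([density(x, 0.0, mu) for mu in (-2.0, 0.0, 5.0)])
```

No amount of data can distinguish parameter values along these subspaces, which is what breaks the standard asymptotic theory.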
Irregularity. Mixture models do not satisfy the usual regularity conditions that make parametric models so easy to deal with. Consider the following example from Chen (1995). Let

$$p(x; \theta) = \frac{2}{3}\,\phi(x; -\theta, 1) + \frac{1}{3}\,\phi(x; 2\theta, 1).$$

(Note that the mean of this density is $\frac{2}{3}(-\theta) + \frac{1}{3}(2\theta) = 0$ for every $\theta$, so the first-order information about $\theta$ cancels.) Then $I(0) = 0$ where $I(\theta)$ is the Fisher information. Moreover, no estimator of $\theta$ can converge faster than $n^{-1/4}$. Compare this to a Normal family $\phi(x; \theta, 1)$ where the Fisher information is $I(\theta) = 1$ and the maximum likelihood estimator converges at rate $n^{-1/2}$.
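A quick numerical check of the degeneracy, using the density $p(x;\theta) = \frac{2}{3}\phi(x;-\theta,1) + \frac{1}{3}\phi(x;2\theta,1)$: the score $\partial_\theta \log p(x;\theta)$ vanishes at $\theta = 0$ for every $x$, which forces $I(0) = 0$. This is a finite-difference sketch, not a proof (the function names and step size are mine):

```python
import math

def phi(x, mu):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def p(x, theta):
    """Chen's irregular mixture; its mean is 0 for every theta."""
    return (2 / 3) * phi(x, -theta) + (1 / 3) * phi(x, 2 * theta)

def score(x, theta, h=1e-5):
    """Central-difference approximation to d/dtheta log p(x; theta)."""
    return (math.log(p(x, theta + h)) - math.log(p(x, theta - h))) / (2 * h)

# The score at theta = 0 is (numerically) zero for every x, so I(0) = 0:
# no information about theta to first order.
for x in (-2.0, 0.0, 1.0, 3.5):
    print(x, score(x, 0.0))
```

With a zero score at every data point, the usual $n^{-1/2}$ asymptotics have nothing to work with, which is why estimation slows to the $n^{-1/4}$ rate.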
Nonintuitive Group Membership. Mixtures are often used for finding clusters. Suppose that

$$p(x) = \frac{1}{2}\,\phi(x; 0, 1) + \frac{1}{2}\,\phi(x; \mu, \sigma^2)$$

with $\mu > 0$. Let $Z \in \{1, 2\}$ denote the two components. We can compute $P(Z = 1 \mid X = x)$ and $P(Z = 2 \mid X = x)$ explicitly. We can then assign an $x$ to the first component if $P(Z = 1 \mid X = x) > 1/2$. It is easy to check that, with certain choices of $\mu$ and $\sigma$ (for example, $\sigma < 1$, so that the rightmost component has the thinner tail), all large values of $x$ get assigned to component 1 (i.e. the leftmost component). Technically this is correct, yet it seems to be an unintended consequence of the model.
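To check the tail behavior, take for instance $\mu = 3$ and $\sigma = 1/2$ (my choice of constants, purely for illustration) and compute $P(Z = 1 \mid X = x)$ by Bayes' rule:

```python
import math

def phi(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def prob_component_1(x, mu=3.0, sigma=0.5):
    """P(Z = 1 | X = x) for the mixture (1/2) N(0, 1) + (1/2) N(mu, sigma^2)."""
    a = 0.5 * phi(x, 0.0, 1.0)    # component 1: leftmost, unit variance
    b = 0.5 * phi(x, mu, sigma)   # component 2: rightmost, smaller variance
    return a / (a + b)

# Near mu the rightmost component wins, but far out in the right tail the
# wider leftmost component takes over again.
for x in (0.0, 3.0, 6.0, 10.0):
    print(x, prob_component_1(x))
```

With these constants the crossover happens a little above $x = 6$: beyond that point the thin tail of the rightmost component loses to the unit-variance component, so the largest observations get "clustered" with the leftmost group.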
Improper Posteriors. Suppose we have a sample from the simple mixture

$$p(x; \mu) = \frac{1}{2}\,\phi(x; 0, 1) + \frac{1}{2}\,\phi(x; \mu, 1).$$

Then any improper prior on $\mu$ yields an improper posterior for $\mu$, regardless of how large the sample size is. Also, in Wasserman (2000) I showed that the only priors that yield posteriors in close agreement to frequentist methods are data-dependent priors.
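The mechanism is that each likelihood term is bounded below by $\frac{1}{2}\phi(x_i; 0, 1) > 0$, so the likelihood does not vanish as $\mu \to \pm\infty$; integrating a flat prior against a function with a positive limit gives infinite mass. A numerical sketch (stdlib only; the simulated data and names are mine):

```python
import math, random

def phi(x, mu):
    """Unit-variance Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def log_lik(mu, data):
    """Log-likelihood for the mixture (1/2) N(0, 1) + (1/2) N(mu, 1)."""
    return sum(math.log(0.5 * phi(x, 0.0) + 0.5 * phi(x, mu)) for x in data)

random.seed(1)
data = [random.gauss(0, 1) for _ in range(50)]  # a sample (here from mu = 0)

# The floor sum_i log((1/2) phi(x_i; 0, 1)) keeps the log-likelihood from
# going to -infinity: as mu -> infinity it flattens out at a finite constant.
floor = sum(math.log(0.5 * phi(x, 0.0)) for x in data)
for mu in (0.0, 10.0, 100.0, 1000.0):
    print(mu, log_lik(mu, data))
print("limit:", floor)
```

Because the likelihood tends to a positive constant rather than to zero, a flat (or any improper) prior on $\mu$ cannot be normalized after multiplying by it, no matter how many observations are collected.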
3. So What?
So what should we make of all of this? I find it interesting that such a simple model can have such complicated behavior. I wonder if many people use mixture models without realizing all the potential complications.
I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.
Carreira-Perpinan, M. A. and Williams, C. K. I. (2003). On the number of modes of a Gaussian mixture. Scale Space Methods in Computer Vision, 625–640.
Chen, J. (1995). Optimal rate of convergence for finite mixture models. The Annals of Statistics, 23, 221–233.
Edelsbrunner, H., Fasy, B. T. and Rote, G. (2012). Add Isotropic Gaussian Kernels at Own Risk: More and More Resilient Modes in Higher Dimensions. ACM Symposium on Computational Geometry (SoCG 2012).
Kalai, A., Moitra, A. and Valiant, G. (2012). Disentangling Gaussians. Communications of the ACM, 55, 113–120.
Wasserman, L. (2000). Asymptotic inference for mixture models by using data-dependent priors. Journal of the Royal Statistical Society: Series B, 62, 159–180.