## Mixture Models: The Twilight Zone of Statistics

Larry Wasserman

Mixture models come up all the time and they are obviously very useful. Yet they are strange beasts.

1. The Gaussian Mixture

One of the simplest mixture models is a finite mixture of Gaussians:

$\displaystyle p(x;\psi) = \sum_{j=1}^k w_j\, \phi(x; \mu_j,\Sigma_j).$

Here, ${\phi(x; \mu_j,\Sigma_j)}$ denotes a Gaussian density with mean vector ${\mu_j}$ and covariance matrix ${\Sigma_j}$. The weights ${w_1,\ldots,w_k}$ are non-negative and sum to 1. The entire list of parameters is

$\displaystyle \psi = (\mu_1,\ldots, \mu_k,\Sigma_1,\ldots,\Sigma_k,w_1, \ldots, w_k).$

One can also consider ${k}$, the number of components, to be another parameter.
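To make the notation concrete, here is a minimal numpy sketch of the univariate version of this model, with made-up weights, means, and standard deviations; sampling uses the usual two-stage scheme (draw a component label, then draw from that component):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Univariate Gaussian density phi(x; mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, weights, mus, sigmas):
    """p(x; psi) = sum_j w_j phi(x; mu_j, sigma_j^2)."""
    return sum(w * gauss_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas))

def sample_mixture(n, weights, mus, sigmas, rng):
    """Two-stage sampling: draw a label Z ~ weights, then X | Z = j ~ N(mu_j, sigma_j^2)."""
    z = rng.choice(len(weights), size=n, p=weights)
    return rng.normal(np.asarray(mus)[z], np.asarray(sigmas)[z])

rng = np.random.default_rng(0)
w, mu, sd = [0.3, 0.7], [-2.0, 3.0], [1.0, 0.5]  # illustrative parameters
x = sample_mixture(500, w, mu, sd, rng)
```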

2. The Weird Things That Happen With Mixtures

Now let’s consider some of the weird things that can happen.

Infinite Likelihood. The likelihood function (for the Gaussian mixture) is infinite at some points in the parameter space. This is not necessarily deadly; the infinities are at the boundary and you can use the largest (finite) maximum in the interior as an estimator. But the infinities can cause numerical problems.
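Here is a quick numeric sketch of where the infinities come from (a small made-up data set): center one component exactly on a data point and let its standard deviation shrink; the log-likelihood grows without bound along that path.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

data = np.array([0.0, 1.0, 2.0, -1.0, 0.5])  # any small data set will do

def log_lik(sigma1):
    # Component 1 sits exactly on data[0] with sd sigma1; component 2 is fixed at N(0, 1).
    p = 0.5 * gauss_pdf(data, data[0], sigma1) + 0.5 * gauss_pdf(data, 0.0, 1.0)
    return float(np.sum(np.log(p)))

# The log-likelihood diverges as sigma1 -> 0: the spike term contributes
# log(1/sigma1), while the other terms stay bounded below.
vals = [log_lik(s) for s in (0.1, 0.01, 0.001)]
```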

Multimodality of the Likelihood. In fact, the likelihood has many modes. Finding the global (but not infinite) mode is a nightmare. The EM algorithm finds only local modes. In a sense, the MLE is not really a well-defined estimator because we can’t actually find it. In machine learning, there have been a number of papers on estimators for mixture models that can be computed in polynomial time; for example, see Kalai, Moitra and Valiant (2012).
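To make the EM behavior concrete, here is a hand-rolled EM sketch for a toy two-component mixture (simulated data; the weights and variances are fixed at their true values to keep it short). The only guarantee EM gives is that each iteration does not decrease the likelihood; different starting values can settle in different local modes.

```python
import numpy as np

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(2)
# Simulated data: equal-weight mixture of N(-3, 1) and N(3, 1).
data = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])

def em(mu_init, n_iter=50):
    """EM for a 0.5/0.5 mixture of N(mu1, 1) and N(mu2, 1); only the means are unknown."""
    mu = np.array(mu_init, dtype=float)
    logliks = []
    for _ in range(n_iter):
        p1 = 0.5 * gauss_pdf(data, mu[0])       # E-step: responsibilities
        p2 = 0.5 * gauss_pdf(data, mu[1])
        r1 = p1 / (p1 + p2)
        logliks.append(float(np.sum(np.log(p1 + p2))))
        mu[0] = np.sum(r1 * data) / np.sum(r1)  # M-step: weighted means
        mu[1] = np.sum((1 - r1) * data) / np.sum(1 - r1)
    return mu, logliks

# Run from several starting values; the likelihood never decreases along
# each run, but different starts need not reach the same mode.
results = {start: em(start) for start in [(-1.0, 1.0), (0.0, 5.0), (8.0, 9.0)]}
```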

Multimodality of the Density. You might think that a mixture of ${k}$ Gaussians would have ${k}$ modes. But, in fact, it can have fewer than ${k}$ or more than ${k}$. See Carreira-Perpiñán and Williams (2003) and Edelsbrunner, Fasy and Rote (2012).
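The "fewer than ${k}$" direction is easy to see in one dimension (the "more than ${k}$" phenomenon needs higher dimensions, as in the cited papers). A small grid-based sketch with made-up parameters: two well-separated components give two modes, while two overlapping components merge into a single mode.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def count_modes(weights, mus, sigmas, grid):
    """Count local maxima of the mixture density on a fine grid."""
    dens = sum(w * gauss_pdf(grid, m, s) for w, m, s in zip(weights, mus, sigmas))
    d = np.diff(dens)
    # A mode shows up as a sign change of the discrete derivative from + to -.
    return int(np.sum((d[:-1] > 0) & (d[1:] <= 0)))

grid = np.linspace(-8.0, 8.0, 4001)
two_modes = count_modes([0.5, 0.5], [-3.0, 3.0], [1.0, 1.0], grid)  # well separated
one_mode = count_modes([0.5, 0.5], [-0.5, 0.5], [1.0, 1.0], grid)   # k = 2, one mode
```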

Nonidentifiability. A model ${\{ p(x;\theta):\ \theta\in\Theta\}}$ is identifiable if

$\displaystyle \theta_1 \neq \theta_2\ \ \ {\rm implies}\ \ \ p(x;\theta_1) \neq p(x;\theta_2). \ \ \ \ \ (1)$

Mixture models are nonidentifiable in two different ways. First, there is nonidentifiability due to permutation of labels. This is a nuisance but not a big deal. A bigger issue is local nonidentifiability. Suppose that

$\displaystyle p(x; \eta,\mu_1,\mu_2) = (1-\eta)\phi(x;\mu_1,1) + \eta\phi(x;\mu_2,1).$

When ${\mu_1=\mu_2=\mu}$, we have that ${p(x; \eta,\mu_1,\mu_2) = \phi(x;\mu,1)}$. The parameter ${\eta}$ has disappeared. Similarly, when ${\eta=1}$, the parameter ${\mu_1}$ disappears. This means that there are subspaces of the parameter space where the family is not identifiable. The result is that all the usual theory about the distribution of the MLE, the distribution of the likelihood ratio statistic, the properties of BIC, etc. becomes very, very complicated.
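The collapse of ${\eta}$ when ${\mu_1=\mu_2}$ can be checked numerically (illustrative values): wildly different values of ${\eta}$ give exactly the same density.

```python
import numpy as np

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def p(x, eta, mu1, mu2):
    return (1 - eta) * gauss_pdf(x, mu1) + eta * gauss_pdf(x, mu2)

x = np.linspace(-5.0, 5.0, 101)
# With mu1 = mu2 = 1, eta drops out: every eta gives the same density phi(x; 1, 1).
d_a = p(x, 0.2, 1.0, 1.0)
d_b = p(x, 0.9, 1.0, 1.0)
```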

Irregularity. Mixture models do not satisfy the usual regularity conditions that make parametric models so easy to deal with. Consider the following example from Chen (1995). Let

$\displaystyle p(x;\theta) = \frac{2}{3} \phi(x;-\theta,1) + \frac{1}{3} \phi(x;2\theta,1).$

Then ${I(0) =0}$ where ${I(\theta)}$ is the Fisher information. Moreover, no estimator of ${\theta}$ can converge faster than ${n^{-1/4}}$. Compare this to a Normal family ${\phi(x;\theta,1)}$, where the Fisher information from a sample of size ${n}$ is ${I(\theta)=n}$ and the maximum likelihood estimator converges at rate ${n^{-1/2}}$.
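The vanishing Fisher information can be seen numerically: at ${\theta=0}$ the score ${\partial \log p(x;\theta)/\partial\theta}$ is (up to finite-difference error) zero for every ${x}$, while it is clearly nonzero at, say, ${\theta=1}$. A sketch on an illustrative grid:

```python
import numpy as np

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

def p(x, theta):
    # Chen's (1995) example: (2/3) phi(x; -theta, 1) + (1/3) phi(x; 2 theta, 1).
    return (2.0 / 3.0) * gauss_pdf(x, -theta) + (1.0 / 3.0) * gauss_pdf(x, 2.0 * theta)

def score(x, theta, h=1e-5):
    # Central-difference approximation to d/dtheta log p(x; theta).
    return (np.log(p(x, theta + h)) - np.log(p(x, theta - h))) / (2 * h)

x = np.linspace(-4.0, 4.0, 81)
s0 = score(x, 0.0)   # identically ~0, so I(0) = E[score^2] = 0
s1 = score(x, 1.0)   # clearly nonzero away from theta = 0
```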

Nonintuitive Group Membership. Mixtures are often used for finding clusters. Suppose that

$\displaystyle p(x) = (1-\eta)\phi(x;\mu_1,\sigma^{2}_1) + \eta\phi(x;\mu_2,\sigma^{2}_2)$

with ${\mu_1 < \mu_2}$. Let ${Z=1,2}$ denote the two components. We can compute ${P(Z=1|X=x)}$ and ${P(Z=2|X=x)}$ explicitly. We can then assign an ${x}$ to the first component if ${P(Z=1|X=x) > P(Z=2|X=x)}$. It is easy to check that, with certain choices of ${\sigma_1,\sigma_2}$, all large values of ${x}$ get assigned to component 1 (i.e. the leftmost component). Technically this is correct, yet it seems to be an unintended consequence of the model.
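A numeric sketch with made-up parameters: a wide left component and a narrow right component. Points near the right mean go to component 2, but very large ${x}$ are assigned to the left component, because the wide component's tail eventually dominates.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Made-up parameters: wide left component, narrow right component.
eta, mu1, s1, mu2, s2 = 0.5, 0.0, 3.0, 2.0, 0.5

def post_z1(x):
    """P(Z = 1 | X = x) by Bayes' rule."""
    p1 = (1 - eta) * gauss_pdf(x, mu1, s1)
    p2 = eta * gauss_pdf(x, mu2, s2)
    return p1 / (p1 + p2)

near = post_z1(2.0)   # near mu2: the narrow right component wins
far = post_z1(10.0)   # far to the right: the *left* (wide) component wins
```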

Improper Posteriors. Suppose we have a sample from the simple mixture

$\displaystyle p(x;\mu) = \frac{1}{2} \phi(x;0,1) + \frac{1}{2} \phi(x;\mu,1).$

Then any improper prior on ${\mu}$ yields an improper posterior for ${\mu}$, no matter how large the sample size is. Also, in Wasserman (2000) I showed that the only priors that yield posteriors in close agreement with frequentist methods are data-dependent priors.
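For the flat-prior case, the impropriety can be read off from the likelihood itself (illustrative data): as ${\mu \rightarrow \infty}$ the likelihood does not vanish but flattens out at ${\prod_i \frac{1}{2}\phi(x_i;0,1) > 0}$, so the posterior under a flat prior has infinite total mass.

```python
import numpy as np

def gauss_pdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)

data = np.array([-0.5, 0.3, 1.2, -1.0, 0.1, 2.0])  # illustrative sample

def log_lik(mu):
    return float(np.sum(np.log(0.5 * gauss_pdf(data, 0.0) + 0.5 * gauss_pdf(data, mu))))

# As mu grows, the log-likelihood flattens at the positive "floor"
# sum_i log((1/2) phi(x_i; 0, 1)), so it is not integrable in mu.
tail = [log_lik(m) for m in (10.0, 100.0, 1000.0)]
floor = float(np.sum(np.log(0.5 * gauss_pdf(data, 0.0))))
```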

3. So What?

So what should we make of all of this? I find it interesting that such a simple model can have such complicated behavior. I wonder if many people use mixture models without realizing all the potential complications.

I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.

4. References

Carreira-Perpiñán, M.A. and Williams, C.K.I. (2003). On the number of modes of a Gaussian mixture. Scale Space Methods in Computer Vision, 625–640.

Chen, J. (1995). Optimal rate of convergence for finite mixture models. The Annals of Statistics, 23, 221–233.

Edelsbrunner, H., Fasy, B.T. and Rote, G. (2012). Add Isotropic Gaussian Kernels at Own Risk: More and More Resilient Modes in Higher Dimensions. ACM Symposium on Computational Geometry (SoCG 2012).

Kalai, A., Moitra, A. and Valiant, G. (2012). Disentangling Gaussians. Communications of the ACM, 55, 113–120.

Wasserman, L. (2000). Asymptotic inference for mixture models by using data-dependent priors. Journal of the Royal Statistical Society: Series B, 62, 159–180.

1. Pedro Infante

I disagree about a couple of things, especially about “I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs”. Perhaps you have not tried the good ones. Try

- Hornitos añejo (aged)
- Tres Generaciones añejo
- Don Julio añejo

Now, what if we have a data set presenting multimodality? There are many real examples where the sample is known to contain observations from different populations, but it is impossible to identify them for design reasons. We do not have many options for meaningfully modelling these features.

Also, not being able to use an improper prior is not the end of the world.

Perhaps what is not recommended is using them for empirical modelling.

• Posted August 4, 2012 at 1:06 pm

What can I say? How about a nonparametric density estimator?
—LW

• Pedro Infante
Posted August 4, 2012 at 1:42 pm

How about “The Tyranny of Tuning Parameters: 2. Density Estimation: A Failure”?

Joking aside, I think that when there is clear information about the number of populations in the sample (as in flow cytometry), the use of mixtures is helpful and meaningful, because this is reliable information that we should not sweep under the carpet.

• martin azizyan
Posted August 5, 2012 at 8:25 am

> how about a nonparametric density estimator?

Doesn’t that hurt interpretability compared to, say, a Gaussian mixture?

It seems to me that being given a mixture of Gaussians, and told that “this is THE right number of components, and the best local maximum of the likelihood function”, would carry more information than simply being given a nonparametric density estimate, even assuming its tuning parameters are also chosen to be perfect by some deity.

Of course I’m sure there are many nonparametric methods to get what the practitioner might hope to learn from the gaussian mixture, as well…

• Marcus
Posted August 9, 2012 at 3:59 am

Isn’t nonparametric density estimation using kernels a mixture model with the exact same multimodality problems? Actually, in density estimation the multimodality is maximal compared to mixture models, where the number of components can be less than n for a smoother model.

• Posted August 9, 2012 at 8:56 am

Not really. The estimator is consistent and will (asymptotically) have the correct number of modes (there are fewer modes than the number of components of the density estimator).
—LW

2. Corey

I read Wasserman (2000) for entertainment on the bus ride home a few years ago. Around that time my then-boss had me looking into matching priors. If I recall correctly, the data-dependent prior in Wasserman (2000) depends only on n, whereas other forms of data-dependent priors depend on functions of the data values. It seems to me that data-dependence through n is somehow weaker than data-dependence through observed data values, since (assuming fixed sample size as usual) it is possible to know what prior will be used in advance of data collection.

• Posted August 4, 2012 at 1:07 pm

No, it depends on the actual data, not just the sample size.
—LW

• Corey
Posted August 4, 2012 at 5:53 pm

Then I must be thinking of some other data-dependent prior I encountered around that time. Hmm…

Let me test my recollection of the take-home message of the paper. I think it was: “For mixture models, the Jeffreys prior (usually a first-order probability matching prior) results in a posterior that asymptotes to a positive constant (hence improper). But if one simply subtracts the positive constant from the posterior (yielding a proper distribution), this corresponds to using a data-dependent probability matching prior (I forget the order; I’m guessing first order?).”

How close was that?

• Posted August 4, 2012 at 5:59 pm

I believe that is correct
(but it has been a long time)
LW

3. oz

Singular learning theory (see e.g. works by Watanabe, Drton, Sturmfels) uses quite deep math. (algebraic geometry) in trying to deal with some of the problems mentioned (BIC, non-identifiability etc.)

4. BEAST of MARS

Dear stats master:

In “infinite likelihood”, can you clarify “This is not necessarily deadly; the infinities are at the boundary and you can use the largest (finite) maximum in the interior as an estimator”?

In general, the boundaries can be approached with a sequence of interior points, meaning sup likelihood over interior = infinity. But I don’t think you are suggesting enumeration of all (actual) local maxima and choosing the biggest one..!

And isn’t this property, more than anything, just a criticism of maximum likelihood? It seems that, in the mixture setting, it is a nonsensical optimization problem.

Excellent blog btw. I just wish it used more pink.

• Posted August 5, 2012 at 8:52 am

I just mean there are finite local maxima in the interior.
The largest of these has the usual properties.

I don’t see why this is a criticism of maximum likelihood
estimators. The MLE is just an estimator; there is no principle
that says “always use the MLE.”
If the MLE has good behavior, we use it.
If it doesn’t, we don’t.

Purple, not pink!

5. Christian Hennig

Hmmmm. I’m aware of all of these and still use mixtures a lot, mainly for cluster analysis. It’s true that you shouldn’t use the solution from your software uncritically because sometimes it’s no good but often enough it’s better than any other clustering method I know, and I know a few.

Let’s go through the points. Infinite likelihood and multimodality of the likelihood, yeah, these are annoying. Still, often you can find a pretty good local optimum, and try to beat the clustering with another method! (Of course you need to know how to measure what’s good without using the model assumption.)
Multimodality of the density – well, first you need to decide whether a “cluster” for you is associated with a Gaussian component or with a density mode. It’s application dependent and both make sense sometimes. If it’s the latter, you shouldn’t have started fitting a mixture model in the first place. If it’s the former, why bother?
Nonidentifiability – as long as it’s only the problems that you mention, they won’t usually hurt unless you’re a Bayesian (although for some mixtures there are worse), and neither will the improper prior problem. Well, to be honest, they may hurt a bit but by far not as much as the problems with the likelihood.
The nonintuitive group memberships are not nonintuitive to me, and how strong an argument is this kind of intuition anyway?

Here is another one. The BIC is believed to be a consistent estimator of the number of components (to my knowledge this is proved only for a quite restricted case, by Keribin, 2000). However, the truth is never precisely a Gaussian mixture, and as n goes to infinity it is in all likelihood best approximated by a bigger and bigger mixture. So consistency of the BIC implies that the estimated k diverges to infinity with n, and if you want to do clustering that’s bad, because you would rather have a small number of not-exactly-Gaussian clusters. So consistency here is a bad thing, and the BIC is good, if at all, only for small to moderate n, never for large n, *because* it’s consistent.