## The Amazing Mean Shift Algorithm

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

### 1. The Algorithm

We are given data ${Y_1,\ldots, Y_n \in \mathbb{R}^d}$ which are a sample from a density ${p}$. Let $\displaystyle \hat p(y) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{||y - Y_i||}{h}\right)$

be a kernel density estimator with kernel ${K}$ and bandwidth ${h}$. Let ${g(y) = \nabla \hat p(y)}$ denote the gradient of ${\hat p(y)}$.
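As a concrete sketch, the estimator can be coded directly. (The post does not specify a kernel; a Gaussian profile is assumed here, and the function name is mine.)

```python
import numpy as np

def kde(y, data, h):
    """Kernel density estimate at a point y, with a Gaussian kernel.

    y    : (d,) query point
    data : (n, d) array of samples Y_1, ..., Y_n
    h    : bandwidth
    """
    n, d = data.shape
    # K(||y - Y_i|| / h) with a Gaussian profile
    dists = np.linalg.norm(y - data, axis=1)
    K = np.exp(-0.5 * (dists / h) ** 2) / (2 * np.pi) ** (d / 2)
    # average the kernels and apply the 1/h^d scaling
    return np.mean(K) / h ** d
```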

Suppose that ${\hat p}$ has modes ${m_1,\ldots, m_k}$. (Note that ${k}$ is not determined in advance; it is simply the number of modes in the density estimator.) An arbitrary point ${y\in\mathbb{R}^d}$ is assigned to mode ${m_j}$ if the gradient ascent curve through ${y}$ leads to ${m_j}$. More formally, the gradient ${g}$ defines a family of integral curves ${\pi}$. An integral curve ${\pi:\mathbb{R}\rightarrow\mathbb{R}^d}$ is defined by the differential equation $\displaystyle \pi'(t) = g(\pi(t)).$

In words, moving along ${\pi}$ corresponds to moving up in the direction of the gradient. The integral curves partition the space.

For each mode ${m_j}$, define the set ${A_j}$ to be the set of all ${y}$ such that the integral curve starting at ${y}$ leads to ${m_j}$. Intuitively, if we place a particle at ${y\in A_j}$ and let it flow up ${\hat p}$, it will end up at ${m_j}$. That is, if ${\pi}$ is the integral curve starting at ${y}$, then ${\lim_{t\rightarrow\infty} \pi(t) =m_j}$. The sets ${A_1,\ldots, A_k}$ form a partition. We can now partition the data according to which sets ${A_j}$ they fall into. In other words, we define clusters ${C_1,\ldots, C_k}$ where ${C_j = \{ Y_i:\ Y_i \in A_j\}}$.
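The integral curves can be approximated numerically. Here is a rough Euler-discretization sketch, assuming a Gaussian kernel so that the gradient of ${\hat p}$ has a closed form (the step size and function name are mine):

```python
import numpy as np

def gradient_ascent(y, data, h, step=0.05, n_steps=2000):
    """Euler discretization of the integral curve pi'(t) = g(pi(t)),
    where g is the gradient of a Gaussian-kernel KDE built from data."""
    y = np.asarray(y, dtype=float)
    for _ in range(n_steps):
        diff = data - y                                   # Y_i - y, shape (n, d)
        w = np.exp(-0.5 * (np.linalg.norm(diff, axis=1) / h) ** 2)
        grad = (w[:, None] * diff).sum(axis=0)            # proportional to grad of KDE
        y = y + step * grad / (len(data) * h ** 2)
    return y
```

Running this from a point ${y}$ traces out (approximately) the curve ${\pi}$ starting at ${y}$; the endpoint identifies the set ${A_j}$ that ${y}$ belongs to.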

To find the clusters, we only need to start at each ${Y_i}$, move ${Y_i}$ along its gradient ascent curve ${\pi}$ and see which mode it goes to. This is where the mean shift algorithm comes in.

The mean shift algorithm approximates the integral curves (i.e. the gradient ascent curves). Although it is usually applied to the data ${Y_i}$, it can be applied to any set of points. Suppose we pick some point ${w}$ and we want to find out which mode it belongs to. We set ${w_0=w}$ and then, for ${t=0,1,2,\ldots}$, set $\displaystyle w_{t+1} \longleftarrow \frac{\sum_{i=1}^n Y_i K\left(\frac{||w_t - Y_i||}{h}\right)} {\sum_{i=1}^n K\left(\frac{||w_t - Y_i||}{h}\right)}.$

In other words, we replace ${w_t}$ with a weighted average of the data around ${w_t}$ and repeat. The trajectory ${w_0, w_1, w_2,\ldots}$ converges to a mode ${m_j}$.
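The update is easy to implement. A minimal sketch (Gaussian kernel assumed; the function name and stopping rule are mine):

```python
import numpy as np

def mean_shift(w, data, h, tol=1e-6, max_iter=500):
    """Follow the mean shift trajectory from w to a mode of the KDE.

    w    : (d,) starting point
    data : (n, d) sample Y_1, ..., Y_n
    h    : bandwidth
    """
    w = np.asarray(w, dtype=float)
    for _ in range(max_iter):
        # weights K(||w_t - Y_i|| / h), Gaussian kernel
        weights = np.exp(-0.5 * (np.linalg.norm(w - data, axis=1) / h) ** 2)
        w_new = weights @ data / weights.sum()   # weighted average of the data
        if np.linalg.norm(w_new - w) < tol:
            break
        w = w_new
    return w
```

To cluster, one would run this from every ${Y_i}$ and group together points whose trajectories end within a small distance of the same mode.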

In practice, we run the algorithm, not on some arbitrary ${w}$ but on each data point ${Y_i}$. Thus, by running the algorithm you accomplish two things: you find the modes and you find which mode each data point ${Y_i}$ goes to. Hence, you find the mode-based clusters.

The algorithm can be slow, but the literature is filled with papers offering speed-ups and improvements.

Here is a simple example: the big black dots are data points, the blue crosses are the modes, and the red curves are the mean shift paths. Pretty, pretty cool.

### 2. Some Theory

Donoho and Liu (1991) showed that, if ${p}$ behaves in a locally quadratic fashion around the modes, then the minimax rate for estimating modes is ${n^{-\frac{1}{4+d}}}$ and that the rate is achieved by kernel density estimators. Thus, the above procedure has some theoretical foundation. (Their result is for ${d=1}$ but I am pretty sure it extends to ${d>1}$.)

More recently, Klemelä (2005) showed that it is possible to obtain minimax estimators of the mode that adapt to the local behavior of ${p}$ around the modes. He uses a method due to Lepski (1992) that is a fairly general way to do adaptive inference.

My colleagues (Chris Genovese, Marco Perone-Pacifico and Isa Verdinelli) and I have been working on some extensions to the mean shift algorithm. I’ll report on that when we finish the paper.

I’ll close with a question: why don’t we use this algorithm more in statistical machine learning?

### 3. References

Cheng, Y. (1995). Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 790-799.

Comaniciu, D. and Meer, P. (2002). Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603-619.

Donoho, D.L. and Liu, R.C. (1991). Geometrizing rates of convergence, II. The Annals of Statistics, 633-667.

Fukunaga, K. and Hostetler, L.D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21, 32-40.

Klemelä, J. (2005). Adaptive estimation of the mode of a multivariate density. Journal of Nonparametric Statistics, 17, 83-105.

Lepski, O.V. (1992). On problems of adaptive estimation in white Gaussian noise. Topics in Nonparametric Estimation, 12, 87-106.

—Larry Wasserman

1. eric
Posted July 20, 2012 at 11:49 pm

For Gaussian mixture models, there can be more modes in the density than the number of clusters. Donoho's assumption is that this won't happen, but can this assumption become problematic?

• normaldeviate
Posted July 21, 2012 at 9:00 am

Donoho does not assume a mixture model.
Just a density function with locally quadratic modes.

–LW

• Christian Hennig
Posted July 26, 2012 at 11:30 am

Eric: You assume implicitly that every Gaussian mixture component is a “cluster”. That’s legitimate but not unique. Identifying clusters with modes is another legitimate cluster concept that will give you different clusters.

• eric
Posted July 27, 2012 at 10:57 am

I understand now, thanks

2. gallamine
Posted July 23, 2012 at 12:21 pm

How is this method different from K-Means clustering?

• normaldeviate
Posted July 23, 2012 at 12:44 pm

k-means finds centers to minimize sums of squares.
This finds the modes of a density estimator.
Very different.

3. Jotaf
Posted July 25, 2012 at 12:58 pm

I’m not sure why it’s not used as much in ML. In my opinion, the bandwidth parameter is much more intuitive and easier to set (robust too?) than the number of clusters, which feels like it will depend too much on the data to know in advance. I suspect it’s because its theoretical analysis requires different tools than people are used to.

My favorite variant is Quick Shift, because it offers more insights into the structure of the data, and has a less hacky implementation. Instead of setting w[t+1] to the maximum of the local density estimate, it is set to the *data point* that maximizes this density. This means that w[t] is always a data point. It may not seem like much of a difference, but it has huge implications.

First, you only need to compute w[t+1] once for each data point, instead of going through a number of iterations. The result will be a forest graph, since each data point (node) has an arc going to the next data point in the density. This is interesting because it reveals a hierarchical structure inside each mode; I’m not sure if there’s any work on this. And a simple graph traversal will associate each data point to a mode (a tree root), while Mean Shift would require a final thresholding step (due to roundoff errors in the final w[t]).
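Here is a rough sketch of the linking rule Jotaf describes, in one common formulation of Quick Shift: each point links to its nearest neighbor of strictly higher estimated density, with an optional cap `tau` on the link length (the function name, `tau` parameter, and Gaussian kernel are my assumptions; implementations vary).

```python
import numpy as np

def quick_shift(data, h, tau=np.inf):
    """Link each data point to its nearest neighbor of higher density.

    Returns parent[i], the index of the data point Y_i links to;
    parent[i] == i marks a tree root (a local mode). tau caps link length.
    """
    n = data.shape[0]
    # pairwise distances and KDE at each data point (constants dropped)
    D = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    dens = np.exp(-0.5 * (D / h) ** 2).sum(axis=1)
    parent = np.arange(n)
    for i in range(n):
        higher = (dens > dens[i]) & (D[i] <= tau)
        if higher.any():
            candidates = np.where(higher)[0]
            parent[i] = candidates[np.argmin(D[i, candidates])]
    return parent
```

Each point is visited once, and the resulting parent array is the forest Jotaf mentions; following parents to the roots assigns points to modes.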

4. Tom Dietterich
Posted July 25, 2012 at 10:51 pm

Perhaps because kernel density estimates break down in high-dimensional spaces (or even in moderately high-dimensional spaces)? Most ML problems live in very high-dimensional spaces. So while mean shift is great for images in 2-d or 3-d, it doesn't help me much in 127-d SIFT space.

5. Peyman
Posted July 28, 2012 at 1:24 am

Just a comment that might be of interest. Looking at mean shift from a different (regression) point of view, it is essentially a (discretized) diffusion equation. With a model y = z + error, say we wish to estimate z. Consider a Gram matrix K constructed from the kernel k. The locally constant kernel smoother for z is
z_hat = W y, where W = D^(-1) K, with D the diagonal matrix that normalizes the rows of K to sum to 1. Then what is known in the vision and image processing community as the diffusion iteration is z(k+1) = W z(k), which coincides with the definition of mean shift. I've written about this regression/filtering interpretation in the paper below. Page 17 in particular is most relevant.

http://users.soe.ucsc.edu/~milanfar/publications/journal/ModernTour_FinalSubmission.pdf
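A minimal sketch of the diffusion step Peyman describes (Gaussian kernel and function name are my assumptions):

```python
import numpy as np

def diffusion_step(z, x, h):
    """One diffusion iteration z <- W z, with W = D^(-1) K row-normalized.

    x : (n, d) locations used to build the Gram matrix K
    z : (n, d) current signal values
    h : kernel bandwidth
    """
    D2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-0.5 * D2 / h ** 2)        # Gram matrix from the kernel
    W = K / K.sum(axis=1, keepdims=True)  # row-stochastic smoother
    return W @ z
```

Iterating with x held fixed gives the diffusion; if one instead updates the points themselves (z and x being the same array), each row of W @ z is exactly the mean shift weighted average.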

• normaldeviate
Posted July 28, 2012 at 8:50 am

Thanks for the reference
—LW

6. Arun Iyer
Posted July 28, 2012 at 12:16 pm

I am coming across this for the first time. However, the behaviour looks oddly similar to the Chinese restaurant process. Is there any connection?

• normaldeviate
Posted July 28, 2012 at 12:20 pm

Not really, (except in the vague sense that
the Chinese restaurant process is a way
of doing nonparametric Bayesian density estimation).
—LW

• tengantsuu
Posted July 28, 2012 at 3:33 pm

Thanks for the reply.

It just felt that there might be some connection, where the bandwidth parameter is some analog of the concentration parameter in the CRP. Also, it seemed that both behaved similarly, i.e. points being assigned to clusters based on the number of points they already have. I might just be reading too much into it. Thanks again.