The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.
The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.
1. The Algorithm
We are given data $X_1, \ldots, X_n$ which are a sample from a density $p$. Let $\hat p_h$ be a kernel density estimator with kernel $K$ and bandwidth $h$. Let $g = \nabla \hat p_h$ denote the gradient of $\hat p_h$.
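For concreteness, one standard form of such an estimator (assuming the kernel $K$ integrates to one over $\mathbb{R}^d$) is
$$\hat p_h(x) = \frac{1}{n h^d} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right).$$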
Suppose that $\hat p_h$ has modes $m_1, m_2, \ldots, m_k$. (Note that $k$ is not determined in advance; it is simply the number of modes in the density estimator.) An arbitrary point $x$ is assigned to mode $m_j$ if the gradient ascent curve through $x$ leads to $m_j$. More formally, the gradient $g$ defines a family of integral curves. An integral curve $\pi : \mathbb{R} \to \mathbb{R}^d$ is defined by the differential equation
$$\pi'(t) = g(\pi(t)).$$
In words, moving along $\pi$ corresponds to moving up in the direction of the gradient. The integral curves partition the space.
For each mode $m_j$, define the set $A_j$ to be the set of all $x$ such that the integral curve starting at $x$ leads to $m_j$. Intuitively, if we place a particle at $x$ and let it flow up $\hat p_h$, it will end up at $m_j$. That is, if $\pi$ is the integral curve starting at $x$, then $\lim_{t \to \infty} \pi(t) = m_j$. The sets $A_1, \ldots, A_k$ form a partition. We can now partition the data according to which sets $A_j$ they fall into. In other words, we define clusters $C_1, \ldots, C_k$ where $C_j = \{ X_i : X_i \in A_j \}$.
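For example, in one dimension, if $\hat p_h$ is bimodal with modes $m_1 < m_2$ separated by a single local minimum at $c$, then $A_1 = (-\infty, c)$ and $A_2 = (c, \infty)$: the two clusters are just the data points falling on either side of $c$.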
To find the clusters, we only need to start at each $X_i$, move along its gradient ascent curve, and see which mode it goes to. This is where the mean shift algorithm comes in.
The mean shift algorithm approximates the integral curves (i.e. the gradient ascent curves). Although it is usually applied to the data $X_1, \ldots, X_n$, it can be applied to any set of points. Suppose we pick some point $x$ and we want to find out which mode it belongs to. We set $x^{(0)} = x$ and then, for $t = 0, 1, 2, \ldots$, set
$$x^{(t+1)} = \frac{\sum_{i=1}^n X_i \, K\!\left(\frac{\|x^{(t)} - X_i\|}{h}\right)}{\sum_{i=1}^n K\!\left(\frac{\|x^{(t)} - X_i\|}{h}\right)}.$$
In other words, replace $x^{(t)}$ with a weighted average of the data around $x^{(t)}$ and repeat. The trajectory $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$ converges to a mode of $\hat p_h$.
In practice, we run the algorithm not on some arbitrary $x$ but on each data point $X_i$. Thus, by running the algorithm you accomplish two things: you find the modes and you find which mode each data point $X_i$ goes to. Hence, you find the mode-based clusters.
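For readers who want to experiment, here is a minimal sketch of the iteration in Python. It is illustrative only: it assumes a Gaussian kernel, and the function names, tolerances, and mode-merging rule are arbitrary choices.

import numpy as np

def mean_shift(X, h, tol=1e-6, max_iter=500):
    """Run the mean shift iteration starting from every data point.

    X : (n, d) array of data points
    h : bandwidth of the (Gaussian) kernel
    Returns the endpoint of each trajectory; trajectories ending at the
    same mode give the mode-based clusters described above.
    """
    Z = X.astype(float).copy()
    for j in range(len(Z)):
        x = Z[j]
        for _ in range(max_iter):
            # weights K(||x^(t) - X_i|| / h) for a Gaussian kernel
            w = np.exp(-0.5 * np.sum((X - x) ** 2, axis=1) / h ** 2)
            x_new = (w[:, None] * X).sum(axis=0) / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        Z[j] = x
    return Z

def cluster_labels(Z, merge_tol=1e-2):
    """Group trajectory endpoints that coincide (up to merge_tol)."""
    modes, labels = [], []
    for z in Z:
        for k, m in enumerate(modes):
            if np.linalg.norm(z - m) < merge_tol:
                labels.append(k)
                break
        else:
            modes.append(z)
            labels.append(len(modes) - 1)
    return np.array(labels), np.array(modes)

# toy example: two well-separated blobs in the plane
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])
labels, modes = cluster_labels(mean_shift(X, h=1.0))
print(len(modes), "modes found")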
The algorithm can be slow, but the literature is full of papers on speed-ups, improvements, and so on.
Here is a simple example:
The big black dots are data points. The blue crosses are the modes. The red curves are the mean shift paths. Pretty, pretty cool.
2. Some Theory
Donoho and Liu (1991) showed that, if $p$ behaves in a locally quadratic fashion around the modes, then the minimax rate for estimating the modes is $n^{-1/5}$, and that this rate is achieved by kernel density estimators. Thus, the above procedure has some theoretical foundation. (Their result is for $d = 1$, but I am pretty sure it extends to $d > 1$.)
More recently, Klemelä (2005) showed that it is possible to obtain minimax estimators of the mode that adapt to the local behavior of $p$ around the modes. He uses a method due to Lepski (1992) that is a fairly general way to do adaptive inference.
My colleagues (Chris Genovese, Marco Perone-Pacifico and Isa Verdinelli) and I have been working on some extensions to the mean shift algorithm. I’ll report on that when we finish the paper.
I’ll close with a question: why don’t we use this algorithm more in statistical machine learning?
3. References
Cheng, Yizong (1995). Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 790-799.
Donoho, D.L. and Liu, R.C. (1991). Geometrizing rates of convergence, II. The Annals of Statistics, 19, 633-667.
Comaniciu, Dorin; Peter Meer (2002). Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603-619.
Fukunaga, Keinosuke; Larry D. Hostetler (1975). The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory, 21, 32-40.
Klemelä, J. (2005). Adaptive estimation of the mode of a multivariate density. Journal of Nonparametric Statistics, 17, 83-105.
Lepski, O.V. (1992). On problems of adaptive estimation in white Gaussian noise. Topics in nonparametric estimation, 12, 87-106.
—Larry Wasserman
14 Comments
For Gaussian mixture models, there can be more modes in the density than the number of clusters. Donoho's assumption is that this won't happen, but can this assumption become problematic?
Donoho does not assume a mixture model.
Just a density function with locally quadratic modes.
—LW
Eric: You assume implicitly that every Gaussian mixture component is a “cluster”. That’s legitimate but not unique. Identifying clusters with modes is another legitimate cluster concept that will give you different clusters.
I understand now, thanks
How is this method different from K-Means clustering?
k-means finds centers to minimize sums of squares.
This finds the modes of a density estimator.
Very different.
I’m not sure why it’s not used as much in ML. In my opinion, the bandwidth parameter is much more intuitive and easier to set (robust too?) than the number of clusters, which feels like it will depend too much on the data to know in advance. I suspect it’s because its theoretical analysis requires different tools than people are used to.
My favorite variant is Quick Shift, because it offers more insights into the structure of the data, and has a less hacky implementation. Instead of setting w[t+1] to the maximum of the local density estimate, it is set to the *data point* that maximizes this density. This means that w[t] is always a data point. It may not seem like much of a difference, but it has huge implications.
First, you only need to compute w[t+1] once for each data point, instead of going through a number of iterations. The result will be a forest graph, since each data point (node) has an arc going to the next data point in the density. This is interesting because it reveals a hierarchical structure inside each mode; I’m not sure if there’s any work on this. And a simple graph traversal will associate each data point to a mode (a tree root), while Mean Shift would require a final thresholding step (due to roundoff errors in the final w[t]).
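Here is a rough sketch of the idea in Python (my own illustrative code, assuming a Gaussian kernel density estimate, not taken from the Quick Shift paper):

import numpy as np

def quick_shift(X, h, max_dist=np.inf):
    """Link each point to its nearest neighbor of higher estimated density.
    Roots of the resulting forest play the role of the modes."""
    n = len(X)
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # squared distances
    density = np.exp(-0.5 * d2 / h ** 2).sum(axis=1)            # Gaussian KDE at each point
    parent = np.arange(n)
    for i in range(n):
        higher = np.where(density > density[i])[0]
        if len(higher) > 0:
            j = higher[np.argmin(d2[i, higher])]
            if d2[i, j] <= max_dist ** 2:
                parent[i] = j   # arc to the nearest higher-density point
    return parent               # follow parents to a root to get each point's cluster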
Thanks for the comment.
For interested readers, here is a link to Quick Shift:
vedaldiS08quick.pdf
Perhaps because kernel density estimates break down in high dimensional spaces (or even in moderately dimensional spaces)? Most ML problems live in very high dimensional spaces. So while mean shift is great for images in 2-d or 3-d, it doesn’t help me much in 127-d SIFT space.
Just a comment that might be of interest. Looking at the mean-shift from a different (regression) point of view, it is essentially a (discretized) diffusion equation. With a model y = z + error, say we wish to estimate z. Consider a Gram matrix K constructed from the kernel k. The locally constant kernel smoother for z is
z_hat = W y where W = D^(-1) K, with D being the diagonal matrix that normalizes the rows of K to sum to 1. Then (what’s known in the vision and image processing community as) the diffusion iteration is z(k+1) = W z(k), which coincides with the definition of mean-shift. I’ve written about this regression/filtering interpretation in the paper below. Page 17 in particular is most relevant.
ModernTour_FinalSubmission.pdf
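For concreteness, a tiny sketch of that iteration in Python (purely illustrative; a Gaussian Gram matrix is just one possible choice):

import numpy as np

def diffusion_smooth(y, X, h, steps=10):
    """Iterate z <- W z with W = D^(-1) K (row-normalized Gram matrix)."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-0.5 * d2 / h ** 2)           # Gram matrix built from a Gaussian kernel
    W = K / K.sum(axis=1, keepdims=True)     # D^(-1) K: each row sums to one
    z = y.astype(float).copy()
    for _ in range(steps):
        z = W @ z                            # diffusion iteration z(k+1) = W z(k)
    return z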
Thanks for the reference
—LW
I am coming across this for the first time. However, the behaviour looks oddly identical to the Chinese Restaurant Process. Is there any connection?
Not really (except in the vague sense that the Chinese restaurant process is a way of doing nonparametric Bayesian density estimation).
—LW
Thanks for the reply.
It just felt that there might be some connection where the bandwidth parameter is some analog of the concentration parameter in the CRP. Also, it seemed that both behaved similarly, i.e., points being assigned to clusters based on the number of points they already have. I might just be reading too much into it. Thanks again.