Hunting for Manifolds

Larry Wasserman

In this post I’ll describe one aspect of the problem of estimating a manifold, or manifold learning, as it is called.

Suppose we have data ${Y_1,\ldots, Y_n \in \mathbb{R}^D}$ and suppose that the data lie on, or close to, a manifold ${M}$ of dimension ${d< D}$. There are two different goals we might have in mind.

The first goal is to use the fact that the data lie near the manifold as a way of doing dimension reduction. The popular methods for solving this problem include isomap, locally linear embedding, diffusion maps, and Laplacian eigenmaps, among others.

The second goal is to actually estimate the manifold ${M}$.

I have had the good fortune to work with two groups of people on this problem. With the first group (Chris Genovese, Marco Perone-Pacifico, Isa Verdinelli and me) we have focused mostly on certain geometric aspects of the problem. With the second group (Sivaraman Balakrishnan, Don Sheehy, Aarti Singh, Alessandro Rinaldo and me) we have focused on topology.

It is this geometric problem (my work with the first group) that I will discuss here. In particular, I will focus on the two Genovese, Perone-Pacifico, Verdinelli and Wasserman (2012) papers listed in the references below.

1. Statistical Model

To make some progress, we introduce a statistical model for the data. Here are two models.

Model I (Clutter Noise): With probability ${\pi}$ we draw ${Y_i}$ from a distribution ${G}$ supported on ${M}$. With probability ${1-\pi}$ we draw ${Y_i}$ from a uniform distribution on some compact set ${K}$. To be concrete, let’s take ${K = [0,1]^D}$. The distribution ${P}$ of ${Y_i}$ is

$\displaystyle P = (1-\pi) U + \pi G$

where ${U}$ is uniform on ${[0,1]^D}$.

Model II (Additive Noise): We draw ${X_i}$ from a distribution ${G}$ supported on ${M}$. Then we let ${Y_i = X_i + \epsilon_i}$ where ${\epsilon_i}$ is drawn from a ${D}$-dimensional Gaussian distribution. In this case, ${P}$ has density ${p}$ given by

$\displaystyle p(y) = \int_M \phi(y-z) dG(z)$

where ${\phi}$ is a Gaussian density. We can also write this as

$\displaystyle P = G \star \Phi$

where ${\star}$ denotes convolution.
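To make the two models concrete, here is a small Python sketch (my own illustration, not from the papers; the circle manifold, the mixing weight ${\pi}$, and the noise level are arbitrary choices) that samples from both models when ${M}$ is a circle in ${[0,1]^2}$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_circle(n, radius=0.3, center=(0.5, 0.5)):
    """Draw n points from G = the uniform distribution on a circle M."""
    theta = rng.uniform(0, 2 * np.pi, n)
    return np.array(center) + radius * np.column_stack([np.cos(theta), np.sin(theta)])

def model_clutter(n, pi=0.8):
    """Model I: with prob. pi draw from G on M, else from uniform clutter U on [0,1]^2."""
    on_manifold = rng.uniform(size=n) < pi
    Y = rng.uniform(size=(n, 2))              # clutter draws from U
    Y[on_manifold] = sample_circle(on_manifold.sum())
    return Y

def model_additive(n, sigma=0.05):
    """Model II: Y = X + eps with X ~ G on M and eps ~ N(0, sigma^2 I_2)."""
    X = sample_circle(n)
    return X + sigma * rng.normal(size=(n, 2))

Y1 = model_clutter(1000)   # mixture (1 - pi) U + pi G
Y2 = model_additive(1000)  # the convolution G * Phi
```

Plotting `Y1` shows points on the circle embedded in uniform clutter, while `Y2` shows the circle blurred by the Gaussian convolution ${G \star \Phi}$.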

There are some interesting complications. First, note that ${G}$ is a singular distribution. That is, there are sets with positive probability but 0 Lebesgue measure. For example, suppose that ${M}$ is a circle in ${\mathbb{R}^2}$ and that ${G}$ is uniform on the circle. Then the circle has Lebesgue measure 0 but ${G(M)=1}$. Model II has the complication that it is a convolution and convolutions are not easy to deal with.

Our goal is to construct an estimate ${\hat M}$.

2. Hausdorff Distance

We will compare ${\hat M}$ to ${M}$ using the Hausdorff distance. If ${A}$ and ${B}$ are sets, the Hausdorff distance between ${A}$ and ${B}$ is

$\displaystyle H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B\oplus \epsilon,\ {\rm and}\ B \subset A\oplus \epsilon\Bigr\}$

where

$\displaystyle A\oplus \epsilon = \bigcup_{x\in A} B_D(x,\epsilon)$

and ${B_D(x,\epsilon)}$ is a ball of radius ${\epsilon}$ centered at ${x}$.

Let’s decode what this means. Imagine we start to grow the set ${A}$ by placing a ball of size ${\epsilon}$ around each point in ${A}$. We keep growing ${A}$ until it engulfs ${B}$. Let ${\epsilon_1}$ be the size of ${\epsilon}$ needed to do this. Now grow ${B}$ until it engulfs ${A}$. Let ${\epsilon_2}$ be the size of ${\epsilon}$ needed to do this. Then ${H(A,B) = \max\{\epsilon_1,\epsilon_2\}}$.

There is another way to think of the Hausdorff distance. If ${A}$ is a set and ${x}$ is a point, let ${d_A(x)}$ be the distance from ${x}$ to ${A}$:

$\displaystyle d_A(x) = \inf_{y\in A}||y-x||.$

We call ${d_A}$ the distance function. Then

$\displaystyle H(A,B) = \sup_x | d_A(x) - d_B(x)|.$
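For finite point sets the two definitions agree, and the Hausdorff distance is easy to compute directly. Here is a minimal NumPy sketch (the function names are my own) built on the distance function:

```python
import numpy as np

def dist_to_set(x, A):
    """The distance function d_A(x) = min over y in A of ||y - x||."""
    return np.min(np.linalg.norm(A - x, axis=1))

def hausdorff(A, B):
    """H(A, B) for finite point sets A, B (rows are points)."""
    eps1 = max(dist_to_set(x, B) for x in A)  # grow B until it engulfs A
    eps2 = max(dist_to_set(x, A) for x in B)  # grow A until it engulfs B
    return max(eps1, eps2)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0], [3.0, 0.0]])
print(hausdorff(A, B))  # 2.0: the point (3, 0) is distance 2 from A
```

This brute-force version costs ${O(|A|\,|B|)}$ distance evaluations; for large point clouds one would use a spatial index, but the sketch is enough to see the definition in action.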

Using ${H}$ as our loss function, our goal is to find the minimax risk:

$\displaystyle R_n = \inf_{\hat M}\sup_{P\in {\cal P}} \mathbb{E}_P [ H(\hat M,M)]$

where the infimum is over all estimators ${\hat M}$ and the supremum is over a set of distributions ${{\cal P}}$. Actually, we will consider a large set of manifolds ${{\cal M}}$ and we will associate a distribution ${P}$ with each manifold ${M}$. This will define the set ${{\cal P}}$.

3. Reach

We will consider all ${d}$-dimensional manifolds ${M}$ with positive reach. The reach is a property of a manifold that appears in many places in manifold estimation theory. I believe it first appeared in Federer (1959).

The reach ${\kappa(M)}$ is the largest number ${\epsilon}$ such that each point in ${M\oplus \epsilon}$ has a unique projection onto ${M}$. A set of positive reach has two nice properties: it is smooth, and it is self-avoiding.

Consider a circle ${M}$ in the plane with radius ${R}$. Every point in the plane has a unique projection onto ${M}$ with one exception: the center ${c}$ of the circle is at distance ${R}$ from every point on ${M}$, so there is no unique closest point on ${M}$ to ${c}$. Hence the reach is ${R}$.

A plane has infinite reach. A line with a corner has reach 0. For intuition, consider a curve ${M}$ in two dimensions and place a line segment, perpendicular to ${M}$, at each point of ${M}$. If the segments are shorter than reach(${M}$), they won’t cross. If they are longer than reach(${M}$), some of them will cross.
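The circle example can be checked numerically. In this sketch (my own, not from the papers), the projection of a point ${y \neq c}$ onto a circle of radius ${R}$ centered at ${c}$ is found by moving distance ${R}$ from ${c}$ in the direction of ${y}$; uniqueness fails only at the center itself, which is why the reach is ${R}$:

```python
import numpy as np

def project_to_circle(y, center, R):
    """Project y onto the circle of radius R around center.

    The projection is unique for every y except the center,
    where all points of the circle are equally close -- so the
    reach of the circle is exactly R.
    """
    v = np.asarray(y, dtype=float) - center
    norm = np.linalg.norm(v)
    if norm == 0:
        raise ValueError("no unique projection: y is the center")
    return center + R * v / norm

center = np.array([0.0, 0.0])
p = project_to_circle([2.0, 0.0], center, R=1.0)  # -> array([1., 0.])
```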

Requiring ${M}$ to have positive reach is a strong condition. There are ways to weaken it based on generalizations of reach, but we won’t go into those here.

The set of manifolds we are interested in is the set of ${d}$-manifolds contained in ${[0,1]^D}$ with reach at least ${\kappa >0}$. Here, ${\kappa}$ is some arbitrary, fixed positive number.

4. The Results

Recall that the minimax risk is

$\displaystyle R_n = \inf_{\hat M}\sup_{P\in {\cal P}} \mathbb{E}_P [ H(\hat M,M)].$

In the case of clutter noise it turns out that

$\displaystyle R_n \approx \left(\frac{1}{n}\right)^{\frac{2}{d}}.$

Note that the rate depends on ${d}$ but not on the ambient dimension ${D}$. But in the case of additive Gaussian noise,

$\displaystyle R_n \approx \left(\frac{1}{\log n}\right).$

The latter result says that ${R_n}$ approaches 0 very slowly. For all practical purposes, it is not possible to estimate ${M}$. This might be surprising. But, in fact, the result is not unexpected.

We can explain this by analogy with another problem. Suppose you want to estimate a density ${p}$ from ${n}$ observations ${Y_1,\ldots, Y_n}$ drawn from ${p}$. If ${p}$ satisfies standard smoothness assumptions, then the minimax risk has the form

$\displaystyle \left(\frac{1}{n}\right)^{\frac{4}{4+d}}$

which is a nice, healthy, polynomial rate. But if we add some Gaussian noise to each observation, the minimax rate of convergence for estimating ${p}$ is logarithmic (Fan 1991). The same thing happens if we add Gaussian noise to the covariates in a regression problem. A little bit of Gaussian noise makes these problems essentially impossible.

Is there a way out of this? Yes. Instead of estimating ${M}$ we estimate a set ${M'}$ that is close to ${M}$. In other words, we have to live with some bias. We can estimate ${M'}$, which we call a surrogate for ${M}$, with a polynomial rate.

We are finishing a paper on this idea and I’ll write a post about it when we are done.

5. Conclusion

Estimating a manifold is a difficult statistical problem. With additive Gaussian noise, it is nearly impossible. And notice that I assumed that the dimension ${d}$ of the manifold was known. Things are even harder when ${d}$ is not known.

However, none of these problems are insurmountable as long as we change the goal from “estimating ${M}$” to “estimating an approximation of ${M}$” which, in practice, is usually quite reasonable.

6. References

Fan, J. (1991). On the optimal rates of convergence for nonparametric deconvolution problems. Ann. Statist. 19 1257-1272.

Federer, H. (1959). Curvature measures. Trans. Amer. Math. Soc. 93 418-491.

Genovese, C., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2012). Minimax Manifold Estimation. Journal of Machine Learning Research, 13, 1263–1291.

Genovese, C., Perone-Pacifico, M., Verdinelli, I. and Wasserman, L. (2012). Manifold estimation and singular deconvolution under Hausdorff loss. The Annals of Statistics, 40, 941-963.

1. Chris Saunders
Posted September 8, 2012 at 11:48 am

Just wanted to let you know that I really enjoy your blog.

• Justin Scheiner
Posted September 8, 2012 at 2:35 pm

Agreed! The blog has been mostly digestible by ordinary tech people (at least with one example to go off of). I had an easier time thinking of the stuff above in terms of dilation and erosion (like in computer vision), but since I was thinking of stuff as “growing” had to flip the definition of reach in my head (maximum \epsilon such that points do not have a unique projection).

Thanks!

• Posted September 8, 2012 at 1:06 pm

Thanks!

2. kenmccue
Posted September 8, 2012 at 12:39 pm

How does an approach such as the one in “A geometric analysis of subspace clustering with outliers” by M. Soltanolkotabi and E. J. Candès compare with the approaches here?

• Posted September 8, 2012 at 1:08 pm

If I remember correctly, they are looking for clusters
around linear subspaces.
(But I would need to check)

3. agoldensh
Posted September 8, 2012 at 2:59 pm

Larry, a closely related two-dimensional problem was considered in the paper “Estimating trajectories” by Y. Chow in the Annals of Statistics in 1986…

• Posted September 8, 2012 at 5:56 pm

Thanks for the reference
LW

4. Posted September 8, 2012 at 5:47 pm

…just wondering if those slow rates have anything to do with the metric…I mean, Hausdorff is a good guy but it seems Wasserstein is gaining momentum…

–Pierp

• Posted September 8, 2012 at 5:54 pm

Yes I think you are right.
The rate in Wasserstein would probably be faster.
LW

5. Jay
Posted September 9, 2012 at 3:01 pm

I’m wondering if you can provide a reference for your “errors in variables” comment on adding Gaussian noise to covariates. Thanks.