Topological Inference

We uploaded a paper called Statistical Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.

The basic idea is this. We observe data {{\cal D} = \{X_1,\ldots, X_n\}} where {X_1,\ldots, X_n \sim P} and {P} is supported on a set {\mathbb{M}}. We want to infer topological features of {\mathbb{M}}. For example, here are data from a donut:


The top left plot is the set {\mathbb{M}}. The top right plot shows a sample from a distribution on {\mathbb{M}}. The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot which shows

\displaystyle  \hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)

where {B(X_i,\epsilon)} is a ball of size {\epsilon} centered at {X_i}. When {\epsilon} is just right, {\hat S_\epsilon} will have one connected component and one hole, just like {\mathbb{M}}. Here are plots of {\hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)} for various values of {\epsilon}:


As we vary {\epsilon}, topological features (connected components, holes etc) will be born and then die. Each feature can be represented by a pair {z=({\rm birth},{\rm death})} indicating the birth time and death time of that feature. A persistence diagram {{\cal P}} is a plot of the points {z}. Small features will have death times close to their birth times. In other words, small features correspond to a point {z} near the diagonal. Here is an example of a persistence diagram:


Informally, topological noise correspond to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.

The distance function to {\mathbb{M}} is

\displaystyle  d_{\mathbb{M}}(x) = \inf_{y\in \mathbb{M}} ||x-y||

and the distance function to the data is

\displaystyle  \hat d(x) = \inf_{X_i\in {\cal D}} ||x-X_i||.

The persistence diagram {{\cal P}} of {\mathbb{M}} represents the evolution of the topology of the lower level sets {\{x:\ d_{\mathbb{M}}(x)\leq \epsilon\}} as a function of {\epsilon}. The estimate {\hat{\cal P}} of {{\cal P}} is the persistence diagram based on the lower level sets of {\hat d}, that is, {\{x:\ \hat d(x) \leq \epsilon\}.} But notice that {\{x:\ \hat d(x) \leq \epsilon\}} is precisely equal to {\hat S_\epsilon}:

\displaystyle  \hat S_\epsilon = \{x:\ \hat d(x) \leq \epsilon\}.

To make inferences about {{\cal P}} from {\hat{\cal P}} we need to infer how far apart these two diagrams are. The bottleneck distance {W_\infty(\hat{\cal P},{\cal P})} measures the distance between the estimated and true persistence diagram. Basically, this measures how much we have to move the points in {\hat{\cal P}} to match the points in {{\cal P}} as in this plot:


Our paper shows several ways to construct a bound {c_n = c_n(X_1,\ldots, X_n)} so that

\displaystyle  \limsup_{n\rightarrow\infty} \mathbb{P}(W_\infty(\hat{\cal P},{\cal P}) > c_n) \leq \alpha.

Then we add a strip of size {c_n} (actually, of size {\sqrt{2}c_n}) to the persistence diagram. Points in the strip are declared to be “noise” as in this example:


To bound {W_\infty(\hat{\cal P},{\cal P})} we use the fact that {W_\infty(\hat{\cal P},{\cal P}) \leq H({\cal D},\mathbb{M})} where {H} is the Hausdorff distance which is defined by

\displaystyle  H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B \oplus \epsilon\ \ \ {\rm and}\ \ \ B \subset A \oplus \epsilon\Bigr\}


\displaystyle  A \oplus \epsilon = \bigcup_{x\in A}B(x,\epsilon)

and {B(x,\epsilon)} is a ball of size {\epsilon} centered at {x}.

To summarize: to find a confidence interval for {W_\infty(\hat{\cal P},{\cal P})} it suffices to get a confidence interval for {H({\cal D},\mathbb{M})}.

One approach for bounding {H({\cal D},\mathbb{M})} is subsampling. We draw random subsamples, {\Omega_1,\ldots,\Omega_N}, each of size {b}, from the original data. Here {b=b_n} satisfies {b_n = o(n)} and {b_n \rightarrow \infty} as {n\rightarrow\infty}. Then we compute

\displaystyle  T_j = H(\Omega_j,{\cal D}),\ \ \ j=1,\ldots, N.

Now we compute

\displaystyle  L(t) = \frac{1}{N}\sum_{j=1}^N I(T_j > t).

Finally, we get the upper quantile of this distribution, {c_n= 2L^{-1}(\alpha)}. (The factor of 2 is for technical reasons). This gives us what we want. Specifically,

\displaystyle  \mathbb{P}(W_\infty(\hat{\cal P}, {\cal P}) > c_n) \leq \alpha+ O\left(\sqrt{ \frac{b \log n}{n}}\right) + \frac{2^d}{n\log n} \approx \alpha

where {d} is the dimension of the set {\mathbb{M}}.

The paper contains several other methods for bounding {W_\infty(\hat{\cal P},{\cal P})}. Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.

P.S. Sebastien Bubeck has a new blog, here


  1. kenmccue
    Posted April 10, 2013 at 11:30 pm | Permalink

    Any software available for estimating these models?

    • Posted April 11, 2013 at 7:31 am | Permalink

      Yes. Plex and Dionysus are the ones I know of.
      Also, the phom package in R

  2. Matías Schrauf
    Posted April 16, 2013 at 9:39 pm | Permalink

    Very interesting post, thanks for writing it. For the moment, it seems there is more theory than applications, but that may change soon (or be a misperception originated from my ignorance of the topic.
    In the paper “Stability of Persistence Diagrams” (Cohen-Steiner et al 2005), they work with the persistent homologies of the sub-level sets of functions, where a change in the homology indicates a critical point of the function. That is a kind of persistent homology which I think I could use. Does anybody know any software (or R package) appropiate for calculating such homologies? If there is any, what kind of description do they use for functions?
    They are related to topics covered in other posts of this blog (density based clusters and CI for the # of modes of a distribution).

  3. Posted April 20, 2013 at 4:12 pm | Permalink

    We are currently using Dionysus ( ).
    I will shortly write some instructions on how to use Dionysus within R.
    The basic idea is:
    – Determine the values of your function over a grid of points
    – Dionysus triangulates the space and returns the homology of lower (or upper) level sets of the function

    Stay tuned

%d bloggers like this: