We uploaded a paper called *Statistical Inference For Persistent Homology* on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.

The basic idea is this. We observe data where and is supported on a set . We want to infer topological features of . For example, here are data from a donut:

The top left plot is the set . The top right plot shows a sample from a distribution on . The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot which shows

where is a ball of size centered at . When is just right, will have one connected component and one hole, just like . Here are plots of for various values of :

As we vary , topological features (connected components, holes etc) will be born and then die. Each feature can be represented by a pair indicating the birth time and death time of that feature. A *persistence diagram* is a plot of the points . Small features will have death times close to their birth times. In other words, small features correspond to a point near the diagonal. Here is an example of a persistence diagram:

Informally, topological noise correspond to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.

The *distance function* to is

and the distance function to the data is

The persistence diagram of represents the evolution of the topology of the lower level sets as a function of . The estimate of is the persistence diagram based on the lower level sets of , that is, But notice that is precisely equal to :

To make inferences about from we need to infer how far apart these two diagrams are. The *bottleneck distance* measures the distance between the estimated and true persistence diagram. Basically, this measures how much we have to move the points in to match the points in as in this plot:

Our paper shows several ways to construct a bound so that

Then we add a strip of size (actually, of size ) to the persistence diagram. Points in the strip are declared to be “noise” as in this example:

To bound we use the fact that where is the Hausdorff distance which is defined by

where

and is a ball of size centered at .

To summarize: to find a confidence interval for it suffices to get a confidence interval for .

One approach for bounding is subsampling. We draw random subsamples, , each of size , from the original data. Here satisfies and as . Then we compute

Now we compute

Finally, we get the upper quantile of this distribution, . (The factor of 2 is for technical reasons). This gives us what we want. Specifically,

where is the dimension of the set .

The paper contains several other methods for bounding . Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.

P.S. Sebastien Bubeck has a new blog, here

## 6 Comments

Any software available for estimating these models?

Yes. Plex and Dionysus are the ones I know of.

Also, the phom package in R

Very interesting post, thanks for writing it. For the moment, it seems there is more theory than applications, but that may change soon (or be a misperception originated from my ignorance of the topic.

In the paper “Stability of Persistence Diagrams” (Cohen-Steiner et al 2005), they work with the persistent homologies of the sub-level sets of functions, where a change in the homology indicates a critical point of the function. That is a kind of persistent homology which I think I could use. Does anybody know any software (or R package) appropiate for calculating such homologies? If there is any, what kind of description do they use for functions?

They are related to topics covered in other posts of this blog (density based clusters and CI for the # of modes of a distribution).

We are currently using Dionysus ( http://www.mrzv.org/software/dionysus/ ).

I will shortly write some instructions on how to use Dionysus within R.

The basic idea is:

– Determine the values of your function over a grid of points

– Dionysus triangulates the space and returns the homology of lower (or upper) level sets of the function

Stay tuned

That’s great

Detailed instructions:

http://www.stat.cmu.edu/~flecci/research/dionysus.html