We uploaded a paper called Statistical Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.
The basic idea is this. We observe data where and is supported on a set . We want to infer topological features of . For example, here are data from a donut:
The top left plot is the set . The top right plot shows a sample from a distribution on . The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot which shows
where is a ball of size centered at . When is just right, will have one connected component and one hole, just like . Here are plots of for various values of :
As we vary , topological features (connected components, holes etc) will be born and then die. Each feature can be represented by a pair indicating the birth time and death time of that feature. A persistence diagram is a plot of the points . Small features will have death times close to their birth times. In other words, small features correspond to a point near the diagonal. Here is an example of a persistence diagram:
Informally, topological noise correspond to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.
The distance function to is
and the distance function to the data is
The persistence diagram of represents the evolution of the topology of the lower level sets as a function of . The estimate of is the persistence diagram based on the lower level sets of , that is, But notice that is precisely equal to :
To make inferences about from we need to infer how far apart these two diagrams are. The bottleneck distance measures the distance between the estimated and true persistence diagram. Basically, this measures how much we have to move the points in to match the points in as in this plot:
Our paper shows several ways to construct a bound so that
Then we add a strip of size (actually, of size ) to the persistence diagram. Points in the strip are declared to be “noise” as in this example:
To bound we use the fact that where is the Hausdorff distance which is defined by
and is a ball of size centered at .
To summarize: to find a confidence interval for it suffices to get a confidence interval for .
One approach for bounding is subsampling. We draw random subsamples, , each of size , from the original data. Here satisfies and as . Then we compute
Now we compute
Finally, we get the upper quantile of this distribution, . (The factor of 2 is for technical reasons). This gives us what we want. Specifically,
where is the dimension of the set .
The paper contains several other methods for bounding . Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.
P.S. Sebastien Bubeck has a new blog, here