Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computation geometry.
The basic idea of TDA is to describe the “shape of the data” by finding clusters, holes, tunnels, etc. Cluster analysis is special case of TDA. I’m not an expert on TDA but I do find it fascinating. I’ll try to give a flavor of what this subject is about.
1. Cycles and Holes
Let us start with clusters. Consider data where each is a point in . One way to define clusters is to define a graph over the data. We fix a tuning parameter and we put an edge between two data points and if . We can then define the clusters to be the connected components of this graph.
This type of clustering is identical to single linkage clustering.
There are problems with single linkage clustering; for example, single linkage cluster are inconsistent estimates of high density regions; see Hartigan (1981). So this should raise some alarm bells. But let’s ignore that for now.
To extract more topological information— in particular, to get the homology groups— we need to do some more work. In addition to the edges of the graph, we will include triangles and higher order simplices. Look at the following example:
The top left shows some data. We see, intuitively, that there is one “hole” in the data. Also, shown is the Čech complex, described below. Basically this consists of 0-simplices (the points), the 1-simplices (the edges) and a 2-simplex (a triangle). The top right and bottom left show two “cycles” (or loops) through the data. Both cycles show the same hole in the data. To capture the idea that both cycles capture the same hole we need to regard these cycles as equivalent. This means that we can deform one cycle into the other. More formally, the difference between the two cycles is a “filled-in cycle,” that is, it is the boundary of a higher-order object. Here is another example of two equivalent cycles based on a Figure 4.9 from Zomorodian 2005: (Afra Zomorodian, 2005, Topology for computing.
2. Making It Formal: The Čech Complex
Think of each data point as a 0-dimensional simplex. Let denote these simplices. Think of each edge in the graph as a 1-dimensional simplex. Let denote these simplices (i.e. the edges). (We only include an edge if .) Now we form a set of 2-dimensional simplices (triangles) as follows. If three points are such that then we include the simplex defined by these three points, denoted by , in . Here, is a ball of radius centered at .
Now keep going and form the 3-dimensional simplices , the 4-dimensional simplices , and so on. (Note that a simplex is defined by points.)
The collection of simplices has a special structure. In particular, if then any face of is also in . A class of simplices with this property is called a simplicial complex. This particular simplicial complex is called a Čech complex.
The Čech complex contains topological information. To see this, we need a few more definitions. Specifically, we introduce chains which are sets of -dimensional simples, cycles which are chains without boundary (cycles), and boundary cycles which are filled in cycles, that is, they are cycles but they are also boundaries of higher order chains. This leads to the following key picture of homology:
A chain of order is any set of -simplices. Formally, these are written as a sum. For example,
is a chain consisting of three, one-dimensional simplices (edges). Intuitively, you can think of the sum as taking the set difference. Thus . The chains are a group under addition.
The boundary of a simplex is obtained by removing each element one at a time and adding the result. For example, the boundary of the triangle is
Thus, the triangle has an empty 2-boundary.
**[Edit]** As pointed out below by Ramsay, this is an error. Need to apply after applying and then
we get the empty set.
More generally, the boundary of a -simplex is
where means that has been removed. The boundary of a chain is defined to be .
A chain with an empty boundary is a cycle. The set of cycles is denoted by . Formally, cycles are the kernel of boundary map: .
Some elements of are “filled in cycles.” Formally, these boundary cycles are boundaries of chains in . That is, the boundary cycles are the image of : . It may be verified that
This yields the key picture above.
Now two cycles and are considered equivalent if they differ by an element of . That is, if for some . The set equivalence classes is the homology group. Formally, . (Think of the two cycles in the first example defining the same hole.) The order of the group is called the Betti number and is denoted by .
How do we compute the groups and the Betti numbers? All of the above can be expressed as matrices. Basically, the matrices capture which simplices are boundaries of which other simplices. This is the appeal of homology: it can be expressed as linear algebra. The Matlab package “Plex” will take a set of points and compute all the stuff we have discussed.
In case your getting confused, let’s step back and summarize the main point: is the number of connected components. is the number of holes. is the number of tunnels. etc.
Notice that there is a tuning parameter . The conclusions depend crucially on . How do we choose it? Usually, we don’t. Instead, one displays the topological information as a function of . This can be done in a barcode plot as follows: (this is Figure 4 from Ghrist (2008)):
Source: Figure 4, from Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.
What we see here is that as varies, some topological features appear and then disappear. The features that persist for a long stretch are considered to be real; the small ones are called topological noise. This type of analysis is called persistent homology.
4. What Are We Actually Estimating?
Suppose the data were obtained as follows. First we sample data from a distribution which is supported on some manifold . We don’t observe . Instead, we observe where represent random noise.
The manifold has homology groups defined in a way similarly to how we defined the groups for the data. The hope is that the homology from the data estimates the homology of . The implicit belief seems to be that persistent features (long bars in the barcode plot) are real features of the underlying manifold.
What’s known more precisely is that, under certain conditions, and for some choices of , the Čech complex has high probability of having the right homology. See, for example, Niyogi, Smale and Weinberger (2011). There are other ways to try to estimate the homology of ; see for example, Caillerie (et al 2012).
In fact, the data need to be cleaned first by removing low density points. This means we need to first do some sort of density estimation. (This relates to my earlier comment about the lack of consistency of single linkage clustering.) So there are in fact many tuning parameters and practical ways to choose appropriate tuning parameters still seem elusive. This is an example of what I like to call the curse of tuning parameters.
5. What Do We Use It For?
TDA is fascinating. But is it useful? One can find many examples of TDA in genomics, neuroscience and image analysis, for example,
But is TDA crucial here or is it a hammer looking for a nail? After all, the most successful data analysis tools are simple and homology is pretty complicated.
To me, the jury is still out. I find TDA interesting but whether it will prove to be an enduring set of data analysis methods remains to be seen. In the meantime, I hope people will keep thinking about it.
Edelsbrunner, H. and Harer, J. (2010). Computational Topology: An Introduction.
Zomorodian, Afra. (2005). Topology for Computing.
Balakrishnan, S., Rinaldo, A., Sheehy,D., Singh,A. and Wasserman, L. (2011). Minimax Rates for Homology Inference.
Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.
Y. Dabaghian, F. Mémoli, L. Frank, G. Carlsson. (2012). A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology. link.
Niyogi, P. and Smale, S. and Weinberger, S. (2011). A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40, p 646.
Caillerie, Chazal, Dedecker, and Michel. (2012). Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5, 1394–1423. link.
— Larry Wasserman