## Topological Data Analysis

Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computation geometry.

The basic idea of TDA is to describe the “shape of the data” by finding clusters, holes, tunnels, etc. Cluster analysis is special case of TDA. I’m not an expert on TDA but I do find it fascinating. I’ll try to give a flavor of what this subject is about.

1. Cycles and Holes

Let us start with clusters. Consider data ${Y_1,\ldots,Y_n}$ where each ${Y_i}$ is a point in ${\mathbb{R}^d}$. One way to define clusters is to define a graph over the data. We fix a tuning parameter ${\epsilon}$ and we put an edge between two data points ${Y_i}$ and ${Y_j}$ if ${||Y_i - Y_j|| \leq \epsilon}$. We can then define the clusters to be the connected components of this graph.

This type of clustering is identical to single linkage clustering.

There are problems with single linkage clustering; for example, single linkage cluster are inconsistent estimates of high density regions; see Hartigan (1981). So this should raise some alarm bells. But let’s ignore that for now.

To extract more topological information— in particular, to get the homology groups— we need to do some more work. In addition to the edges of the graph, we will include triangles and higher order simplices. Look at the following example:

The top left shows some data. We see, intuitively, that there is one “hole” in the data. Also, shown is the Čech complex, described below. Basically this consists of 0-simplices (the points), the 1-simplices (the edges) and a 2-simplex (a triangle). The top right and bottom left show two “cycles” (or loops) through the data. Both cycles show the same hole in the data. To capture the idea that both cycles capture the same hole we need to regard these cycles as equivalent. This means that we can deform one cycle into the other. More formally, the difference between the two cycles is a “filled-in cycle,” that is, it is the boundary of a higher-order object. Here is another example of two equivalent cycles based on a Figure 4.9 from Zomorodian 2005: (Afra Zomorodian, 2005, Topology for computing.

2. Making It Formal: The Čech Complex

Think of each data point as a 0-dimensional simplex. Let ${S_0=\{ [Y_1], \ldots, [Y_n]\}}$ denote these simplices. Think of each edge in the graph as a 1-dimensional simplex. Let ${S_1}$ denote these simplices (i.e. the edges). (We only include an edge if ${||Y_i - Y_j|| \leq \epsilon}$.) Now we form a set ${S_2}$ of 2-dimensional simplices (triangles) as follows. If three points ${Y_i,Y_j,Y_k}$ are such that ${B(Y_i,\epsilon)\cap B(Y_j,\epsilon)\cap B(Y_k,\epsilon)\neq \emptyset}$ then we include the simplex defined by these three points, denoted by ${[Y_i,Y_j,Y_k]}$, in ${S_2}$. Here, ${B(Y_i,\epsilon)}$ is a ball of radius ${\epsilon}$ centered at ${Y_i}$.

Now keep going and form the 3-dimensional simplices ${S_3}$, the 4-dimensional simplices ${S_4}$, and so on. (Note that a ${k}$ simplex is defined by ${k+1}$ points.)

The collection ${C_\epsilon}$ of simplices ${\{S_0,S_1, \ldots, S_{n-1}\}}$ has a special structure. In particular, if ${\sigma\in C_\epsilon}$ then any face of ${\sigma}$ is also in ${C_\epsilon}$. A class of simplices with this property is called a simplicial complex. This particular simplicial complex is called a Čech complex.

The Čech complex contains topological information. To see this, we need a few more definitions. Specifically, we introduce chains ${C_p}$ which are sets of ${p}$-dimensional simples, cycles ${Z_p}$ which are chains without boundary (cycles), and boundary cycles ${B_p}$ which are filled in cycles, that is, they are cycles but they are also boundaries of higher order chains. This leads to the following key picture of homology:

A chain of order ${k}$ is any set of ${k}$-simplices. Formally, these are written as a sum. For example,

$\displaystyle [Y_1,Y_2] + [Y_2,Y_3] + [Y_3,Y_4]$

is a chain consisting of three, one-dimensional simplices (edges). Intuitively, you can think of the sum as taking the set difference. Thus ${[Y_1,Y_2] + [Y_2,Y_3] =[Y_1,Y_3]}$. The chains ${C_p}$ are a group under addition.

The boundary of a simplex ${\sigma}$ is obtained by removing each element one at a time and adding the result. For example, the boundary of the triangle ${[Y_1,Y_2,Y_3]}$ is

$\displaystyle \partial_2 [Y_1,Y_2,Y_3] = [Y_2,Y_3] + [Y_1,Y_3] + [Y_1,Y_2] = [Y_2,Y_1] + [Y_1,Y_2] = \emptyset.$

Thus, the triangle has an empty 2-boundary.

**** As pointed out below by Ramsay, this is an error. Need to apply $\partial_1$ after applying $\partial_2$ and then
we get the empty set.

More generally, the boundary of a ${k}$-simplex ${\sigma = [Y_1,\ldots, Y_k]}$ is

$\displaystyle \partial_k [Y_1,\ldots, Y_k]= \sum_j [Y_1,\ldots, \hat{Y}_j,\ldots, Y_k]$

where ${\hat Y_j}$ means that ${Y_j}$ has been removed. The boundary of a chain ${C = \sum_j \sigma_j}$ is defined to be ${\partial_k C = \sum_j \partial_k \sigma_j}$.

A chain with an empty boundary is a cycle. The set of cycles is denoted by ${Z_p}$. Formally, cycles are the kernel of boundary map: ${Z_p = {\sf ker}\partial_p}$.

Some elements of ${Z_p}$ are “filled in cycles.” Formally, these boundary cycles are boundaries of chains in ${C_{p+1}}$. That is, the boundary cycles ${B_p \subset Z_p}$ are the image of ${\partial_{p+1}}$: ${B_p = {\sf im}\partial_{p+1}}$. It may be verified that

$\displaystyle \partial_{p-1}\partial_p C_p = \emptyset.$

This yields the key picture above.

Now two cycles ${z}$ and ${z'}$ are considered equivalent if they differ by an element of ${B_p}$. That is, ${z\sim z'}$ if ${z = z' + b}$ for some ${b\in B_P}$. The set equivalence classes ${H_p}$ is the homology group. Formally, ${H_p = Z_p/C_p}$. (Think of the two cycles in the first example defining the same hole.) The order of the group is called the Betti number and is denoted by ${\beta_p}$.

How do we compute the groups and the Betti numbers? All of the above can be expressed as matrices. Basically, the matrices capture which simplices are boundaries of which other simplices. This is the appeal of homology: it can be expressed as linear algebra. The Matlab package “Plex” will take a set of points and compute all the stuff we have discussed.

In case your getting confused, let’s step back and summarize the main point: ${\beta_0}$ is the number of connected components. ${\beta_1}$ is the number of holes. ${\beta_2}$ is the number of tunnels. etc.

3. Persistence

Notice that there is a tuning parameter ${\epsilon}$. The conclusions depend crucially on ${\epsilon}$. How do we choose it? Usually, we don’t. Instead, one displays the topological information as a function of ${\epsilon}$. This can be done in a barcode plot as follows: (this is Figure 4 from Ghrist (2008)):

Source: Figure 4, from Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.

What we see here is that as ${\epsilon}$ varies, some topological features appear and then disappear. The features that persist for a long stretch are considered to be real; the small ones are called topological noise. This type of analysis is called persistent homology.

4. What Are We Actually Estimating?

Suppose the data ${Y_1,\ldots, Y_n}$ were obtained as follows. First we sample data ${X_1,\ldots, X_n}$ from a distribution ${G}$ which is supported on some manifold ${M}$. We don’t observe ${X_i}$. Instead, we observe ${Y_i = X_i + \delta_i}$ where ${\delta_1,\ldots,\delta_n}$ represent random noise.

The manifold ${M}$ has homology groups defined in a way similarly to how we defined the groups for the data. The hope is that the homology from the data estimates the homology of ${M}$. The implicit belief seems to be that persistent features (long bars in the barcode plot) are real features of the underlying manifold.

What’s known more precisely is that, under certain conditions, and for some choices of ${\epsilon}$, the Čech complex has high probability of having the right homology. See, for example, Niyogi, Smale and Weinberger (2011). There are other ways to try to estimate the homology of ${M}$; see for example, Caillerie (et al 2012).

In fact, the data need to be cleaned first by removing low density points. This means we need to first do some sort of density estimation. (This relates to my earlier comment about the lack of consistency of single linkage clustering.) So there are in fact many tuning parameters and practical ways to choose appropriate tuning parameters still seem elusive. This is an example of what I like to call the curse of tuning parameters.

5. What Do We Use It For?

TDA is fascinating. But is it useful? One can find many examples of TDA in genomics, neuroscience and image analysis, for example,

But is TDA crucial here or is it a hammer looking for a nail? After all, the most successful data analysis tools are simple and homology is pretty complicated.

To me, the jury is still out. I find TDA interesting but whether it will prove to be an enduring set of data analysis methods remains to be seen. In the meantime, I hope people will keep thinking about it.

6. Refefences

Edelsbrunner, H. and Harer, J. (2010). Computational Topology: An Introduction.

Zomorodian, Afra. (2005). Topology for Computing.

Balakrishnan, S., Rinaldo, A., Sheehy,D., Singh,A. and Wasserman, L. (2011). Minimax Rates for Homology Inference.

Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.

Y. Dabaghian, F. Mémoli, L. Frank, G. Carlsson. (2012). A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology. link.

Niyogi, P. and Smale, S. and Weinberger, S. (2011). A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40, p 646.

Caillerie, Chazal, Dedecker, and Michel. (2012). Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5, 1394–1423. link.

— Larry Wasserman

1. Christian Hennig
Posted July 2, 2012 at 6:12 pm | Permalink

“What are we actually estimating?” I wonder whether a “underlying true density”-probability setup is appropriate to “explain” what TDA is actually doing. I don’t have better things to offer right now so may be convinced eventually, but at first sight probability modelling seems to be a rather artificial post-hoc thing here, and I’d suspect that a useful explanation of it could be given from a rather different point of view. To illustrate what I’m getting at: I think that “estimating high density regions and failing” is a rather inappropriate description of what Single Linkage is about, and “achieving good full separation of clusters at any cost” and “best approximation of an ultrametric from below” look better to me, and neither of these has probability in it.

• Posted July 3, 2012 at 8:24 am | Permalink

Good point.
Most of the applications focus on physical processes where there
seems to be some underlying manifold and they want the homology of that manifold.
But I have seen more exploratory examples too.
–Larry

2. Posted July 3, 2012 at 11:36 am | Permalink

I think Ron Atkin introduced this stuff in the early 1970’s with his q-analysis (see http://en.wikipedia.org/wiki/Q-analysis). The group I was with at the time were never able to successfully apply it to any problems we cared about. I suspect that you need to pick your problems very carefully and completely re-orient your mindset to be able to apply this stuff (and even then …).

• John Wilder
Posted August 20, 2012 at 1:57 pm | Permalink

What were the problems that you cared about?

Atkin (still alive!) recently published MATHEMATICAL PHYSICS, An In-Depth Study, which I found interesting.

• Posted September 2, 2012 at 5:41 pm | Permalink

We were psychologists and looking at things like environmental psychology (reactions of people to their environments) and experimental asthetics. From memory the empirical work that Atkin reported in IJMMS consisted of analyses of the physical and organisational structure of his unviversity (I particularly enjoyed the description of the Vice Chancellor as an object filling a q-hole in the committe structure of the university) and the English town of Saffron Walden. I think the issues we had arose from our data not having a naturally combinatoric structure and our relations beings graded rather than naturally binary. (Also, from memory) Atkin’s analysis was about representing the structure in a particular way which was then subbjectively interpreted rather than subsjected to statistical analysis – so the reults were very much dependent on the skill (and biases) of the interpreter.

3. Keila
Posted August 17, 2012 at 5:13 pm | Permalink

Is there available some routine of an program, some code, or a software for to apply this methodology”Topological Data Analysis” in clustering of a data set?

• Posted August 17, 2012 at 5:25 pm | Permalink

Yes. The Matlab program called Plex.

4. Posted August 22, 2012 at 7:42 am | Permalink

Thanks for the concise summary. I am not an expert in this, but nobody else has pointed out what I see as a problem in an example given. The boundary of a triangle is not empty. Geometrically it is obvious that the boundary consists of three edges, and the combinatorics should reflect that. In most expositions an orientation of the simplex is taken into account, so that the boundary of $[Y_1,Y_2,Y_3]$ would be written as $[Y_2,Y_3] – [Y_1,Y_3] + [Y_1,Y_2]$. The negative sign is a detail, but the main point is that we cannot reduce or simplify this expression as it stands. It is a formal sum representing the three (oriented) edges of the triangle, and that is the boundary of the triangle. If we apply the boundary operator again to this chain of three edges we get $Y_3 – Y_2 – Y_3 + Y_1 + Y_2 – Y_1 = 0$, and this reflects the fact that the boundary operator applied twice vanishes identically.

• Posted August 22, 2012 at 8:02 am | Permalink

Ouch! Good point.
It is indeed two applications of the boundary operator that
(always) vanishes. Thanks for the correction.

Larry

1. […] Topological Data Analysis by Larry Wasserman. […]

2. […] Para quem deseja conhecer um pouquinho sobre essa área de pesquisa esse post do Normal Deviate apre… Gostar disso:GosteiSeja o primeiro a gostar disso. Etiquetado Agrupamento, Análise de Clusters, Análise Topológica de Dados, Clustering, TDA, Topological Data Analysis […]

3. […] may recall my previous post on Topological Data Analysis. I am happy to report that, in the interest of helping to advance this area of research, we (= […]

4. […] Topological Data Analysis […]

5. By Always the Last to Know: Ayasdi | Spark Consulting on January 28, 2013 at 6:47 am

[…] firm built on “topological data analysis,” which physically maps complex data, allowing various types of patterns to […]

6. By Topological Inference « Normal Deviate on March 30, 2013 at 9:39 pm

[…] Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo […]