Monthly Archives: March 2013

Topological Inference

We uploaded a paper called Statistical Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.

The basic idea is this. We observe data {{\cal D} = \{X_1,\ldots, X_n\}} where {X_1,\ldots, X_n \sim P} and {P} is supported on a set {\mathbb{M}}. We want to infer topological features of {\mathbb{M}}. For example, here are data from a donut:

Donut

The top left plot is the set {\mathbb{M}}. The top right plot shows a sample from a distribution on {\mathbb{M}}. The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot which shows

\displaystyle  \hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)

where {B(X_i,\epsilon)} is a ball of size {\epsilon} centered at {X_i}. When {\epsilon} is just right, {\hat S_\epsilon} will have one connected component and one hole, just like {\mathbb{M}}. Here are plots of {\hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)} for various values of {\epsilon}:

Union

As we vary {\epsilon}, topological features (connected components, holes etc) will be born and then die. Each feature can be represented by a pair {z=({\rm birth},{\rm death})} indicating the birth time and death time of that feature. A persistence diagram {{\cal P}} is a plot of the points {z}. Small features will have death times close to their birth times. In other words, small features correspond to a point {z} near the diagonal. Here is an example of a persistence diagram:

Persist

Informally, topological noise correspond to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.

The distance function to {\mathbb{M}} is

\displaystyle  d_{\mathbb{M}}(x) = \inf_{y\in \mathbb{M}} ||x-y||

and the distance function to the data is

\displaystyle  \hat d(x) = \inf_{X_i\in {\cal D}} ||x-X_i||.

The persistence diagram {{\cal P}} of {\mathbb{M}} represents the evolution of the topology of the lower level sets {\{x:\ d_{\mathbb{M}}(x)\leq \epsilon\}} as a function of {\epsilon}. The estimate {\hat{\cal P}} of {{\cal P}} is the persistence diagram based on the lower level sets of {\hat d}, that is, {\{x:\ \hat d(x) \leq \epsilon\}.} But notice that {\{x:\ \hat d(x) \leq \epsilon\}} is precisely equal to {\hat S_\epsilon}:

\displaystyle  \hat S_\epsilon = \{x:\ \hat d(x) \leq \epsilon\}.

To make inferences about {{\cal P}} from {\hat{\cal P}} we need to infer how far apart these two diagrams are. The bottleneck distance {W_\infty(\hat{\cal P},{\cal P})} measures the distance between the estimated and true persistence diagram. Basically, this measures how much we have to move the points in {\hat{\cal P}} to match the points in {{\cal P}} as in this plot:

matching

Our paper shows several ways to construct a bound {c_n = c_n(X_1,\ldots, X_n)} so that

\displaystyle  \limsup_{n\rightarrow\infty} \mathbb{P}(W_\infty(\hat{\cal P},{\cal P}) > c_n) \leq \alpha.

Then we add a strip of size {c_n} (actually, of size {\sqrt{2}c_n}) to the persistence diagram. Points in the strip are declared to be “noise” as in this example:

stylized

To bound {W_\infty(\hat{\cal P},{\cal P})} we use the fact that {W_\infty(\hat{\cal P},{\cal P}) \leq H({\cal D},\mathbb{M})} where {H} is the Hausdorff distance which is defined by

\displaystyle  H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B \oplus \epsilon\ \ \ {\rm and}\ \ \ B \subset A \oplus \epsilon\Bigr\}

where

\displaystyle  A \oplus \epsilon = \bigcup_{x\in A}B(x,\epsilon)

and {B(x,\epsilon)} is a ball of size {\epsilon} centered at {x}.

To summarize: to find a confidence interval for {W_\infty(\hat{\cal P},{\cal P})} it suffices to get a confidence interval for {H({\cal D},\mathbb{M})}.

One approach for bounding {H({\cal D},\mathbb{M})} is subsampling. We draw random subsamples, {\Omega_1,\ldots,\Omega_N}, each of size {b}, from the original data. Here {b=b_n} satisfies {b_n = o(n)} and {b_n \rightarrow \infty} as {n\rightarrow\infty}. Then we compute

\displaystyle  T_j = H(\Omega_j,{\cal D}),\ \ \ j=1,\ldots, N.

Now we compute

\displaystyle  L(t) = \frac{1}{N}\sum_{j=1}^N I(T_j > t).

Finally, we get the upper quantile of this distribution, {c_n= 2L^{-1}(\alpha)}. (The factor of 2 is for technical reasons). This gives us what we want. Specifically,

\displaystyle  \mathbb{P}(W_\infty(\hat{\cal P}, {\cal P}) > c_n) \leq \alpha+ O\left(\sqrt{ \frac{b \log n}{n}}\right) + \frac{2^d}{n\log n} \approx \alpha

where {d} is the dimension of the set {\mathbb{M}}.

The paper contains several other methods for bounding {W_\infty(\hat{\cal P},{\cal P})}. Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.

P.S. Sebastien Bubeck has a new blog, here

Amanda Knox and Statistical Nullification

In today’s New York Times, Leila Schneps and Coralie Colmez correctly warn that

… math can become a weapon that impedes justice and destroys innocent lives.

They discuss Lucia de Berk, and Sally Clark, two unfortunate people who were convicted of crimes based on bogus statistical arguments. Statistician Richard Gill helped get de Berk’s conviction overturned.

So why do Schneps and Colmez argue that statistics should be used to help re-try Amanda Knox in a what will be another Italian travesty of justice?

They criticize the judge for bad statistical reasoning. Referring to a new test of DNA on a knife, they say:

His reasoning? If the scientific community recognizes that a test on so small a sample cannot establish identity beyond a reasonable doubt, then neither could a second test test on an even smaller sample.

They go on to argue that basic statistics tells us that further testing would “tell us something about the likely accuracy of the first result. Getting the same result after a third test would give yet more credence to the original finding.”

Indeed. It would confirm that she had touched a knife in her own apartment. I guess you would find my DNA all over the cutlery in my house. I hope no one uses that as evidence in a murder trial.

The Amanda Knox trial was a joke; it never would have made it to court in the U.S.

All of which raises a question. Who would testify as a statistical expert in a court case without taking into account the context? Even if I thought Schneps and Colmez were correct in their statistical reasoning, I would not testify on the narrow question of whether a second test would strengthen that particular piece of evidence if I thought the trial was a scam.

Some time ago, I was asked to be an expert witness on a statistical issue regarding random drug testing of high school students in Pennsylvania. The attorney working for the tyrants (the school district) tried to convince me that he had a good statistical case for doing such testing. I told him that, despite the attractive fee and the correctness of his statistical arguments, I would not testify in a case that helped the government harass its own citizens.

What Schneps and Colmez failed to mention was context. I don’t care how sound the statistical argument is. If it is being used for unethical and unjust means, we should refuse to testify. Call it Statistical Nullification.

Shaking the Bayesian Machine

Yesterday we were fortunate to have Brad Efron visit our department and gave a seminar.

Brad is one of the most famous statisticians in the world and his contributions to the field of statistics are too numerous to list. Probably he is best known for: inventing the bootstrap, for starting the field of the geometry of statistical models, for least angle regression (an extension of the lasso he developed with Trevor Hastie, Iain Johnstone and Rob Tibshirani), for his work on empirical Bayes, large-scale multiple testing, and many other things.

He has won virtually every possible award including The MacArthur Award and The National Medal of Science.

So, as you can imagine, his visit was a big deal.

He spoke on Frequentist Accuracy of Bayesian Estimates. The key point in the talk was a simple formula for computing the frequentist standard error of a Bayes estimator. Suppose {\theta = t(\mu)} is the parameter of interest for a statistical model

\displaystyle  \{ f_\mu(x):\ \mu\in\Omega\}

and we want the standard error of the Bayes estimator {\mathbb{E}(\theta|x)}.

The formula is

\displaystyle  se = \left[ cov(t(\mu),\lambda\,|\,x)^T \, V \, cov(t(\mu),\lambda\,|\,x)\right]^{1/2}

where

\displaystyle  \lambda = \nabla_x \log f_\mu(x)

and

\displaystyle  V = Cov_\mu(X).

In addition, one can compute the Bayes estimator (in some cases) by using a re-weighted parametric bootstrap. This is much easier than the usual approach based on Monte Carlo Markov Chain. The latter leads to difficult questions about convergence that are completely avoided by using the re-weighted bootstrap.

See the link to his talk for details and examples.

Now you might ask: why would we want the frequentist standard error of a Bayesian estimator? There are (at least) two reasons.

First, the frequentist can view Bayes as a way to generate estimators. And, like any estimator, we would want to estimate its standard error.

Second, the Bayesian who is not completely certain of his prior (which I guess is every Bayesian) might be interested in doing a sensitivity analysis. One way to do such a sensitivity analysis is to ask: how much would my estimator change if I changed the data? The standard error provides one way to answer that question. Brad called this Shaking the Bayesian Machine. Great phrase.

Following the seminar we had a great dinner. I fully participated in copious reminiscing among senior faculty, while my young colleague Ryan Tibshirani wondered what we were talking about. I’m definitely one of the old experienced guys in the department now.

I asked Brad if he’d like to write a guest post for this blog. He didn’t say yes but he didn’t say no either. Let’s keep our fingers crossed. (I’m happy to report that my colleague Steve Fienberg did commit to writing one.)

Bottom line: Brad is doing lots of cool stuff. Check out his web page.

Double Misunderstandings About p-values

It’s been said a million times and in a million places that a p-value is not the probability of {H_0} given the data.

But there is a different type of confusion about p-values. This issue arose in a discussion on Andrew’s blog.

Andrew criticizes the New York times for giving a poor description of the meaning of p-values. Of course, I agree with him that being precise about these things is important. But, in reading the comments on Andrew’s blog, it occurred to me that there is often a double misunderstanding.

First, let me way that I am neither defending nor criticizing p-values in this post. I am just going to point out that there are really two misunderstandings floating around.

Two Misunderstandings

(1) The p-value is not the probability of {H_0} given the data.

(2) But neither is the p-value the probability of something conditional on {H_0}.

Deborah Mayo pointed this fact out in the discussion on Andrew’s blog (as did a few other people).

When we use p-values we are in frequentist-land. {H_0} (the null hypothesis) is not a random variable. It makes no sense to talk about the posterior probability of {H_0}. But it also makes no sense to talk about conditioning on {H_0}. You can only condition on things that were random in the first place.

Let me get more specific. Let {Z} be a test statistic and let {z} be the realized valued of {Z}. The p-value (in a two-sided test) is

\displaystyle  p = P_0(|Z| > |z|)

where {P_0} is the null distribution. It is not equal to {P\bigl(|Z|> |z| \, \bigm| \,H_0\bigr)}. This makes no sense. {H_0} is not a random variable. In case the null consists of a set {{\cal P}_0} of distributions, the p-value is

\displaystyle  p = \sup_{P\in {\cal P}_0}P(|Z| > |z|).

You could accuse me of being pedantic here or of being obsessed with notation. But given the amount of confusion about p-values, I think it is important to get it right.

More Misunderstandings

The same problem occurs when people write {p(x|\theta)}. When I teach Bayes, I do write the model as {p(x|\theta)}. When I teach frequentist statistics, I write this either as {p(x;\theta)} or {p_\theta(x)}. There is no conditioning going on. To condition on {\theta} would require a joint distribution for {(x,\theta)}. There is no such joint distribution in frequentist-land.

The coverage of a confidence interval {C(X_1,\ldots, X_n)} is not the probability that {C(X_1,\ldots, X_n)} traps {\theta} conditional on {\theta}. The frequentist coverage is

\displaystyle  {\rm Coverage} = \inf_{\theta} P_\theta\Bigl(\theta\in C(X_1,\ldots, X_n)\Bigr).

Again, there is no conditioning going on.

Conclusion

I understand that people often say “conditional on {\theta}” to mean “treating {\theta} as fixed.” But if we want to eradicate misunderstandings about statistics, I think it would help if we were more careful about how we choose our words.

The Gap

No, not the store. I am referring to the gap between the invention of a method and the theoretical justification for that method.

It is hard to quantify the gap because, for any method, there is always dispute about who invented (discovered?) it, and who nailed the theory. For example, maximum likelihood is usually credited to Ronald Fisher but one could argue that it was really invented by Gauss or Laplace. And who provided the theory for maximum likelihood? Well, most people would say Fisher. But you could also argue that Le Cam was the one who really made it rigorous.

So, with these caveats in my mind, here are a few examples of the gap. I am interested to hear your examples (as well as your opinions about my examples.)

METHOD INVENTOR THEORETICAL
JUSTIFICATION
Maximum Likelihood Fisher 1912 Fisher 1922
The Bootstrap Efron 1979 Gine and Zinn 1990
The Lasso Tibshirani 1996 Greenshtein and Ritov 2004
False Discovery Rate Schweder, Spotvoll 1982 Benjamini and Hochberg 1995
Wavelet Denoising Haar 1905 Donoho and Johnstone 1998
Cross-validation Stone 1974 Gyorfi, Kohler, Krzyzak,
and Walk 2002
Causal Inference Rubin 1974 Rubin 1974, Robins 1989
(Counterfactual Version)
Causal Inference Wright 1934 Spirtes, Glymour, Scheines 1987
(Graphical Version) and Pearl and Verma 1991

Every item in the above list is debatable. For example, I consider Gine amd Zinn (1990) the first definitive result that clarifies necessary and sufficient conditions to justify the bootstrap. But one could well argue that earlier papers Bickel and Freedman (1981) and Singh (1981) deserve that label.

What items do you disagree with?

What items would you add?

Partial List of References

John Aldrich. (1997). R.A. Fisher and the making of maximum likelihood 1912-1922. Statist. Sci., 12, 162-176.

Donoho, David L and Johnstone, Iain M. Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26, 879-921.

Efron, Bradley. (1997). Bootstrap methods: another look at the jackknife. The annals of Statistics, 7, 1-26.

Gine, Evarist and Zinn, Joel. (1990). Bootstrapping general empirical measures. The Annals of Probability, 851–869.

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Gy{ö}rfi, L{á}szl{ó} and Kohler, Michael and Krzyzak, Adam and Walk, Harro. (2002). A distribution-free theory of nonparametric regression, Springer.

Tibshirani, Robert. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 267–288.

P.S. Thanks to Ryan Tibshirani and David Choi for discussions on this topic.