## Steve Marron on “Big Data”

Steve Marron is a statistician at UNC. In his younger days he was well known for his work on nonparametric theory. These days he works on a number of interesting things including analysis of structured objects (like tree-structured data) and high dimensional theory.

Steve sent me a thoughtful email the other day about “Big Data” and, with his permission, I am posting it here.

I agree with pretty much everything he says. I especially like these two gems: First, “a better funded statistical community would be a more efficient way to get such things done without all this highly funded re-discovery.” And second: “I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas.”

Enough from me. Here is Steve’s comment:

Guest Post, by Steve Marron

My colleagues and I have lately been discussing “Big Data”, and your blog was mentioned.

Not surprisingly you’ve got some interesting ideas there. Here come some of my own thoughts on the matter.

First should one be pessimistic? I am not so sure. For me exhibit A is my own colleagues. When such things came up in the past (and I believe that this HAS happened, see the discussion below) my (at that time senior) colleagues were rather arrogantly ignorant. Issues such as you are raising were blatantly pooh poohed, if they were ever considered at all. However, this time around, I am seeing a far more different picture. My now mostly junior colleagues are taking this very seriously, and we are currently engaged in major discussion as to what we are going to do about this in very concrete terms such as course offerings, etc. In addition, while some of my colleagues think in terms of labels such as “applied statistician”, “theoretical statistician” and “probabilist”, everybody across the board is jumping in. Perhaps this is largely driven by an understanding that universities themselves are in a massive state of flux, and that one had better be a player, or else be totally left behind. But it sure looks better than some of the attitudes I saw earlier on in my career.

Now about the bigger picture. I think there is an important history here that you are totally ignoring. In particular, I view “Big Data” as just the latest manifestation of a cycle that has been rolling along for quite a long time. Actually I have been predicting the advent of something of this type for quite a while (although I could not predict the name, nor the central idea).

Here comes a personally slanted (certainly over-simplified) view of what I mean here. Think back on the following set of “exciting breakthroughs”:

– Statistical Pattern Recognition
– Artificial Intelligence
– Neural Nets
– Data Mining
– Machine Learning

Each of these was started up in EE/CS. Each was the fashionable hot topic (considered very sexy and fresh by funding agencies) of its day. Each was initially based on usually one really cool new idea, which was usually far outside of what folks working in conventional statistics had any hope (well certainly no encouragement from the statistical community) of dreaming up. I think each attracted much more NSF funding than all of statistics ever did, at any given time. A large share of the funding was used for re-invention of ideas that already existed in statistics (but would get a sexy new name). As each new field matured, there came a recognition that in fact much was to be gained by studying connections to statistics, so there was then lots of work “creating connections”.

Now given the timing of these, and how they each have played out, over time, it had been clear to me for some time that we were ripe for the next one. So the current advent of Big Data is no surprise at all. Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning). But I suspect that the need for such a thing to happen to keep this community properly funded has overcome the need for an exciting new idea. Instead of new methodology, this seems to be more driven by parallelism and cloud computing. Also I seem to see larger applied math buy-in than there ever was in the past. Maybe this is the new parallel to how optimization has appeared in a major way in machine learning.

Next, what should we do about it? Number one of course is to get engaged, and as noted above, I am heartened at least at my own local level as discussed above.

I generally agree with your comment about funding, and I can think of ways to sell statistics. For example, we should make the above history clear to funding agencies, and point out that in each case there has been a huge waste of resources on people doing a large amount of rediscovery. In most of those areas, by the time the big funding hits, the main ideas are already developed so the funding really just keeps lots of journeymen doing lots of very low impact work, with large amounts of rediscovery of things already known in the statistical community. The sell could be that a better funded statistical community would be a more efficient way to get such things done without all on this highly funded re-discovery.

But before making such a case, I suggest that is it important to face up to our own shortcomings, from the perspective of funding agencies. I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas. While I love working with statistical concepts, and have a personal love of new ideas, it has not escaped my notice that I have always been in something of a minority in that regard. We not only do not choose to reward creativity, we often tend to squelch it. I still remember the first time I applied for an NSF grant. I was ambitious, and the reviews I got back said the problem was interesting, but I had no track record, the reviewers were skeptical of me, and I did not get funded. This was especially frustrating as by the time I got those reviews I had solved the stated problem. It would be great if that could be regarded as an anomaly of the past when folks may have been less enlightened than now. However, I have direct evidence that this is not true. Unfortunately exactly that cycle repeated itself for one of my former students on this very last NSF cycle.

What should we do to deserve more funding? Somehow we need a bigger tent, which is big enough to include the creative folks who will be coming up with the next really big ideas (big enough to include the folks who are going to spawn the next new community, such as those listed above). This is where research funding should really be going to be most effective.

Maybe more important, we need to find a way to create a statistical culture that reveres new ideas, instead of fearing and shunning them.

Best, Steve

Brad Efron wrote to me and posed an interesting statistical question:

“Last Wednesday Diane Sawyer interviewed an Oklahoma woman who twice
hundred-trillion chance!” said Diane. ABC showed a nice map with the
and 1 mile wide. Then the track of the 1999 storm was superimposed,
about the same dimensions, the two intersecting in a roughly 1 square
mile lozenge. Diane added that the woman “lives right in the center of

Question: what odds should have Diane quoted? (and for that matter,
what is the right event to consider?)

By the way, I should add that Diane Sawyer has a history of
broadcasting stories filled with numerical illiteracy. She did a long
series opposing the use of lean finely textured beef (LFTB), also
known as “pink slime.” In fact, LFTB is perfectly healthy, its use
requires slaughtering many fewer cows each year and makes meat cheaper
for poor people. The series was denounced by many scientists and even
environmentalists. ABC is being sued for over one billion dollars.

She also did a long series on “Buy America” encouraging people to
shun cheap goods from abroad. This is like telling people who live in
Cleveland to shun buying any products and services not produced in
Cleveland (including not watching ABC news which is produced in New
York, or reading statistics papers not written in Cleveland.) This
high-school level mistake in economics is another example of Ms.
Sawyer’s numerical illiteracy.

But I digress.

What is a good way to compute the odds that someone has their house

I open it up for discussion.

Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein’s paradox.

When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don’t remember. My first memory is watching the Beatles on Ed Sullivan.)

Similarly, statisticians used to say: do you remember where you were when you heard about Stein’s paradox? That’s how surprising it was. (I don’t remember since I wasn’t born yet.)

Here is the paradox. Let ${X \sim N(\theta,1)}$. Define the risk of an estimator ${\hat\theta}$ to be

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta (\hat\theta-\theta)^2 = \int (\hat\theta(x) - \theta)^2 p(x;\theta) dx.$

An estimator ${\hat\theta}$ is inadmissible if there is another estimator ${\theta^*}$ with smaller risk. In other words, if

$\displaystyle R_{\theta^*}(\theta) \leq R_{\hat\theta}(\theta) \ \ {\rm for\ all\ }\theta$

with strict inequality at at least one ${\theta}$.

Question: Is ${\hat \theta \equiv X}$ admissible.

Now suppose that ${X \sim N(\theta,I)}$ where now ${X=(X_1,X_2)^T}$, ${\theta = (\theta_1,\theta_2)^T}$ and

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.$

Question: Is ${\hat \theta \equiv X}$ admissible.

Now suppose that ${X \sim N(\theta,I)}$ where now ${X=(X_1,X_2,X_3)^T}$, ${\theta = (\theta_1,\theta_2,\theta_3)^T}$ and

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.$

Question: is ${\hat \theta \equiv X}$ admissible.

If you don’t find this surprising then either you’ve heard this before or you’re not thinking hard enough. Keep in mind that the coordinates of the vector ${X}$ are independent. And the ${\theta_j's}$ could have nothing to do with each other. For example, ${\theta_1 = }$ mass of the moon, ${\theta_2 = }$ price of coffee and ${\theta_3 = }$ temperature in Rome.

In general, ${\hat\theta \equiv X}$ is inadmissible if the dimension ${k}$ of ${\theta}$ satisfies ${k \geq 3}$.

The proof that ${X}$ is inadmissible is based on defining an explicit estimator ${\theta^*}$ that has smaller risk than ${X}$. For example, the James-Stein estimator is

$\displaystyle \theta^* = \left( 1 - \frac{k-2}{||X||^2}\right) X.$

It can be show that the risk of this estimator is strictly smaller than the risk of ${X}$, for all ${\theta}$. This implies that ${X}$ is inadmissible. If you want to see the detailed calculations, have a look at Iain Johnstone’s at this site which he makes freely available on his website.

Note that the James-Stein estimator shrinks ${X}$ towards the origin. (In fact, you can shrink towards any point; there is nothing special about the origin.) This can be viewed as an empirical Bayes estimator where ${\theta}$ has a prior of the form ${\theta \sim N(0,\tau^2)}$ and ${\tau}$ is estimated from the data. The Bayes explanation gives some nice intuition. But it’s also a bit misleading. The Bayes explanation suggests we are shrinking the means together because we expect them a priori to be similar. But the paradox holds even when the means are not related in any way.

Some intuition can be gained by thinking about function estimation. Consider a smooth function ${f(x)}$. Suppose we have data

$\displaystyle Y_i = f(x_i) + \epsilon_i$

where ${x_i = i/n}$ and ${\epsilon_i \sim N(0,1)}$. Let us expand ${f}$ in an orthonormal basis: ${f(x) = \sum_j \theta_j \psi_j(x)}$. To estimate ${f}$ we need only estimate the coefficients ${\theta_1,\theta_2,\ldots,}$. Note that ${\theta_j = \int f(x) \psi_j(x) dx}$. This suggests the estimator

$\displaystyle \hat\theta_j = \frac{1}{n}\sum_{i=1}^n Y_i \psi_j(x_i).$

But the resulting function estimator ${\hat f(x) = \sum_j \hat\theta_j \psi_j(x)}$ is useless because it is too wiggly. The solution is to smooth the estimator; this corresponds to shrinking the raw estimates ${\hat\theta_j}$ towards 0. This adds bias but reduces variance. In other words, the familiar process of smoothing, which we use all the time for function estimation, can be seen as “shrinking estimates towards 0” as with the James-Stein estimator.

If you are familiar with minimax theory, you might find the Stein paradox a bit confusing. The estimator ${\hat\theta = X}$ is minimax, that is, it’s risk achieves the minimax bound

$\displaystyle \inf_{\hat\theta}\sup_\theta R_{\hat\theta}(\theta).$

This suggests that ${X}$ is a good estimator. But Stein’s paradox tells us that ${\hat\theta = X}$ is inadmissible which suggests that it is a bad estimator.

No. The risk ${R_{\hat\theta}(\theta)}$ of ${\hat\theta=X}$ is a constant. In fact, ${R_{\hat\theta}(\theta)=k}$ for all ${\theta}$ where ${k}$ is the dimension of ${\theta}$. The risk ${R_{\theta^*}(\theta)}$ of the James-Stein estimator is less than the risk of ${X}$, but, ${R_{\theta^*}(\theta)\rightarrow k}$ as ${||\theta||\rightarrow \infty}$. So they have the same maximum risk.

On the one hand, this tells us that a minimax estimator can be inadmissible. On the other hand, in some sense it can’t be “too far” from admissible since they have the same maximum risk.

Stein first reported the paradox in 1956. I suspect that fewer and fewer people include the Stein paradox in their teaching. (I’m guilty.) This is a shame. Paradoxes really grab students’ attention. And, in this case, the paradox is really fundamental to many things including shrinkage estimators, hierarchical Bayes, and function estimation.

## Aaronson, COLT, Bayesians and Frequentists

Aaronson, COLT, Bayesians and Frequentists

I am reading Scott Aaronson’s book “Quantum Computing Since Democritus” which can be found here.

The book is about computational complexity, quantum mechanics, quantum computing and many other things. It’s a great book and I highly recommend it. Much of the material on complexity classes is tough going but you can skim over some of the details and still enjoy the book. (That’s what I am doing.) There at least 495 different complexity classes: see the complexity zoo. I don’t know how anyone can keep track of this.

Anyway, there is a chapter on computational learning theory that I wanted to comment on. (There is another chapter about probabilistic reasoning and the anthropic principle which I’ll comment on in a future post.) Scott gives a clear introduction to learning theory and he correctly traces the birth of the theory to Leslie Valiant’s 1984 paper that introduced PAC (probably almost correct) learning. He also contrasts PAC learning with Bayesian learning.

Now I want to put on my statistical curmudgeon hat and complain about computational learning theory. My complaint is this: the discovery of computational learning theory is nothing but the re-discovery of a 100 year old statistical idea called a “confidence interval.”

Let’s review some basic learning theory. Let ${{\cal R}}$ denote all axis aligned rectangles in the plane. We can think of each rectangle ${R}$ as a classifier: predict ${Y=1}$ if ${X\in R}$ and predict ${Y=0}$ if ${X\notin R}$. Suppose we have data ${(X_1,Y_1),\ldots, (X_n,Y_n)}$. If we pick the rectangle ${\hat R}$ that makes the fewest classification errors on the data, will we predict well on a new observation? More formally: is the empirical risk estimator a good predictor?

Yes. The reason is simple. Let ${L(R) = \mathbb{P}(Y\notin R)}$ be the prediction risk and let

$\displaystyle L_n(R) = \frac{1}{n}\sum_{i=1}^n I(Y_i \notin R)$

be the empirical estimate of the risk. We would like to claim that ${\hat R}$ is close to the best classifier in ${{\cal R}}$. That is, we would like to show that ${L(\hat R)}$ is close to ${\inf_{R\in {\cal R}} L(R)}$, with high probability. This fact follows easily if we can show that

$\displaystyle \sup_{R\in {\cal R}} | L_n(R) - L(R)|$

is small with high probability. And this does hold since the VC dimension of ${{\cal R}}$ is finite. Specifically, the key fact is that, for any distribution of the data ${P}$, we have

$\displaystyle P^n\Bigl(\sup_{R\in {\cal R}} | L_n(R) - L(R)| > \epsilon\Bigr) \leq c_1 \exp\left( - c_2 n \epsilon^2 \right)$

for known constants ${c_1}$ and ${c_2}$.

But this is equivalent to saying that a ${1-\alpha}$ confidence interval for ${L(\hat R)}$ is

$\displaystyle C_n = [L_n(\hat R) - \epsilon_n,\ L_n(\hat R) + \epsilon_n]$

where

$\displaystyle \epsilon_n = \sqrt{\frac{1}{n c_2}\log\left(\frac{c_1}{\alpha}\right)}.$

That is, for any distribution ${P}$,

$\displaystyle P^n( L(\hat R) \in C_n)\geq 1-\alpha.$

As Scott points out, what distinguishes this type of reasoning from Bayesian reasoning, is that we require this to hold uniformly, and that there are no priors involved. To quote from his book:

This goes against a belief in the Bayesian religion, that if your priors are different then you come to an entirely different conclusion. The Bayesian starts out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’s rule.

That’s one way to do it, but computational learning theory tells us that it’s not the only way. You don’t need to start out with any assumptions about a probability distribution over the hypotheses … you’d like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians’ probability distribution over hypotheses for a probability distribution over sample data.

(Note: “hypothesis” = classifier and “concept class” = set of classifiers, “learn” = estimate).

Now, I agree completely with the above quote. But as I said, it is basically the definition of a frequentist confidence interval.

So my claim is that computational learning theory is just the application of frequentist confidence intervals to classification.

There is nothing bad about that. The people who first developed learning theory were probably not aware of existing statistical theory so they re-developed it themselves and they did it right.

But it’s my sense — and correct me if I’m wrong — that many people in computational learning theory are still woefully ignorant about the field of statistics. It would be nice if someone in the field read the statistics literature and said: “Hey, these statistics guys did this 50 years ago!”

Am I being too harsh?

## The Perils of Hypothesis Testing … Again

A few months ago I posted about John Ioannidis’ article called “Why Most Published Research Findings Are False.”

Ioannidis is once again making news by publishing a similar article aimed at neuroscientists. This paper is called “Power failure: why small sample size undermines the reliability of neuroscience.” The paper is written by Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafo.

When I discussed the first article, I said that his points were correct but hardly surprising. I thought it was fairly obvious that ${P(A|H_0) \neq P(H_0|A)}$ where ${A}$ is the event that a result is declared significant and ${H_0}$ is the event that the null hypothesis is true. But the fact that the paper had such a big impact made me realize that perhaps I was too optimistic. Apparently, this fact does need to be pointed out.

The new paper has basically the same message although the emphasis is on the dangers of low power. Let us assume that for a fraction of studies ${\pi}$, the null is actually false. That is ${P(H_0) = 1-\pi}$. Let ${\gamma}$ be the power. Then the probability of a false discovery, assuming we reject ${H_0}$ when the p-value is less than ${\alpha}$, is

$\displaystyle P(H_0|A) = \frac{ P(A|H_0) P(H_0)}{ P(A|H_0) P(H_0)+ P(A|H_1) P(H_1)} = \frac{\alpha (1-\pi)}{\alpha (1-\pi)+ \gamma \pi}.$

Let us suppose, for the sake of illustration that ${\pi = 0.1}$ (most nulls are true). Then the probability of a false discovery (using ${\alpha}$ = 0.05) looks like this as a function of power:

So indeed, if the power is low, the chance of a false discovery is high. (And things are worse if we include the effects of bias.)

The authors go on to estimate the typical neuroscience studies. They conclude that the typical power is between .08 and .31. I applaud them for trying to come up with some estimate of the typical power but I doubt that the estimate is very reliable.

The paper concludes with a number of sensible recommendations such as: performing power calculations before doing a study, disclosing methods transparently and so on. I wish they had included one more recommendation: focus less on testing and more on estimation.

So, like the first paper, I am left with the feeling that this message, too, is correct, but not surprising. But I guess that these points are not so obvious to many users of statistics. In that case, papers like these serve an important function.

## Data Science: The End of Statistics?

Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.

If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.

Well put.

Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.

Two questions come to mind:

1. Why do statisticians find themselves left out?

2. What can we do about it?

I’d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.

1. Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like.

The scientist goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.

Who would you go to?

I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.

2. Generally, speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of size 10 billion. And each data point had dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.

On to question 2. What do we do about it?

Whining won’t help. We can complain that that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumption and so on. No one is listening.

First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.

Second, we need to hire CS people to be on the faculty in statistics department. This won’t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?

Third, statistics needs a separate division at NSF. Simply renaming DMS (Division of Mathematical Sciences) as has been debated, isn’t enough. We need our own pot of money. (I realize this isn’t going to happen.)

To summarize, I don’t really have any ideas. Does anyone?

## Super-efficiency: “The Nasty, Ugly Little Fact”

Super-efficiency: The Nasty, Ugly Little Fact

I just read Steve Stigler’s wonderful article entitled: “The Epic Story of Maximum Likelihood.” I don’t know why I didn’t read this paper earlier. Like all of Steve’s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.

As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher’s “proof” that the maximum likelihood estimator is optimal. The “nasty, ugly little fact” is the problem of super-efficiency.

1. Hodges Example

Suppose that

$\displaystyle X_1, \ldots, X_n \sim N(\theta,1).$

The maximum likelihood estimator (mle) is

$\displaystyle \hat\theta = \overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i.$

We’d like to be able to say that the mle is, in some sense, optimal.

The usual way we teach this, is to point out that ${Var(\hat\theta) = 1/n}$ and that any other consistent estimator must have a variance which is at least this large (asymptotically).

Hodges’ famous example shows that this is not quite right. Hodges’ estimator is:

$\displaystyle T_n = \begin{cases} \overline{X}_n & \mbox{if } |\overline{X}_n| \geq \frac{1}{n^{1/4}}\\ 0 & \mbox{if } |\overline{X}_n| < \frac{1}{n^{1/4}}. \end{cases}$

If ${\theta\neq 0}$ then eventually ${T_n = \overline{X}_n}$ and hence

$\displaystyle \sqrt{n}(T_n - \theta) \rightsquigarrow N(0,1).$

But if ${\theta =0}$, then eventually ${\overline{X}_n}$ is in the window ${[-n^{-1/4},n^{-1/4}]}$ and hence ${T_n = 0}$. i.e. it is equal to the true value. Thus, when ${\theta \neq 0}$, ${T_n}$ behaves like the mle. But when ${\theta=0}$, it is better than the mle.

Hence, the mle is not optimal, at least, not in the sense Fisher claimed.

2. Rescuing the mle

Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of ${n}$:

There is a price to pay for the small risk at ${\theta=0}$: the risk for values near 0 is huge. Can we leverage the picture above into a precise statement about optimality?

First, if we look at the maximum risk rather than the pointwise risk then we see that the mle is optimal. Indeed, ${\overline{X}_n}$ is the unique estimator that is minimax for all bowl-shaped estimators. See my earlier post on this.

Second, Le Cam showed that the mle is optimal among all regular estimators. These are estimators whose distribution is not affected by small changes in the parameter. This is known as Le Cam’s convolution theorem because he showed that the limiting distribution of any regular estimator is equal to the distribution of the mle plus (convolved with) another distribution. (There are, of course, regularity assumptions involved.)

Chapter 8 of van der Vaart (1998) is a good reference for these results.

3. Why Do We Care?

The idea of all of this, was not to rescue the claim that “the mle is optimal” at any cost. Rather, we had a situation where it was intuitively clear that something was true in some sense but it was difficult to make it precise.

Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found here and here.

That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?

References

Stigler, S. (2007). The epic story of maximum likelihood. Statistical Science, 22, 598-620.

van der Vaart. (1998). Asymptotic Statistics. Cambridge.

van der Vaart, Aad. (2002). The statistical work of Lucien Le Cam. Ann. Statist., 30, 631-682.

## Topological Inference

We uploaded a paper called Statistical Inference For Persistent Homology on arXiv. (I posted about topological data analysis earlier here.) The paper is written with Siva Balakrishnan, Brittany Fasy, Fabrizio Lecci, Alessandro Rinaldo and Aarti Singh.

The basic idea is this. We observe data ${{\cal D} = \{X_1,\ldots, X_n\}}$ where ${X_1,\ldots, X_n \sim P}$ and ${P}$ is supported on a set ${\mathbb{M}}$. We want to infer topological features of ${\mathbb{M}}$. For example, here are data from a donut:

The top left plot is the set ${\mathbb{M}}$. The top right plot shows a sample from a distribution on ${\mathbb{M}}$. The bottom left plot is a union of balls around the data. The bottom right is something called a Čech complex which we can ignore here. The donut has one connected component and one hole. Can we recover this information from the data? Consider again the bottom left plot which shows

$\displaystyle \hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)$

where ${B(X_i,\epsilon)}$ is a ball of size ${\epsilon}$ centered at ${X_i}$. When ${\epsilon}$ is just right, ${\hat S_\epsilon}$ will have one connected component and one hole, just like ${\mathbb{M}}$. Here are plots of ${\hat S_\epsilon = \bigcup_{i=1}^n B(X_i,\epsilon)}$ for various values of ${\epsilon}$:

As we vary ${\epsilon}$, topological features (connected components, holes etc) will be born and then die. Each feature can be represented by a pair ${z=({\rm birth},{\rm death})}$ indicating the birth time and death time of that feature. A persistence diagram ${{\cal P}}$ is a plot of the points ${z}$. Small features will have death times close to their birth times. In other words, small features correspond to a point ${z}$ near the diagonal. Here is an example of a persistence diagram:

Informally, topological noise correspond to points near the diagonal and topological signal corresponds to points far from the diagonal. The goal of our paper is to separate noise from signal, in a statistical sense. Here are some details.

The distance function to ${\mathbb{M}}$ is

$\displaystyle d_{\mathbb{M}}(x) = \inf_{y\in \mathbb{M}} ||x-y||$

and the distance function to the data is

$\displaystyle \hat d(x) = \inf_{X_i\in {\cal D}} ||x-X_i||.$

The persistence diagram ${{\cal P}}$ of ${\mathbb{M}}$ represents the evolution of the topology of the lower level sets ${\{x:\ d_{\mathbb{M}}(x)\leq \epsilon\}}$ as a function of ${\epsilon}$. The estimate ${\hat{\cal P}}$ of ${{\cal P}}$ is the persistence diagram based on the lower level sets of ${\hat d}$, that is, ${\{x:\ \hat d(x) \leq \epsilon\}.}$ But notice that ${\{x:\ \hat d(x) \leq \epsilon\}}$ is precisely equal to ${\hat S_\epsilon}$:

$\displaystyle \hat S_\epsilon = \{x:\ \hat d(x) \leq \epsilon\}.$

To make inferences about ${{\cal P}}$ from ${\hat{\cal P}}$ we need to infer how far apart these two diagrams are. The bottleneck distance ${W_\infty(\hat{\cal P},{\cal P})}$ measures the distance between the estimated and true persistence diagram. Basically, this measures how much we have to move the points in ${\hat{\cal P}}$ to match the points in ${{\cal P}}$ as in this plot:

Our paper shows several ways to construct a bound ${c_n = c_n(X_1,\ldots, X_n)}$ so that

$\displaystyle \limsup_{n\rightarrow\infty} \mathbb{P}(W_\infty(\hat{\cal P},{\cal P}) > c_n) \leq \alpha.$

Then we add a strip of size ${c_n}$ (actually, of size ${\sqrt{2}c_n}$) to the persistence diagram. Points in the strip are declared to be “noise” as in this example:

To bound ${W_\infty(\hat{\cal P},{\cal P})}$ we use the fact that ${W_\infty(\hat{\cal P},{\cal P}) \leq H({\cal D},\mathbb{M})}$ where ${H}$ is the Hausdorff distance which is defined by

$\displaystyle H(A,B) = \inf \Bigl\{ \epsilon:\ A \subset B \oplus \epsilon\ \ \ {\rm and}\ \ \ B \subset A \oplus \epsilon\Bigr\}$

where

$\displaystyle A \oplus \epsilon = \bigcup_{x\in A}B(x,\epsilon)$

and ${B(x,\epsilon)}$ is a ball of size ${\epsilon}$ centered at ${x}$.

To summarize: to find a confidence interval for ${W_\infty(\hat{\cal P},{\cal P})}$ it suffices to get a confidence interval for ${H({\cal D},\mathbb{M})}$.

One approach for bounding ${H({\cal D},\mathbb{M})}$ is subsampling. We draw random subsamples, ${\Omega_1,\ldots,\Omega_N}$, each of size ${b}$, from the original data. Here ${b=b_n}$ satisfies ${b_n = o(n)}$ and ${b_n \rightarrow \infty}$ as ${n\rightarrow\infty}$. Then we compute

$\displaystyle T_j = H(\Omega_j,{\cal D}),\ \ \ j=1,\ldots, N.$

Now we compute

$\displaystyle L(t) = \frac{1}{N}\sum_{j=1}^N I(T_j > t).$

Finally, we get the upper quantile of this distribution, ${c_n= 2L^{-1}(\alpha)}$. (The factor of 2 is for technical reasons). This gives us what we want. Specifically,

$\displaystyle \mathbb{P}(W_\infty(\hat{\cal P}, {\cal P}) > c_n) \leq \alpha+ O\left(\sqrt{ \frac{b \log n}{n}}\right) + \frac{2^d}{n\log n} \approx \alpha$

where ${d}$ is the dimension of the set ${\mathbb{M}}$.

The paper contains several other methods for bounding ${W_\infty(\hat{\cal P},{\cal P})}$. Well, I hope I have piqued your interest. If so, have a look at the paper and let us know if you have any comments.

P.S. Sebastien Bubeck has a new blog, here

## Amanda Knox and Statistical Nullification

In today’s New York Times, Leila Schneps and Coralie Colmez correctly warn that

… math can become a weapon that impedes justice and destroys innocent lives.

They discuss Lucia de Berk, and Sally Clark, two unfortunate people who were convicted of crimes based on bogus statistical arguments. Statistician Richard Gill helped get de Berk’s conviction overturned.

So why do Schneps and Colmez argue that statistics should be used to help re-try Amanda Knox in a what will be another Italian travesty of justice?

They criticize the judge for bad statistical reasoning. Referring to a new test of DNA on a knife, they say:

His reasoning? If the scientific community recognizes that a test on so small a sample cannot establish identity beyond a reasonable doubt, then neither could a second test test on an even smaller sample.

They go on to argue that basic statistics tells us that further testing would “tell us something about the likely accuracy of the first result. Getting the same result after a third test would give yet more credence to the original finding.”

Indeed. It would confirm that she had touched a knife in her own apartment. I guess you would find my DNA all over the cutlery in my house. I hope no one uses that as evidence in a murder trial.

The Amanda Knox trial was a joke; it never would have made it to court in the U.S.

All of which raises a question. Who would testify as a statistical expert in a court case without taking into account the context? Even if I thought Schneps and Colmez were correct in their statistical reasoning, I would not testify on the narrow question of whether a second test would strengthen that particular piece of evidence if I thought the trial was a scam.

Some time ago, I was asked to be an expert witness on a statistical issue regarding random drug testing of high school students in Pennsylvania. The attorney working for the tyrants (the school district) tried to convince me that he had a good statistical case for doing such testing. I told him that, despite the attractive fee and the correctness of his statistical arguments, I would not testify in a case that helped the government harass its own citizens.

What Schneps and Colmez failed to mention was context. I don’t care how sound the statistical argument is. If it is being used for unethical and unjust means, we should refuse to testify. Call it Statistical Nullification.

## Shaking the Bayesian Machine

Brad is one of the most famous statisticians in the world and his contributions to the field of statistics are too numerous to list. Probably he is best known for: inventing the bootstrap, for starting the field of the geometry of statistical models, for least angle regression (an extension of the lasso he developed with Trevor Hastie, Iain Johnstone and Rob Tibshirani), for his work on empirical Bayes, large-scale multiple testing, and many other things.

He has won virtually every possible award including The MacArthur Award and The National Medal of Science.

So, as you can imagine, his visit was a big deal.

He spoke on Frequentist Accuracy of Bayesian Estimates. The key point in the talk was a simple formula for computing the frequentist standard error of a Bayes estimator. Suppose ${\theta = t(\mu)}$ is the parameter of interest for a statistical model

$\displaystyle \{ f_\mu(x):\ \mu\in\Omega\}$

and we want the standard error of the Bayes estimator ${\mathbb{E}(\theta|x)}$.

The formula is

$\displaystyle se = \left[ cov(t(\mu),\lambda\,|\,x)^T \, V \, cov(t(\mu),\lambda\,|\,x)\right]^{1/2}$

where

$\displaystyle \lambda = \nabla_x \log f_\mu(x)$

and

$\displaystyle V = Cov_\mu(X).$

In addition, one can compute the Bayes estimator (in some cases) by using a re-weighted parametric bootstrap. This is much easier than the usual approach based on Monte Carlo Markov Chain. The latter leads to difficult questions about convergence that are completely avoided by using the re-weighted bootstrap.

See the link to his talk for details and examples.

Now you might ask: why would we want the frequentist standard error of a Bayesian estimator? There are (at least) two reasons.

First, the frequentist can view Bayes as a way to generate estimators. And, like any estimator, we would want to estimate its standard error.

Second, the Bayesian who is not completely certain of his prior (which I guess is every Bayesian) might be interested in doing a sensitivity analysis. One way to do such a sensitivity analysis is to ask: how much would my estimator change if I changed the data? The standard error provides one way to answer that question. Brad called this Shaking the Bayesian Machine. Great phrase.

Following the seminar we had a great dinner. I fully participated in copious reminiscing among senior faculty, while my young colleague Ryan Tibshirani wondered what we were talking about. I’m definitely one of the old experienced guys in the department now.

I asked Brad if he’d like to write a guest post for this blog. He didn’t say yes but he didn’t say no either. Let’s keep our fingers crossed. (I’m happy to report that my colleague Steve Fienberg did commit to writing one.)

Bottom line: Brad is doing lots of cool stuff. Check out his web page.