## Monthly Archives: July 2012

### Differential Privacy


Privacy and confidentiality are of great concern in our era of Big Data. In this post, I want to discuss one formal approach to privacy, called differential privacy. The idea was invented by Dwork, McSherry, Nissim and Smith (2006). A nice review by Cynthia Dwork can be found here.

1. What Is It?

There is a long history of people proposing methods for releasing information while preserving privacy. But whenever someone released information, some smart cryptographer would find a way to crack the privacy.

A famous example is the Netflix Prize. Quoting from Wikipedia:

“Netflix has offered \$1,000,000 prize for a 10% improvement in its recommendation system. Netflix has also released a training dataset for the competing developers to train their systems. While releasing this dataset they had provided a disclaimer: To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly-assigned ids. Netflix is not the only available movie rating portal on the web; there are many including IMDB. In IMDB also individuals can register and rate movies, moreover they have the option of not keeping their details anonymous. Narayanan and Shmatikov had cleverly linked the Netflix anonymized training database with the Imdb database (using the date of rating by a user) to partly de-anonymize the Netflix training database.”

The problem is that you have to protect against all other information that a malicious agent might have access to.

Suppose we have a database consisting of data points ${D = \{X_1,\ldots, X_n\}}$ where ${X_i \in {\cal X}}$. We want to release some sort of statistical summary of the database but we do not want to compromise any individual in the database. Consider an algorithm ${A}$ that takes a database ${D}$ as input and then outputs some random quantity ${A(D)}$. We say that two databases ${D}$ and ${D'}$ are neighboring databases if they differ in one element. In that case we write ${D\sim D'}$.

The algorithm ${A}$ satisfies ${\epsilon}$ differential privacy if, for all sets ${S}$, and all pairs ${D\sim D'}$,

$\displaystyle P( A(D) \in S)\leq e^\epsilon \, P(A(D') \in S).$

The intuition is this. Suppose you are considering whether or not to allow yourself to be included in a database. If differential privacy holds, then the output barely depends on whether or not you agree to be in the database. More precisely, the distribution of the output has low dependence on any one individual.

2. Example

Suppose, to be concrete, that ${X_i \in [0,1]}$. Let ${f(D)}$ be a real-valued function of the database, such as the sample mean. Define the sensitivity

$\displaystyle \Delta = \sup_{D\sim D'} | f(D) - f(D')|$

where the supremum is over all neighboring pairs of databases. In the case of the mean, ${\Delta = 1/n}$.

Now we release

$\displaystyle A(D) = f(D) + Z$

where ${Z}$ has a Laplace distribution with density ${g(z)= \epsilon/(2\Delta) e^{-|z|\epsilon/\Delta}}$. It can then be shown easily that ${\epsilon}$-differential privacy holds. (There are other ways to achieve differential privacy but this is the simplest.)
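Here is a minimal sketch of this Laplace mechanism for the mean, assuming data in $[0,1]$ so that ${\Delta = 1/n}$ (my own code, not from any particular implementation; `private_mean` is just an illustrative name):

```python
import numpy as np

def private_mean(x, epsilon, rng=None):
    """Release the mean of data in [0, 1] with epsilon-differential privacy.

    Neighboring databases change the mean by at most Delta = 1/n, so
    adding Laplace noise with scale Delta/epsilon gives the guarantee.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    delta = 1.0 / len(x)                         # sensitivity of the mean
    return x.mean() + rng.laplace(scale=delta / epsilon)

# With n = 10,000 and epsilon = 1 the noise standard deviation is
# sqrt(2) * Delta / epsilon, about 0.00014: privacy costs very little here.
x = np.random.default_rng(0).uniform(size=10_000)
print(private_mean(x, epsilon=1.0))
```

Note that the noise scale depends only on the sensitivity and ${\epsilon}$, not on the actual data values; that is what makes the guarantee hold against any adversary.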

It is also possible to release a differentially private database. To do this, we first construct a differentially private density estimator ${\hat p}$. Then we draw a sample ${Z_1,\ldots, Z_n \sim \hat p}$. The resulting database ${A = (Z_1,\ldots, Z_n)}$ is also differentially private. See this and this for example.

3. What’s the Catch?

The field of differential privacy has grown very quickly. Do a Google search and you will find a goldmine of interesting papers.

But there is one problem. If the data are high-dimensional and you want to release a differentially private database, you will inevitably end up adding lots of noise. (You can see this by considering the simple but common situation of a multinomial distribution with many categories. The real data will have many empty categories but differential privacy forces you to fill in these zeroes with non-trivial probability.)

In short, differential privacy does not work well with high-dimensional data.

I attended a workshop on privacy at IPAM. The workshop was interesting because it brought together two very different groups of people. One group was computer scientists whose background was mostly in cryptography. The other group was applied statisticians. There was an interesting intellectual tension. The cryptographers wanted the strongest possible privacy guarantees. The statisticians wanted practical methods that worked well with large, messy datasets. I had sympathy for both positions.

I’m not an expert in this area but my current feeling is that differential privacy is too strong for many realistic settings. You end up outputting mostly noise. But I don’t yet know a good way to weaken differential privacy without giving up too much.

This is a fascinating area because it is of great practical interest, there is lots of interesting theory and it brings together theoretical computer science and statistics.

If you’re interested, I reiterate my suggestion to do a Google search on differential privacy. There’s a lot of stuff out there.

4. References

Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith (2006). Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography Conference (TCC), Springer, 2006.

Cynthia Dwork (2006). Differential Privacy. International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1-12.

—Larry Wasserman

### Statistical Principles?

Larry Wasserman

There are some so-called principles of statistical inference that have names like, the Sufficiency Principle (SP), the Conditionality Principle (CP) and the Likelihood Principle (LP).

Birnbaum (1962) proved that CP and SP imply LP. (But see Mayo 2010). Later, Evans, Fraser and Monette (1986) proved that CP alone implies LP (so SP is not even needed).

All of this generates controversy because CP (and SP) seem sensible. But LP is not acceptable to most statisticians. Indeed, all of frequentist inference violates LP, so if we adhered to LP we would have to abandon frequentist inference. In fact, as I’ll explain below, LP pretty much rules out Bayesian inference contrary to the claims of Bayesians.

How can CP be acceptable and LP not be acceptable when CP logically implies LP?

The reason is that the principles are bogus. What I mean is that CP might seem compelling in a few toy examples. That doesn’t mean it should be elevated to the status of a principle.

1. The Principles

SP says that: if two experiments yield the same value for a sufficient statistic, then the two experiments should yield the same inferences.

CP says that: if I flip a coin to choose which of two experiments to conduct, then inferences should depend only on the observed experiment. The fact that I could have chosen the other experiment should not affect inferences. In more technical language, the coin flip is ancillary, (its distribution is completely known), and inferences should be conditional on the ancillary.

LP says that: two experiments that yield proportional likelihood functions should yield identical inferences.

Frequentist inference violates LP because things like confidence intervals and p-values depend on the sampling distributions of estimators and so on, which involves more than just the observed likelihood function.

Bayesians seem to embrace LP and indeed use it as an argument for Bayesian inference. But two Bayesians with the same likelihood can get different inferences because they might have different priors (and hence different posterior distributions). This violates LP. Whenever I say this to people, the usual reply is: but Birnbaum’s theorem only applies to one person at a time. But this is not true. There is no hidden label in Birnbaum’s theorem that says: Hey, this theorem only applies to one person at a time.

2. CP Is Bogus

Anyway it doesn’t matter. The main point is that CP (and hence LP) is bogus. Just because it seems compelling that we should condition on the coin flip in the simple mixture example above, it does not follow that conditioning is always good. Making a leap from a simple, toy example, to a general principle of inference is not justified.

Here is a simple example. I think I got it from Jamie Robins. You observe
${(X_1,Y_1), \ldots, (X_n,Y_n)}$ where

$\displaystyle Y_i = \beta^T X_i + \epsilon_i$

and ${\epsilon_i \sim N(0,1)}$. To be concrete, let’s say that ${n=100}$ but each ${X_i = (X_{i1},\ldots, X_{id})}$ is a vector of length ${d}$ and ${d}$ is huge; ${d=100,000}$ for example. We want to estimate ${\beta_1}$. This is just linear regression with a large number of covariates.

Suppose we have some extra information: we are told that the covariates are independent. The “best” estimator (the maximum likelihood estimator) is obtained by conditioning on all the data.

This means we should estimate the vector ${\beta}$ by least squares. But, the least squares estimator is useless when ${d> n}$. We could regularize by putting a penalty or a prior. But the resulting estimators will have terrible behavior compared to the following “anti-conditioning” estimator. Just throw away most of the data. In particular, throw away all the covariates except the first one. Now do linear regression using only ${Y}$ and the first covariate. The resulting estimator ${\hat\beta_1}$ is then tightly concentrated around ${\beta_1}$ with high probability. (This is because of the independence.)
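A quick simulation illustrates the point. This is my own sketch with hypothetical numbers (a sparse ${\beta}$ with bounded norm, and ${d}$ scaled down to 10,000 so the example runs in modest memory):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10_000
beta = np.zeros(d)
beta[:5] = [2.0, -1.0, 0.5, 1.5, -0.5]   # hypothetical true coefficients

X = rng.standard_normal((n, d))           # independent covariates
Y = X @ beta + rng.standard_normal(n)

# Least squares on all d covariates is hopeless since d >> n.  Instead,
# throw away every covariate except the first and regress Y on X[:, 0].
# Independence makes the omitted terms act like extra mean-zero noise,
# so the slope estimate concentrates around beta_1 = 2.
x1 = X[:, 0]
beta1_hat = (x1 @ Y) / (x1 @ x1)
print(beta1_hat)   # lands near the true beta_1 = 2
```

The full least squares problem is not even well defined here, yet the one-covariate regression gives a tightly concentrated estimate of ${\beta_1}$.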

In this example, throwing away data is much better than conditioning on the data. We are heavily violating LP.

There are lots of other examples of great procedures that violate LP. Randomization is a good example. Methods based on randomization (such as permutation tests) are wonderful things but adherents to CP (and hence LP) are precluded from using them. The same applies to data-splitting techniques.

The bottom line is this: if we elevate lessons from toy examples into grand principles we will be led astray.

Postscript: Since I mentioned Jamie, I should alert you that in the near future, I’ll be cross-posting on this blog and Cosma’s blog about a debate between me and Jamie versus a Nobel prize winner. Stay tuned.

Evans, M.J., Fraser, D.A.S. and Monette, G. (1986). On principles and arguments to likelihood. Canadian Journal of Statistics, 14, 181-194.

Birnbaum, Allan (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57, 269-326.

Mayo, D. (2010). An Error in the Argument from Conditionality and Sufficiency to the Likelihood Principle. In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability and the Objectivity and Rationality of Science (D Mayo and A. Spanos eds.), 305-14.

### Perfunctory Examples and the Role of Applications: A Rant

Since I am working on revising a paper (always an annoying task) this seems like a good time for a rant. In particular, I want to complain about the perfunctory example.

1. Perfunctory Examples

Imagine this: you submit a paper to a journal (or a conference). The reviews are positive. But at least one reviewer feels compelled to add: “please add a real example” or “please add more simulations.”

Now, if adding an example or another simulation would truly make the paper better, then fine. But it’s my experience that the “add a real example” reaction is more of a reflex action. The reviewer has not asked himself or herself: does this paper really need an example (or another simulation)?

The result of these requests for examples is this: our journals are full of papers with perfunctory examples and simulations that add nothing to the paper. Worse, researchers waste weeks of precious time satisfying the arbitrary whims of a reviewer.

Let me repeat: I do think examples and simulations can sometimes make a paper better. But often not. It is just a waste of time. There just seems to be a basic, unquestioned assumption that a paper with a “real example” is publishable while a paper without one is not.

2. The Role of Applications

Now I want to clarify that I do think applications play a critical role in statistics and machine learning. In my opinion, theory is interesting when it is motivated, even indirectly, by real applications. (Just as theoretical physics is ultimately based on trying to understand the world we live in.)

Kiri Wagstaff has a nice paper called Machine Learning That Matters at ICML. Her abstract begins: “Much of current machine learning (ML) research has lost its connection to problems of import to the larger world of science and society.” This is very interesting to me because many years ago (perhaps in the 1980’s?) similar discussions took place in the field of statistics. I think there was a general consensus that statistics was putting too much emphasis on theory for its own sake and needed to get back to its roots.

The problem is this: while I do think that good theory is ultimately rooted in applications, the link between applications and theory can be subtle. Solving applied problems leads to the development of new methods. Then people start to ask questions about the behavior of these methods. This leads to new theory. But the chain linking theory to applications can be a long, complex, very non-linear route. Simply adding an example into a paper is not going to illuminate the connection.

3. Conclusion

To me, theory is interesting if it explains, creates or casts some light on methodology. Judging whether theory is interesting is subtle and, yes, also a matter of taste. But there is a temptation to reduce it to: is there an example in the paper?

How can we make sure that statistics and machine learning stay anchored in the real world without simply requiring that people add some bogus example to their papers? Kiri Wagstaff offers some ideas, including six impact challenges. I’d like to know what other people think about it.

One possible answer is: it doesn’t matter. People should work on, and publish, whatever they want. On average, good theory will get noticed and used; less useful theory will attract less attention. In other words, let the field find its own direction. We shouldn’t try to direct it. (This happens to be my opinion but I suspect many will disagree.)

In the meantime, let’s have a moratorium on reviewers asking for more examples and simulations, just for the sake of it.

And to borrow from Dennis Miller, the master of rants, I’ll conclude with:

That’s just my opinion, I could be wrong.

—Larry Wasserman

### The Amazing Mean Shift Algorithm

The mean shift algorithm is a mode-based clustering method due to Fukunaga and Hostetler (1975) that is commonly used in computer vision but seems less well known in statistics.

The steps are: (1) estimate the density, (2) find the modes of the density, (3) associate each data point to one mode.

1. The Algorithm

We are given data ${Y_1,\ldots, Y_n \in \mathbb{R}^d}$ which are a sample from a density ${p}$. Let

$\displaystyle \hat p(y) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{||y - Y_i||}{h}\right)$

be a kernel density estimator with kernel ${K}$ and bandwidth ${h}$. Let ${g(y) = \nabla \hat p(y)}$ denote the gradient of ${\hat p(y)}$.

Suppose that ${\hat p}$ has modes ${m_1,\ldots, m_k}$. (Note that ${k}$ is not determined in advance; it is simply the number of modes in the density estimator.) An arbitrary point ${y\in\mathbb{R}^d}$ is assigned to mode ${m_j}$ if the gradient ascent curve through ${y}$ leads to ${m_j}$. More formally, the gradient ${g}$ defines a family of integral curves ${\pi}$. An integral curve ${\pi:\mathbb{R}\rightarrow\mathbb{R}^d}$ is defined by the differential equation

$\displaystyle \pi'(t) = g(\pi(t)).$

In words, moving along ${\pi}$ corresponds to moving up in the direction of the gradient. The integral curves partition the space.

For each mode ${m_j}$, define the set ${A_j}$ to be the set of all ${y}$ such that the integral curve starting at ${y}$ leads to ${m_j}$. Intuitively, if we place a particle at ${y\in A_j}$ and let it flow up ${\hat p}$, it will end up at ${m_j}$. That is, if ${\pi}$ is the integral curve starting at ${y}$, then ${\lim_{t\rightarrow\infty} \pi(t) =m_j}$. The sets ${A_1,\ldots, A_k}$ form a partition. We can now partition the data according to which sets ${A_j}$ they fall into. In other words, we define clusters ${C_1,\ldots, C_k}$ where ${C_j = \{ Y_i:\ Y_i \in A_j\}}$.

To find the clusters, we only need to start at each ${Y_i}$, move ${Y_i}$ along its gradient ascent curve ${\pi}$ and see which mode it goes to. This is where the mean shift algorithm comes in.

The mean shift algorithm approximates the integral curves (i.e. the gradient ascent curves). Although it is usually applied to the data ${Y_i}$, it can be applied to any set of points. Suppose we pick some point ${w}$ and we want to find out which mode it belongs to. We set ${w_0=w}$ and then, for ${t=0,1,2,\ldots}$, set

$\displaystyle w_{t+1} \longleftarrow \frac{\sum_{i=1}^n Y_i K\left(\frac{||w_t - Y_i||}{h}\right)} {\sum_{i=1}^n K\left(\frac{||w_t - Y_i||}{h}\right)}.$

In other words, replace ${w_t}$ with a weighted average around ${w_t}$ and repeat. The trajectory ${w_0, w_1, w_2,\ldots,}$ converges to a mode ${m_j}$.

In practice, we run the algorithm, not on some arbitrary ${w}$ but on each data point ${Y_i}$. Thus, by running the algorithm you accomplish two things: you find the modes and you find which mode each data point ${Y_i}$ goes to. Hence, you find the mode-based clusters.
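The update above can be sketched in a few lines (my own code, not from the references; I use a Gaussian kernel ${K(x,y)=\exp(-\|x-y\|^2/(2h^2))}$, a common concrete choice, and `mean_shift` is an invented name):

```python
import numpy as np

def mean_shift(Y, h, n_iter=500, tol=1e-8):
    """Run the mean shift update on every data point.

    Each point is repeatedly replaced by a kernel-weighted average of the
    data; the trajectories converge to modes of the kernel density estimate.
    """
    W = Y.astype(float).copy()
    for _ in range(n_iter):
        # squared distances between current positions and the data
        d2 = ((W[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        K = np.exp(-d2 / (2.0 * h ** 2))          # Gaussian kernel weights
        W_new = (K @ Y) / K.sum(axis=1, keepdims=True)
        if np.abs(W_new - W).max() < tol:
            return W_new
        W = W_new
    return W

# Two well-separated 1-d clusters: all trajectories collapse to two modes,
# one near 0 and one near 5, which also labels each point's cluster.
rng = np.random.default_rng(0)
Y = np.concatenate([rng.normal(0, 0.2, 50), rng.normal(5, 0.2, 50)])[:, None]
modes = mean_shift(Y, h=0.5)
print(modes.min(), modes.max())
```

Points whose trajectories end at (numerically) the same mode form one cluster, which is exactly the mode-based partition described above.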

The algorithm can be slow, but the literature is filled with papers with speed-ups, improvements etc.

Here is a simple example:

The big black dots are data points. The blue crosses are the modes. The red curves are the mean shift paths. Pretty, pretty cool.

2. Some Theory

Donoho and Liu (1991) showed that, if ${p}$ behaves in a locally quadratic fashion around the modes, then the minimax rate for estimating modes is ${n^{-\frac{1}{4+d}}}$ and that the rate is achieved by kernel density estimators. Thus, the above procedure has some theoretical foundation. (Their result is for ${d=1}$ but I am pretty sure it extends to ${d>1}$.)

More recently, Klemelä (2005) showed that it is possible to obtain minimax estimators of the mode that adapt to the local behavior of ${p}$ around the modes. He uses a method due to Lepski (1992) that is a fairly general way to do adaptive inference.

My colleagues (Chris Genovese, Marco Perone-Pacifico and Isa Verdinelli) and I have been working on some extensions to the mean shift algorithm. I’ll report on that when we finish the paper.

I’ll close with a question: why don’t we use this algorithm more in statistical machine learning?

3. References

Cheng, Yizong (1995). Mean Shift, Mode Seeking, and Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17, 790-799.

Donoho, D.L. and Liu, R.C. (1991). Geometrizing rates of convergence, II. The Annals of Statistics, 633-667.

Comaniciu, Dorin; Peter Meer (2002). Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603-619.

Fukunaga, Keinosuke; Larry D. Hostetler (1975). The Estimation of the Gradient of a Density Function, with Applications in Pattern Recognition. IEEE Transactions on Information Theory, 21, 32-40.

Klemelä, J. (2005). Adaptive estimation of the mode of a multivariate density. Journal of Nonparametric Statistics, 17, 83-105.

Lepski, O.V. (1992). On problems of adaptive estimation in white Gaussian noise. Topics in nonparametric estimation, 12, 87-106.

—Larry Wasserman

### Minimax Theory Saves The World

Minimax theory is the best thing in statistical machine learning—or the worst— depending on your point of view.

1. The Basics of Minimaxity

Let ${{\cal P}}$ be a set of distributions. We observe ${Y_1,\ldots, Y_n \sim P \in {\cal P}}$. Let ${\theta = \theta(P)}$ be a function of the distribution (like the mean of ${P}$). The minimax risk is

$\displaystyle R_n = \inf_{\hat \theta}\sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))]$

where the infimum is over all estimators ${\hat\theta}$ (an estimator is any function of the observed data) and ${d(\cdot,\cdot)}$ is some distance (more generally, a loss function). In machine learning, one often reports the sample complexity instead of the minimax risk. The sample complexity is the smallest sample size needed to make ${R_n}$ less than some small number.

The challenge is to compute (or approximate) the minimax risk ${R_n}$ and to find an estimator ${\hat\theta}$ that achieves the minimax risk. An estimator ${\hat\theta}$ whose maximum risk

$\displaystyle \sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))]$

is equal to ${R_n}$, is a minimax estimator. It has the smallest possible maximum (worst case) risk.

Here is a famous example. Let ${Y_1,\ldots, Y_n \sim {\rm Normal}(\theta,1)}$. Suppose that ${d}$ has the form ${d(a,b) = g(a-b)}$ where ${g}$ is any bowl-shaped function. This means that the sub-level sets ${\{ y:\ g(y) \leq c\}}$ are convex and symmetric about the origin. Then Wolfowitz (1950) proved that the unique estimator that is minimax for all bowl-shaped ${g}$ is the sample mean ${\overline{Y}_n}$.

Here is a more sophisticated example: density estimation. Let ${Y_1,\ldots, Y_n \sim P}$ be random variables where ${P}$ has density ${p}$. We want to know the minimax risk

$\displaystyle R_n = \inf_{\hat p}\sup_{P\in {\cal P}} \mathbb{E}_P \int (\hat p(y) - p(y))^2 dy$

where ${{\cal P}}$ is the set of densities such that ${\int (p''(y))^2 dy \leq L^2}$. It turns out that ${R_n = c n^{-\frac{4}{5}}}$ for some constant ${c>0}$. The kernel density estimator

$\displaystyle \hat p(y) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{|y-Y_i|}{h}\right)$

achieves this rate of convergence (for appropriate ${K}$ and ${h}$). Histograms, however, do not. They converge more slowly. So we have discovered that, for the class ${{\cal P}}$, kernel density estimators are better than histograms and we cannot do much better than kernel estimators.

Now consider a set of spaces ${\{{\cal P}_\alpha\}}$ indexed by a parameter ${\alpha}$. Think of ${\alpha}$ as a measure of smoothness. These spaces could have different minimax rates. Sometimes it is possible to construct estimators that achieve the minimax rate of convergence without knowing the true value of ${\alpha}$. In other words, these estimators automatically adapt to the right amount of smoothness; they are called adaptive estimators. A famous example is wavelet estimation (Donoho, Johnstone, Kerkyacharian and Picard 1995).

2. Technical Digression

To find ${R_n}$, we usually find an upper bound ${R_n^*}$ and a lower bound ${R_{n*}}$ and then, if we are lucky, ${R_{n*} \approx R_n^*}$.

To find an upper bound, pick any estimator ${\hat\theta}$. The maximum risk of any estimator is an upper bound:

$\displaystyle R_n \leq \sup_{P\in {\cal P}} \mathbb{E}_P [ d(\hat\theta,\theta(P))] \equiv R_n^*.$

Finding a lower bound is trickier. Typically one uses an information-theoretic inequality like Fano’s inequality. In particular, we pick a finite set of distributions ${P_1,\ldots, P_N \in {\cal P}}$. Then

$\displaystyle R_n \geq \frac{\alpha}{2} \left( 1 - \frac{ n \beta + \log 2}{\log N}\right) \equiv R_{n*}$

where

$\displaystyle \alpha = \min_{j\neq k}d(\theta(P_j),\theta(P_k)),$

$\displaystyle \beta = \max_{j\neq k} {\sf KL}(P_j,P_k)$

and ${{\sf KL}(P_j,P_k)}$ is the Kullback-Leibler distance between ${P_j}$ and ${P_k}$. (Sometimes you need to prune the original finite set of distributions using the Varshamov-Gilbert lemma. See Tsybakov 2009).

Choosing ${\hat\theta}$ and the finite set ${P_1,\ldots, P_N}$ is an art. If you don’t choose wisely, the bounds will not be tight.
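Evaluating the bound itself is mechanical once the family is chosen. Here is a toy computation (my own sketch; the values of ${\alpha}$, ${\beta}$ and ${N}$ are entirely hypothetical):

```python
import numpy as np

def fano_lower_bound(alpha, beta, n, N):
    """Evaluate R_{n*} = (alpha/2) * (1 - (n*beta + log 2) / log N).

    alpha: minimum pairwise distance d(theta(P_j), theta(P_k))
    beta:  maximum pairwise Kullback-Leibler divergence KL(P_j, P_k)
    n:     sample size;  N: number of distributions in the finite family
    """
    return 0.5 * alpha * (1.0 - (n * beta + np.log(2)) / np.log(N))

# The bound is non-trivial only when n * beta is small relative to log N,
# which is why the family P_1, ..., P_N must be chosen with care.
print(fano_lower_bound(alpha=0.2, beta=0.1 / 1000, n=1000, N=2**20))
```

This makes the trade-off explicit: the distributions must be well separated in the parameter (large ${\alpha}$, many hypotheses) yet statistically hard to distinguish (small ${\beta}$).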

3. In Favor Of Minimaxity

The purpose of minimax analysis is to understand the difficulty of a problem. Here is a small sample of the types of questions that minimax theory has cast light on recently:

1. How hard is it to find the non-zero variables in a high-dimensional regression?
2. How hard is it to estimate an undirected graph?
3. How hard is it to estimate the homology of a manifold?
4. How hard is it to test a hypothesis about a high-dimensional Normal mean?
5. How hard is it to estimate a sparse signal from compressed measurements?

The virtue of minimax theory is that it gives a precise notion of the intrinsic difficulty of a problem. And it allows you to evaluate an estimator against a benchmark: if your estimator is not minimax then you can search for a better estimator.

4. Against Minimaxity

There are some valid criticisms of minimax theory. You have to specify a loss function. You have to specify a class of distributions ${{\cal P}}$. And you are using the worst case to define the value of a procedure.

Having to specify a loss and a class ${{\cal P}}$ does not seem like a real problem to me. Minimax theory—or for that matter, all decision theory— is relative to some loss and some class. When we say that a problem is hard, we mean it is hard under certain assumptions. A problem may be easier or harder relative to different assumptions.

The claim that minimax theory is driven by the worst case is a more substantial criticism. I happen to think worst case analysis is a good thing. I want an estimator that does reasonably well under any circumstances.

And for nonparametric problems, where ${{\cal P}}$ is a complicated infinite dimensional object, I would be skeptical of anything other than worst case analysis.

Think of it this way: if I buy a car, and I know that the car was tested and found to perform well even under extremely difficult circumstances, then I’ll feel better about buying the car.

Having said that, there are plenty of people who are uncomfortable with the worst case nature of minimax theory.

5. Minimaxity Gets the Last Word?

Minimax theory was important in statistics for a while then it sort of faded. But it had a revival, partly due to the work of Donoho et al.

But what is really interesting is that I see more and more minimax analyses in ML outlets such as NIPS, JMLR etc. So perhaps the marketplace of ideas is telling us that minimax theory is gaining in importance?

6. References

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G. and Picard, D. (1995). Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society, Series B, 57, 301-369.

Tsybakov, A.B. (2009). Introduction to Nonparametric Estimation. Springer.

Wasserman, L. (2006). All of Nonparametric Statistics. Springer.

Wolfowitz, J. (1950). Minimax estimates of the mean of a normal distribution with known variance. The Annals of Mathematical Statistics, 21, 218-230.

—Larry Wasserman

### Modern Two-Sample Tests

When you are a student, one of the first problems you learn about is the two-sample test. So you might think that this problem is old news. But it has had a revival: there is a lot of recent research activity on this seemingly simple problem. What makes the problem still interesting and challenging is that, these days, we need effective high dimensional versions.

I’ll discuss three recent innovations: kernel tests, energy tests, and cross-match tests.

This post will have some ML ideas that are mostly ignored in statistics, as well as some statistics ideas that are mostly ignored in ML.

(By the way, much of what I’ll say about two-sample tests also applies to the problem of testing whether two random variables ${X}$ and ${Y}$ are independent. The reason is that testing for independence really amounts to testing whether two distributions are the same, namely, the joint distribution ${P_{X\times Y}}$ for ${(X,Y)}$ and the product distribution ${P_X \times P_Y}$.)

1. The Problem

We observe two independent samples:

$\displaystyle X_1,\ldots, X_n \sim P$

and

$\displaystyle Y_1,\ldots, Y_m \sim Q.$

(These are random vectors.) The problem is to test the null hypothesis

$\displaystyle H_0: P=Q$

versus the alternative hypothesis

$\displaystyle H_1: P\neq Q.$

The various tests that I describe below all define a test statistic ${T}$ which is a function of the data ${X_1,\ldots, X_n ,Y_1,\ldots, Y_m}$. As usual, we reject ${H_0}$ if ${T > t}$ for some critical value ${t}$. We choose ${t}$ such that, if ${H_0}$ is true then

$\displaystyle W(T>t) \leq \alpha$

where ${W}$ is the distribution of ${T}$ when ${H_0}$ is true.

Some of the test statistics ${T}$ can be pretty complicated. This raises the question of how we choose ${t}$. In other words, how do we find the distribution ${W}$ of ${T}$ under ${H_0}$?

There are four ways:

1. We may be able to find ${W}$ explicitly.
2. We may be able to find the large sample limit of ${W}$ and use this as an approximation.
3. Bootstrapping.
4. Permutation testing.

The coolest, most general and most amazing approach is number 4, the permutation approach. This might surprise you, but I think the permutation approach to testing is one of the greatest contributions of statistics to science. And, as far as I can tell, it is not well known in ML.

Since the permutation approach can be used for any test, I’ll describe the permutation method before describing any of the test statistics.

2. The Amazing Permutation Method

Let ${T}$ be any test statistic. It doesn’t matter what the statistic is or how complicated it is. Let us organize the data as follows:

0 ${X_1}$
0 ${X_2}$
${\vdots}$ ${\vdots}$
0 ${X_n}$
1 ${Y_1}$
1 ${Y_2}$
${\vdots}$ ${\vdots}$
1 ${Y_m}$

We have added group labels to indicate the two samples. The test statistic ${T}$, whatever it is, is a function of the data vectors ${D=(X_1,\ldots,X_n,Y_1,\ldots, Y_m)}$ and the group labels ${G=(0,\ldots, 0,1,\ldots, 1)}$.

If ${H_0}$ is true, then the entire data vector is an iid sample from ${P}$ and the group labels are meaningless. It is as if someone randomly wrote down some 0’s and 1’s.

Now we randomly permute the group labels and recompute the test statistic. This changes the value of ${T}$ but (under ${H_0}$) it does not change the distribution of ${T}$. We then repeat this process ${N}$ times (where ${N}$ is some large number like 10,000). Let us denote the values of the test statistics from the permuted data by ${T_1,\ldots, T_N}$. Since the assignment of the 0’s and 1’s is meaningless under ${H_0}$, the ranks of ${T,T_1,\ldots, T_N}$ are uniformly distributed. That is, if we order these values, then the original statistic ${T}$ is equally likely to be anywhere in this ordered list.

Finally, we define the p-value

$\displaystyle p = \frac{1}{N}\sum_{j=1}^N I(T_j \geq T)$

where ${I}$ is the indicator function. So ${p}$ is just the fraction of times ${T_j}$ is larger than ${T}$. Because the ranks are uniform, ${p}$ is uniformly distributed (except for a bit of discreteness). If we reject ${H_0}$ when ${p < \alpha}$, we see that the type I error, that is, the probability of a false rejection, is no more than ${\alpha}$.

There are no approximations here. There are no appeals to asymptotics. This is exact. No matter what the statistic ${T}$ is and no matter what the distribution ${P}$ is, if we use the permutation methodology, we get correct type I error control.
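The whole method fits in a few lines. Here is a sketch (my own code; `permutation_pvalue` and the difference-of-means statistic are illustrative choices, not anything from the references below):

```python
import numpy as np

def permutation_pvalue(x, y, stat, N=10_000, rng=None):
    """Permutation p-value for an arbitrary two-sample statistic `stat`.

    Pool the samples, shuffle the group labels N times, and report the
    fraction of shuffled statistics at least as large as the observed one.
    """
    rng = np.random.default_rng() if rng is None else rng
    pooled = np.concatenate([x, y])
    n = len(x)
    T = stat(x, y)
    count = 0
    for _ in range(N):
        perm = rng.permutation(pooled)  # shuffling data = shuffling labels
        if stat(perm[:n], perm[n:]) >= T:
            count += 1
    return count / N

# Usage with the simplest possible statistic, |difference of means|:
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(1.0, 1.0, 50)        # shifted, so H_0 is false
p = permutation_pvalue(x, y, lambda a, b: abs(a.mean() - b.mean()), N=2000)
print(p)                             # small: H_0 is rejected
```

The key point is that `stat` can be anything at all: swap in any of the statistics below without changing a single line of the testing machinery.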

Now we can discuss the test statistics.

3. Kernel Tests

For a number of years, a group of researchers has used kernels and reproducing kernel Hilbert spaces (RKHS) to define very general tests. Some key references are Gretton, Fukumizu, Harchaoui and Sriperumbudur (2012), Gretton, Borgwardt, Rasch, Schölkopf and Smola (2012) and Sejdinovic, Gretton, Sriperumbudur and Fukumizu (2012).

To start with we choose a kernel ${K}$. It will be helpful to have a specific kernel in mind so let’s consider the Gaussian kernel

$\displaystyle K_h(x,y) = \exp\left( - \frac{||x-y||^2}{h^2}\right).$

The test statistic is

$\displaystyle T = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n K_h(X_i,X_j)- \frac{2}{mn}\sum_{i=1}^n \sum_{j=1}^m K_h(X_i,Y_j)+ \frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m K_h(Y_i,Y_j).$

There is some deep theory justifying this statistic (see above references) but the intuition is clear. Think of ${K_h(x,y)}$ as a measure of similarity between ${x}$ and ${y}$. Then ${T}$ compares the typical similarity of a pair of points from the same group versus the similarity of a pair of points from different groups.
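A direct (if inefficient) implementation of this statistic for scalar data might look like the following sketch (synthetic data, made-up bandwidths); the last line also crudely mimics the sup statistic discussed below by maximizing over a grid of bandwidths.

```python
import math, random

def mmd_stat(x, y, h):
    """The kernel statistic T for scalar data, with the Gaussian kernel K_h."""
    K = lambda a, b: math.exp(-((a - b) ** 2) / h ** 2)
    n, m = len(x), len(y)
    xx = sum(K(a, b) for a in x for b in x) / n ** 2
    yy = sum(K(a, b) for a in y for b in y) / m ** 2
    xy = sum(K(a, b) for a in x for b in y) / (n * m)
    return xx - 2 * xy + yy

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [rng.gauss(3, 1) for _ in range(30)]
T = mmd_stat(x, y, h=1.0)             # large when P and Q differ
T_sup = max(mmd_stat(x, y, h) for h in (0.5, 1.0, 2.0, 4.0))  # crude sup_h T_h
print(T, T_sup)
```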

The aforementioned authors derive many interesting properties of ${T}$ including the limiting distribution. But, for practical purposes, we don’t need to know the distribution. We simply apply the permutation method described earlier.

If you are paying close attention, you will have noticed that there is a tuning parameter ${h}$ so we should really write the statistic as ${T_h}$. How do we choose this?

We don’t have to. We can define a new statistic

$\displaystyle T = \sup_{h\geq 0} T_h.$

This new statistic has no tuning parameter. The null distribution is a nightmare. But that doesn’t matter. We simply apply the permutation method to the new statistic ${T}$.

4. Energy Test

Szekely and Rizzo (2004, 2005) defined a test (which later they turned into a test for independence: see Szekely and Rizzo 2009) based on estimating the following distance between ${P}$ and ${Q}$:

$\displaystyle D(P,Q) = 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||$

where ${X,X' \sim P}$ and ${Y,Y' \sim Q}$. The test statistic ${T}$ is a sample estimate of this distance. The resulting test has lots of interesting properties.
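A V-statistic version of this sample estimate, for scalar data, can be sketched as follows (synthetic data; a permutation p-value would be obtained exactly as before):

```python
import random

def energy_stat(x, y):
    """Sample version of D(P,Q) = 2E||X-Y|| - E||X-X'|| - E||Y-Y'||, scalar data."""
    n, m = len(x), len(y)
    exy = sum(abs(a - b) for a in x for b in y) / (n * m)
    exx = sum(abs(a - b) for a in x for b in x) / n ** 2
    eyy = sum(abs(a - b) for a in y for b in y) / m ** 2
    return 2 * exy - exx - eyy

rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(40)]
y = [rng.gauss(2, 1) for _ in range(40)]
print(energy_stat(x, y))   # positive when the two samples are separated
```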

Now, ${D(P,Q)}$ is based on the Euclidean metric and obviously this can be generalized to other metrics ${\rho}$. Sejdinovic, Gretton, Sriperumbudur and Fukumizu (2012) show that, under certain conditions on ${\rho}$, the resulting test corresponds exactly to some kernel test. Thus we can consider the energy test as a type of kernel test.

5. The Cross-Match Test

There is a long tradition of defining two sample tests based on certain “local coincidences.” I’ll describe just one: the cross-match test due to Paul Rosenbaum (2005).

The test works as follows. Ignore the labels and treat the data as one sample ${Z_1,\ldots, Z_{n+m}}$ of size ${n+m}$. Now form a non-bipartite matching. That is, put the data into non-overlapping pairs. For example, a typical pairing might look like this:

$\displaystyle \{Z_3, Z_8\},\ \{Z_{17}, Z_6\}, \ldots$

(For simplicity, assume there are an even number of observations.) However, we don’t pick just any matching. We choose the matching that minimizes the total distances between the points in each pair. (Any distance can be used.)

Now we look at the group labels. Some pairs will have labels ${(0,0)}$ or ${(1,1)}$ while some will have labels of the form ${(0,1)}$ or ${(1,0)}$. Let ${a_0}$ be the number of pairs of type ${(0,0)}$, let ${a_2}$ be the number of pairs of type ${(1,1)}$, and let ${a_1}$ be the number of pairs of type ${(0,1)}$ or ${(1,0)}$.

Define ${T=a_1}$. Under ${H_0}$, it is as if someone randomly sprinkled 0’s and 1’s onto the data and we can easily compute the exact distribution of ${T}$ under ${H_0}$ which is

$\displaystyle P(T=t) = \frac{2^t\, N!}{\binom{2N}{m}\, a_0!\, a_1!\, a_2!}$

where ${N = a_0 + a_1 + a_2}$ is the number of pairs (so that ${n+m = 2N}$).

In fact, this distribution can be accurately approximated by a Normal with mean ${nm/(n+m-1)}$ and variance ${2 n (n-1)m (m-1)/( (n+m-3)(n+m-1)^2)}$. (Note that approximating a known, exact, distribution which contains no nuisance parameters is quite different than using a test whose null distribution is exact only in the limit.)
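As a sanity check on the exact distribution, here is a small sketch (with the binomial coefficient taken over all ${2N}$ points) that computes the pmf for hypothetical group sizes and verifies that it sums to one with the stated mean:

```python
from math import comb, factorial

def crossmatch_pmf(n, m):
    """Exact null pmf of the cross-match count T = a_1, for n 0-labels and
    m 1-labels among N = (n + m)/2 pairs (n + m assumed even)."""
    N = (n + m) // 2
    pmf = {}
    for t in range(m + 1):
        if (m - t) % 2:
            continue                     # a_2 = (m - t)/2 must be an integer
        a2 = (m - t) // 2
        a0 = N - t - a2
        if a0 < 0:
            continue
        pmf[t] = (2 ** t * factorial(N)) / (
            comb(2 * N, m) * factorial(a0) * factorial(t) * factorial(a2))
    return pmf

pmf = crossmatch_pmf(n=8, m=8)
mean = sum(t * p for t, p in pmf.items())
print(sum(pmf.values()), mean)   # sums to 1; mean equals nm/(n+m-1)
```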

So the nice thing about Rosenbaum’s test is that it does not require the permutation method. This might not seem like a big deal but it is useful if you are doing many tests.

6. Which Test Is Best?

There is no best test, of course. More precisely, the power of the test depends on ${P}$ and ${Q}$. No test will uniformly dominate another test.

If you are undecided about which test to use, you can always combine them into one meta-test. Then you can use the permutation method to get the p-value.

Having said that, I think there is room for more research about the merits of various tests under different assumptions, especially in the high-dimensional case.

7. References

Gretton, A., Fukumizu, K., Harchaoui, Z. and Sriperumbudur, B.K. (2012). A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 673–681.

Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13, 723–773.

Rosenbaum, P.R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B, 67, 515-530.

Sejdinovic, D., Gretton, A., Sriperumbudur, B. and Fukumizu, K. (2012). Hypothesis testing using pairwise distances and associated kernels. link

Székely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58-80.

Székely, G.J. and Rizzo, M.L. (2004). Testing for equal distributions in high dimension. InterStat, 5.

Székely, G.J. and Rizzo, M.L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3, 1236-1265.

—Larry Wasserman

### The Higgs Boson and the p-value Police

The recent announcement of the discovery of the Higgs boson brought the inevitable and predictable complaints from statisticians about p-values.

Now, before I proceed, let me say that I agree that p-values are often misunderstood and misused. Nevertheless, I feel compelled to defend our physics friends, and even the journalists, from the p-value police.

The complaints come from frequentists and Bayesians alike. And many of the criticisms are right. Nevertheless, I think we should cease and desist from our p-value complaints.

Some useful commentary on the reporting can be found here, here and here. For a look at the results, look here.

Here is a list of some of the complaints I have seen together with my comments.

1. The most common complaint is that physicists and journalists explain the meaning of a p-value incorrectly. For example, if the p-value is 0.000001 then we will see statements like “there is a 99.9999% confidence that the signal is real.” We then feel compelled to correct the statement: if there is no effect, then the chance of something as or more extreme is 0.000001.

Fair enough. But does it really matter? The big picture is: the evidence for the effect is overwhelming. Does it really matter if the wording is a bit misleading? I think we reinforce our image as pedants if we complain about this.

2. The second complaint, from the Bayesian community, is that we should report ${P(H_0| {\rm data})}$ rather than a p-value. Like it or not, frequentist statistics is, for the most part, the accepted way of doing statistics for particle physics. If we go the Bayesian route, what priors will they use? They could report lots of answers corresponding to many priors. But Forbes magazine reports that it cost about 13 billion dollars to find the Higgs. For that price, we deserve a definite answer.
3. A related complaint is that people naturally interpret p-values as posterior probabilities, so we should use posterior probabilities. But that argument falls apart because we can easily make the reverse argument. Teach someone Bayesian methods and then ask them the following question: how often does your 95 percent Bayesian interval contain the true value? Inevitably they say: 95 percent. The problem is not that people interpret frequentist statements in a Bayesian way. The problem is that they don’t distinguish them. In other words: people naturally interpret frequentist statements in a Bayesian way but they also naturally interpret Bayesian statements in a frequentist way.
4. Another complaint I hear about p-values is that their use leads to too many false positives. In principle, if we only reject the null when the p-value is small we should not see many false positives. Yet there is evidence that most findings are false. The reasons are clear: non-significant studies don’t get published and many studies have hidden biases.

But the problem here is not with p-values. It is with their misuse. Lots of people drive poorly (sometimes with deadly consequences) but we don’t respond by getting rid of cars.
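For scale: the physicists’ “five sigma” criterion corresponds to a one-sided Normal tail probability, which is a one-liner to compute (a small illustration, not from the post):

```python
from math import erfc, sqrt

def one_sided_p(sigma):
    """One-sided tail probability of a standard Normal beyond sigma."""
    return 0.5 * erfc(sigma / sqrt(2))

print(one_sided_p(5))   # about 2.9e-7: the "five sigma" discovery threshold
```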

My main message is: let’s refrain from nit-picking when physicists (or journalists) report a major discovery and then don’t describe the meaning of a p-value correctly.

Now, having said all that, let me add a big disclaimer. I don’t use p-values very often. No doubt they are overused. Indeed, it’s not just p-values that are overused. The whole enterprise of hypothesis testing is overused. But, there are times when it is just the right tool and the search for the Higgs is a perfect example.

—Larry Wasserman

### The Tyranny of Tuning Parameters

The Tyranny of Tuning Parameters

We all know about the curse of dimensionality. The curse creates serious practical and theoretical difficulties in many aspects of statistics. But there is another pernicious problem that gets less attention. I call it: The Tyranny of Tuning Parameters.

Many (perhaps most) data analysis methods involve one or more tuning parameters. Examples are: the ${k}$ in ${k}$-means clustering, the regularization parameter in the lasso, the bandwidth in density estimation, the number of neighbors in ${k}$-nearest neighbors, the number of eigenvectors in PCA, and so on.

For some of these problems we have good data-driven methods to choose the parameters. But for others we don’t. For example, the optimal choice of tuning parameter for the lasso requires knowledge that we would never have in practice.

1. Regression: A Success

An example of a success is cross-validation in regression. Here is one version. Suppose the data are ${(X_1,Y_1),\ldots, (X_{2n},Y_{2n})}$. Split the ${2n}$ data points into a training set ${{\cal T}}$ of size ${n}$ and a test set ${{\cal U}}$ of size ${n}$. From the training set we create regression estimators ${\{\hat m_h:\ h\in {\cal H}\}}$ where ${h}$ is a tuning parameter and ${{\cal H}}$ is a finite set of values for ${h}$. For example, ${h}$ could be the bandwidth for kernel regression, or the regularization parameter for the lasso.

If ${(X,Y)}$ is a new observation, the mean prediction error is

$\displaystyle \mathbb{E} (Y- \hat m_h(X))^2.$

Minimizing the prediction error is the same as minimizing the ${L_2}$ risk:

$\displaystyle R(h) = \int (\hat m_h(x) - m(x))^2 dP(x)$

where ${m(x) = \mathbb{E}(Y|X=x)}$ is the true regression function.

Estimate the risk using the test set:

$\displaystyle \hat R(h) = \frac{1}{n}\sum_{(X_i,Y_i)\in {\cal U}} (Y_i - \hat m_h(X_i))^2.$

Choose ${\hat h}$ to minimize ${\hat R(h)}$ and let ${\hat m = \hat m_{\hat h}}$. To avoid technical complications, assume all the random variables and estimators are bounded by ${L < \infty}$. We then have the following theorem (Theorem 7.1, Györfi, Kohler, Krzyżak and Walk, 2002):

Let ${m_* = \hat m_{h_*}}$ be the oracle, that is, ${h_*}$ minimizes ${\int (\hat m_h(x) - m(x))^2 dP(x)}$ over ${h\in {\cal H}}$. Let ${N}$ be the size of ${{\cal H}}$. For any ${\delta>0}$,

$\displaystyle R(\hat m) \leq (1+\delta) R(m_*) + L^2 \left(\frac{16}{\delta} + 35 + 19\delta\right) \left(\frac{ 1 + \log N}{n}\right)$

where

$\displaystyle R(\hat m) = \mathbb{E} \int (\hat{m}(x) - m(x))^2 dP(x)$

and

$\displaystyle R(m_*)=\mathbb{E} \int (m_*(x) - m(x))^2 dP(x).$

In words: if you use cross-validation to choose the tuning parameter ${h}$, you lose at most a factor of size ${O(\log N/n)}$, which is pretty small.

(Note: I focused on sample splitting but of course there are lots of other versions of cross-validation: leave-one-out, 10 fold etc.)
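Here is a minimal sketch of the data-splitting procedure for kernel regression on made-up data (the grid ${\cal H}$ and all the numbers are hypothetical):

```python
import math, random

def kernreg(train, h):
    """Nadaraya-Watson estimator (m-hat) built from train = [(x, y), ...]."""
    def mhat(x):
        w = [math.exp(-((x - xi) ** 2) / (2 * h * h)) for xi, _ in train]
        s = sum(w)
        return sum(wi * yi for wi, (_, yi) in zip(w, train)) / s if s > 0 else 0.0
    return mhat

rng = random.Random(0)
xs = [rng.uniform(0, 1) for _ in range(200)]
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]
train, test = data[:100], data[100:]   # the split into training and test sets

H = [0.01, 0.05, 0.1, 0.3, 1.0]        # a made-up grid of bandwidths
risk = {h: sum((y - kernreg(train, h)(x)) ** 2 for x, y in test) / len(test)
        for h in H}
h_hat = min(risk, key=risk.get)        # minimize the estimated risk
print(h_hat)
```

On this toy example the oversmoothed bandwidths are clearly rejected by the estimated risk.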

This is an example of a successful data-driven method for choosing the tuning parameter. But there are caveats. The result refers only to prediction error. There are other properties we might want; I’ll save that for a future post.

2. Density Estimation: A Failure

Now consider density estimation. Let ${Y_1,\ldots, Y_n \sim P}$ where ${P}$ has density ${p}$ and ${Y_i\in \mathbb{R}^d}$. The usual kernel density estimator is

$\displaystyle \hat p_h(y) =\frac{1}{n}\sum_{i=1}^n \frac{1}{h^d} K\left(\frac{||y-Y_i||}{h}\right)$

where ${K}$ is a kernel and ${h>0}$ is a bandwidth.

We want to choose ${h\in {\cal H}}$ where ${{\cal H}}$ has ${N}$ elements. To choose ${h}$ one usually focuses on minimizing the ${L_2}$ loss

$\displaystyle \int (\hat p_h(y) - p(y))^2 dy = \int \hat p_h^2(y)dy - 2 \int \hat p_h(y) p(y) dy + \int p^2(y) dy$

which is equivalent to minimizing

$\displaystyle \int \hat p_h^2(y)dy - 2 \int \hat p_h(y) p(y) dy.$

The latter can be estimated using test data ${Y_1^*,\ldots, Y_n^*}$ by

$\displaystyle \int \hat p_h^2(y)dy - \frac{2}{n}\sum_{i=1}^n \hat p_h(Y_i^*).$

If we choose ${h}$ to minimize this estimated risk, then the resulting estimator ${\hat p}$ satisfies the following (Wegkamp, 1999): for any ${\epsilon >0}$,

$\displaystyle \mathbb{E} \int (\hat p - p)^2 \leq (1+\epsilon) \int (p_* - p)^2 + \frac{C \log N}{n}$

where ${C}$ is a constant and ${p_*}$ is the oracle (the true minimizer).

This is a nice result. Like the regression result above, it gives us assurance that cross-validation (data-splitting) chooses a good tuning parameter.
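Here is a sketch of this bandwidth selector in one dimension (synthetic data; the grid is made up), using the fact that for a Gaussian kernel the term ${\int \hat p_h^2(y)\,dy}$ has a closed form via the convolution of two Gaussians:

```python
import math, random

def gauss(u, s):
    """Normal density with mean 0 and standard deviation s, evaluated at u."""
    return math.exp(-u * u / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def ls_risk(train, test, h):
    """Estimated L2 risk of the kernel density estimator with bandwidth h.
    For a Gaussian kernel: integral of p-hat squared equals
    (1/n^2) * sum_ij gauss(Y_i - Y_j, h*sqrt(2))."""
    n = len(train)
    sq = sum(gauss(a - b, h * math.sqrt(2)) for a in train for b in train) / n ** 2
    phat = lambda y: sum(gauss(y - yi, h) for yi in train) / n
    return sq - 2 * sum(phat(y) for y in test) / len(test)

rng = random.Random(0)
train = [rng.gauss(0, 1) for _ in range(100)]
test = [rng.gauss(0, 1) for _ in range(100)]
H = [0.05, 0.2, 0.5, 1.0, 3.0]         # a made-up grid of bandwidths
h_hat = min(H, key=lambda h: ls_risk(train, test, h))
print(h_hat)
```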

But, I don’t think cross-validation really does solve the bandwidth selection problem in density estimation. The reason is that I don’t think ${L_2}$ loss is the right loss function.

Look at these three densities:

In this case, ${\int (p_0 - p_1)^2 = \int (p_0 - p_2)^2}$ so, in the ${L_2}$ sense, ${p_1}$ is just as good an approximation to ${p_0}$ as ${p_2}$ is. And yet, ${p_2}$ is in many respects a lousy approximation. The problem is that ${L_2}$ is insensitive to shape information.

Another example is shown in the following plot:

The top left plot is the true distribution which is a mixture of a bimodal density with a point mass at 0. Of course, this distribution is singular; it does not really have a density. The top right shows the density estimator with a small bandwidth, the bottom left shows the density estimator with a medium bandwidth and the bottom right shows the density estimator with a large bandwidth. I would call the estimator in the bottom left plot a success; it shows the structure of the data nicely. Yet, cross-validation leads to the choice ${h=0}$. The problem isn’t cross-validation. The problem here is to find a suitable loss function and then find a way to estimate the loss.

Despite years of research, I don’t think we have a good way to choose the bandwidth in a simple problem like density estimation.

3. Scale Space

Steve Marron takes a different point of view (Chaudhuri and Marron, 2000). Borrowing ideas from computer vision, he and his co-authors have argued that we should not choose a bandwidth. Instead, we should look at all the estimates ${\{\hat p_h:\ h\geq 0\}}$. This is called the scale-space view. The idea is that different bandwidths provide different information.

I have much sympathy for this viewpoint. And it is similar to Christian Hennig’s views on clustering.

Nevertheless, I think in many cases it is useful to have a data-driven default value and I don’t think we have one yet.

4. Conclusion

There are many other examples where we don’t have convincing methods for choosing tuning parameters. In many journal and conference papers, the tuning parameters have been chosen in a very ad hoc way to make the proposed method look successful.

In my humble opinion, we need to put more effort into finding data-driven methods for choosing these annoying tuning parameters.

5. References

Györfi, L., Kohler, M., Krzyżak, A. and Walk, H. (2002). A Distribution-free Theory of Nonparametric Regression.

Chaudhuri, P. and Marron, J.S. (2000). Scale space view of curve estimation. The Annals of Statistics, 28, 408-428.

Wegkamp, M.H. (1999). Quasi-universal bandwidth selection for kernel density estimators. Canadian Journal of Statistics, 27, 409-420.

— Larry Wasserman

### Statistics Without Probability (Individual Sequences)

Happy Independence Day and Happy Higgs Day.

Frequentist statistics treats observations as random and parameters as fixed. Bayesian statistics treats everything as probabilistic. But there is an approach to statistics that removes probability completely. This is the theory of individual sequences, which is a subset of online (sequential) learning. You can think of this theory as taking the idea of distribution-free inference to an extreme. The description I’ll give is basically from the book:

Nicolò Cesa-Bianchi and Gábor Lugosi, (2006). Prediction, Learning, and Games

and the article:

Cesa-Bianchi and Lugosi, (1999). On Prediction of Individual Sequences. The Annals of Statistics, 27, 1865-1895.

1. Individual Sequences

Let ${y_1,y_2,y_3,\ldots}$ be a sequence of numbers. For simplicity, we’ll assume that ${y_t \in\{0,1\}}$. We observe the data sequentially and our goal is to predict the next observation. For any time ${t}$, let

$\displaystyle y^t = (y_1,\ldots, y_t)$

denote the first ${t}$ observations.

The data sequence is not assumed to be random. It is just a sequence of numbers. In fact, we even allow for the possibility that the sequence is generated by an adversary who is trying to screw up our predictions.

We also have a set of algorithms (or predictors) ${{\cal F}=\{F_1, \ldots, F_N\}}$. (In the literature, these algorithms are often called experts but I prefer to use the term algorithms.) Each algorithm takes any observed sequence and spits out a prediction. So ${F_j(y^t)}$ is the prediction of ${y_{t+1}}$ from algorithm ${j}$ based on ${y^t = (y_1,\ldots, y_t)}$. We will assume that the predictions take values in the interval ${[0,1]}$. Your goal is to observe the sequence and the outputs of the algorithms and then construct your own prediction. The situation unfolds as follows:

1. You see ${y^{t-1}}$ and ${(F_{1,t},\ldots,F_{N,t})}$ where ${F_{j,t} = F_j(y^{t-1})}$ is the prediction for ${y_t}$ from algorithm ${j}$.
2. You make a prediction ${P_t}$.
3. ${y_t}$ is revealed.
4. You suffer loss ${\ell(P_t,y_t)}$.

We will focus on the loss ${\ell(P_t,y_t)=|P_t-y_t|}$ but the theory works more generally. The average cumulative loss after ${n}$ rounds experienced by algorithm ${j}$ is

$\displaystyle L_j(y^n) = \frac{1}{n}\sum_{t=1}^n |F_{j,t}-y_t| \equiv \frac{1}{n}S_j(y^n)$

where ${S_j(y^n) = \sum_{t=1}^n |F_{j,t}-y_t|}$. (I am using the average rather than the sum which is more common.)

You suffer average cumulative loss

$\displaystyle L_P(y^n) =\frac{1}{n}\sum_{t=1}^n |P_t-y_t|.$

Your regret is your cumulative loss compared to the loss of the best prediction algorithm:

$\displaystyle {\sf regret} =L_P(y^n) - \min_j L_j(y^n).$

Since we are not using probability, how do we deal with the fact that the regret depends on the particular sequence? The answer is: we take the worst case over all sequences. We define the maximum regret

$\displaystyle R_n = \max_{y^n \in \{0,1\}^n} \left( L_P(y^n) - \min_j L_j(y^n)\right)$

where the maximum is over all possible binary sequence of length ${n}$. Finally, the minimax regret is

$\displaystyle V_n = \inf_P \max_{y^n \in \{0,1\}^n} \left( L_P(y^n) - \min_j L_j(y^n)\right).$

This is the smallest possible maximum regret over all predictors ${P}$.

It has been shown that

$\displaystyle V_n \leq \sqrt{\frac{\log N}{2n}}.$

In fact, this bound is essentially tight.

Is there an algorithm ${P}$ that achieves this bound? Yes: it is the weighted majority algorithm going back to Vovk (1990) and Littlestone and Warmuth (1994).

The algorithm is simple: we simply take a weighted average of the predictions from ${F_1,\ldots, F_N}$. Let ${P_t(y^{t-1}) = \sum_{j=1}^N w_{j,t-1} \ F_{j,t}}$ where the weights are

$\displaystyle w_{j,t-1} =\frac{\exp\left\{ - \gamma S_{j,t-1} \right\}}{Z_t}$

and ${Z_t = \sum_{j=1}^N \exp\left\{ - \gamma S_{j,t-1} \right\}.}$ The ${w_j}$‘s are called exponential weights.

Theorem 1 Let ${\gamma = \sqrt{8 \log N/n}}$. Then

$\displaystyle L_P(y^n) - \min_{1\leq j\leq N} L_j(y^n) \leq \sqrt{\frac{\log N}{2n}}.$

The proof (again I refer you to Cesa-Bianchi and Lugosi 1999) is quite simple and, interestingly, it is very probabilistic. That is, even though the problem has no probability in it, the proof techniques are probabilistic.

The result is for a specific time ${n}$. We can make the result uniform over time as follows. If we set ${\gamma_t = \sqrt{ 8 \log N/t}}$ then we have:

$\displaystyle L_P(y^n) \leq \min_j L_j(y^n) + \sqrt{ \frac{1 + 12n \log N}{8n}}$

for all ${n}$ and for all ${y_1,y_2,\ldots, y_n}$.
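The exponential weights forecaster is easy to implement. Here is a sketch with two trivial made-up experts (always predict 0, always predict 1); the final comparison checks the regret bound of Theorem 1 on this particular sequence:

```python
import math, random

def exp_weights(y, experts, gamma):
    """Exponentially weighted average forecaster for a binary sequence y.
    experts[j] maps the past y^{t-1} (a list) to a prediction in [0,1].
    Returns (your average loss, list of the experts' average losses)."""
    n = len(y)
    S = [0.0] * len(experts)              # cumulative losses S_{j,t-1}
    total = 0.0
    for t in range(n):
        w = [math.exp(-gamma * s) for s in S]
        Z = sum(w)
        preds = [f(y[:t]) for f in experts]
        P = sum(wi * p for wi, p in zip(w, preds)) / Z   # the prediction P_t
        total += abs(P - y[t])
        for j, p in enumerate(preds):
            S[j] += abs(p - y[t])
    return total / n, [s / n for s in S]

# Two trivial made-up experts: always predict 0, always predict 1.
experts = [lambda past: 0.0, lambda past: 1.0]
rng = random.Random(0)
y = [1 if rng.random() < 0.7 else 0 for _ in range(500)]
n, N = len(y), len(experts)
gamma = math.sqrt(8 * math.log(N) / n)
L_P, L = exp_weights(y, experts, gamma)
regret = L_P - min(L)
print(regret, math.sqrt(math.log(N) / (2 * n)))   # regret is below the bound
```

Note that the bound holds for every sequence, not just this random one; that is the point of the individual-sequence framework.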

Now suppose that ${{\cal F}}$ is an infinite class. A set ${{\cal G}=\{G_1,\ldots, G_N\}}$ is an ${r}$-covering if, for every ${F\in{\cal F}}$ and every ${y^n}$ there is a ${G_j}$ such that

$\displaystyle \sum_{t=1}^n |F_t(y^{t-1}) - G_{j,t}(y^{t-1})| \leq r.$

Let ${N(r)}$ denote the size of the smallest ${r}$-covering.

Theorem 2 (Cesa-Bianchi and Lugosi) The minimax regret satisfies

$\displaystyle V_n({\cal F}) \leq \inf_{r>0} \left( \frac{r}{n} + \sqrt{ \frac{\log N(r)}{2n}}\right)$

Cesa-Bianchi and Lugosi then construct a predictor that nearly achieves the bound of the form

$\displaystyle P_t = \sum_{k=1}^\infty a_k P_t^{(k)}$

where ${P_t^{(k)}}$ is a predictor based on a finite subset of ${{\cal F}}$. (Again, there are hidden connections to probability; the construction uses chaining, which is a familiar technique in empirical process theory.)

2. Back To Probability

What if we have batch data (as opposed to sequential data) and we make the usual i.i.d. stochastic assumption? Can we still use these online ideas? (The reference for this section is: Cesa-Bianchi, Conconi and Gentile 2004).

Using batchification it is possible to use online learning for non-online learning. For example, suppose we are given data ${(Z_1,\ldots, Z_n)}$ where ${Z_t = (X_t,Y_t)}$ and an arbitrary algorithm ${A}$ that takes data and outputs a classifier ${H}$. We apply ${A}$ sequentially to get classifiers ${H_0,H_1,\ldots,H_n}$. Let

$\displaystyle M_n = \frac{1}{n}\sum_{t=1}^n \ell( H_{t-1}(X_t),Y_t)$

To choose a final classifier we either use ${\overline{H} = \frac{1}{n}\sum_{t=1}^n H_{t-1}}$ or define ${\hat H}$ by choosing ${H_t}$ to minimize

$\displaystyle \frac{1}{n-t}\sum_{i=t+1}^n \ell(H_t(X_i),Y_i) + \sqrt{ \frac{1}{2(n-t)} \log\left( \frac{n (n+1)}{\delta}\right)}.$

Let ${R(H)}$ be the risk, that is, the probability of misclassifying a future observation.

Theorem 3 If ${\ell}$ is convex:

$\displaystyle \mathbb{P}\left( R(\overline{H}) \geq M_n + \sqrt{\frac{2}{n}\log\left( \frac{1}{\delta}\right)}\right)\leq \delta.$

For any ${\ell}$,

$\displaystyle \mathbb{P}\left( R(\hat{H}) \geq M_n + \sqrt{\frac{36}{n}\log\left( \frac{2(n+1)}{\delta}\right)}\right)\leq \delta.$

This means that we get nice performance bounds in the traditional probabilistic sense from online methods.

3. Summary

Individual sequence prediction and, more generally, online prediction methods, are interesting because they require no stochastic assumptions. This is very satisfying because the usual assumption that the data are drawn from a distribution (or, if you are a Bayesian, the assumption of exchangeability) is a fiction. Data are just numbers. Except in special situations, assuming that the data are random is something we do by reflex but it is often not justifiable.

Most research in this area is done by people in machine learning. Statistics is being left behind. There are exceptions of course; some names that come to mind are Gabor Lugosi, Andrew Nobel, Peter Grunwald, Andrew Barron, Sasha Rakhlin, and others I am forgetting. Nonetheless, I think that most statisticians are unaware of this area. We need to be more involved in it.

4. References

Nicolò Cesa-Bianchi and Gábor Lugosi (2006). Prediction, Learning, and Games.

Cesa-Bianchi, N. and Conconi, A. and Gentile, C. (2004). On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50, 2050–2057.

Cesa-Bianchi and Lugosi, (1999). On Prediction of Individual Sequences. The Annals of Statistics, 27, 1865-1895.

Littlestone, N. and Warmuth, M. (1994). The weighted majority algorithm. Information and Computation, 108, 212-261.

Vovk, V. (1990). Aggregating strategies. Proc. Third Workshop on Computational Learning Theory, 371–383.

I recommend these lecture notes by Sasha Rakhlin and Karthik Sridharan.

—Larry Wasserman

### Topological Data Analysis

Topological data analysis (TDA) is a relatively new area of research that spans many disciplines including topology (in particular, homology), statistics, machine learning and computational geometry.

The basic idea of TDA is to describe the “shape of the data” by finding clusters, holes, tunnels, etc. Cluster analysis is a special case of TDA. I’m not an expert on TDA but I do find it fascinating. I’ll try to give a flavor of what this subject is about.

1. Cycles and Holes

Let us start with clusters. Consider data ${Y_1,\ldots,Y_n}$ where each ${Y_i}$ is a point in ${\mathbb{R}^d}$. One way to define clusters is to define a graph over the data. We fix a tuning parameter ${\epsilon}$ and we put an edge between two data points ${Y_i}$ and ${Y_j}$ if ${||Y_i - Y_j|| \leq \epsilon}$. We can then define the clusters to be the connected components of this graph.

This type of clustering is identical to single linkage clustering.
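A minimal sketch of this graph-based clustering (union-find on the ${\epsilon}$-neighborhood graph; the points are made up):

```python
def graph_clusters(points, eps):
    """Connected components of the epsilon-neighborhood graph, via union-find."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)   # merge the two components
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Two well-separated (made-up) blobs: two clusters at a suitable epsilon.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
print(len(graph_clusters(pts, eps=0.5)))   # 2
```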

There are problems with single linkage clustering; for example, single linkage clusters are inconsistent estimates of high density regions; see Hartigan (1981). So this should raise some alarm bells. But let’s ignore that for now.

To extract more topological information— in particular, to get the homology groups— we need to do some more work. In addition to the edges of the graph, we will include triangles and higher order simplices. Look at the following example:

The top left shows some data. We see, intuitively, that there is one “hole” in the data. Also shown is the Čech complex, described below. Basically this consists of 0-simplices (the points), the 1-simplices (the edges) and a 2-simplex (a triangle). The top right and bottom left show two “cycles” (or loops) through the data. Both cycles show the same hole in the data. To capture the idea that both cycles capture the same hole we need to regard these cycles as equivalent. This means that we can deform one cycle into the other. More formally, the difference between the two cycles is a “filled-in cycle,” that is, it is the boundary of a higher-order object. Here is another example of two equivalent cycles, based on Figure 4.9 from Zomorodian (2005) (Afra Zomorodian, 2005, Topology for Computing).

2. Making It Formal: The Čech Complex

Think of each data point as a 0-dimensional simplex. Let ${S_0=\{ [Y_1], \ldots, [Y_n]\}}$ denote these simplices. Think of each edge in the graph as a 1-dimensional simplex. Let ${S_1}$ denote these simplices (i.e. the edges). (We only include an edge if ${||Y_i - Y_j|| \leq \epsilon}$.) Now we form a set ${S_2}$ of 2-dimensional simplices (triangles) as follows. If three points ${Y_i,Y_j,Y_k}$ are such that ${B(Y_i,\epsilon)\cap B(Y_j,\epsilon)\cap B(Y_k,\epsilon)\neq \emptyset}$ then we include the simplex defined by these three points, denoted by ${[Y_i,Y_j,Y_k]}$, in ${S_2}$. Here, ${B(Y_i,\epsilon)}$ is a ball of radius ${\epsilon}$ centered at ${Y_i}$.

Now keep going and form the 3-dimensional simplices ${S_3}$, the 4-dimensional simplices ${S_4}$, and so on. (Note that a ${k}$ simplex is defined by ${k+1}$ points.)

The collection ${C_\epsilon}$ of simplices ${\{S_0,S_1, \ldots, S_{n-1}\}}$ has a special structure. In particular, if ${\sigma\in C_\epsilon}$ then any face of ${\sigma}$ is also in ${C_\epsilon}$. A class of simplices with this property is called a simplicial complex. This particular simplicial complex is called a Čech complex.

The Čech complex contains topological information. To see this, we need a few more definitions. Specifically, we introduce chains ${C_p}$, which are sets of ${p}$-dimensional simplices; cycles ${Z_p}$, which are chains without boundary; and boundary cycles ${B_p}$, which are filled-in cycles, that is, they are cycles but they are also boundaries of higher order chains. This leads to the following key picture of homology:

A chain of order ${k}$ is any set of ${k}$-simplices. Formally, these are written as a sum. For example,

$\displaystyle [Y_1,Y_2] + [Y_2,Y_3] + [Y_3,Y_4]$

is a chain consisting of three one-dimensional simplices (edges). Intuitively, you can think of the sum as taking the set difference. Thus ${[Y_1,Y_2] + [Y_2,Y_3] =[Y_1,Y_3]}$. The chains ${C_p}$ form a group under addition.

The boundary of a simplex ${\sigma}$ is obtained by removing each element one at a time and adding the result. For example, the boundary of the triangle ${[Y_1,Y_2,Y_3]}$ is

$\displaystyle \partial_2 [Y_1,Y_2,Y_3] = [Y_2,Y_3] + [Y_1,Y_3] + [Y_1,Y_2].$

Thus the boundary of the triangle is the chain consisting of its three edges. Applying ${\partial_1}$ to this chain gives the empty set, since each vertex appears twice and the terms cancel. (Thanks to Ramsay for pointing out an error in an earlier version of this computation.)

More generally, the boundary of a ${k}$-simplex ${\sigma = [Y_0,\ldots, Y_k]}$ is

$\displaystyle \partial_k [Y_0,\ldots, Y_k]= \sum_j [Y_0,\ldots, \hat{Y}_j,\ldots, Y_k]$

where ${\hat Y_j}$ means that ${Y_j}$ has been removed. The boundary of a chain ${C = \sum_j \sigma_j}$ is defined to be ${\partial_k C = \sum_j \partial_k \sigma_j}$.

A chain with an empty boundary is a cycle. The set of cycles is denoted by ${Z_p}$. Formally, cycles are the kernel of boundary map: ${Z_p = {\sf ker}\partial_p}$.

Some elements of ${Z_p}$ are “filled in cycles.” Formally, these boundary cycles are boundaries of chains in ${C_{p+1}}$. That is, the boundary cycles ${B_p \subset Z_p}$ are the image of ${\partial_{p+1}}$: ${B_p = {\sf im}\partial_{p+1}}$. It may be verified that

$\displaystyle \partial_{p-1}\partial_p C_p = \emptyset.$

This yields the key picture above.

Now two cycles ${z}$ and ${z'}$ are considered equivalent if they differ by an element of ${B_p}$. That is, ${z\sim z'}$ if ${z = z' + b}$ for some ${b\in B_p}$. The set of equivalence classes ${H_p}$ is the homology group. Formally, ${H_p = Z_p/B_p}$. (Think of the two cycles in the first example defining the same hole.) The rank of the group is called the Betti number and is denoted by ${\beta_p}$.

How do we compute the groups and the Betti numbers? All of the above can be expressed as matrices. Basically, the matrices capture which simplices are boundaries of which other simplices. This is the appeal of homology: it can be expressed as linear algebra. The Matlab package “Plex” will take a set of points and compute all the stuff we have discussed.

In case you’re getting confused, let’s step back and summarize the main point: ${\beta_0}$ is the number of connected components, ${\beta_1}$ is the number of holes (loops), ${\beta_2}$ is the number of voids, and so on.

3. Persistence

Notice that there is a tuning parameter ${\epsilon}$. The conclusions depend crucially on ${\epsilon}$. How do we choose it? Usually, we don’t. Instead, one displays the topological information as a function of ${\epsilon}$. This can be done in a barcode plot as follows: (this is Figure 4 from Ghrist (2008)):

Source: Figure 4, from Ghrist (2008), Barcodes: The persistent topology of data, Bulletin-American Mathematical Society, 45, page 61.

What we see here is that as ${\epsilon}$ varies, some topological features appear and then disappear. The features that persist for a long stretch are considered to be real; the small ones are called topological noise. This type of analysis is called persistent homology.
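For the ${\beta_0}$ part of the barcode, persistence is easy to compute directly: every component is born at ${\epsilon=0}$ and dies at a single-linkage merge height. A sketch with made-up points:

```python
from itertools import combinations

def beta0_barcode(points):
    """Death times for the beta_0 barcode: components are born at epsilon = 0
    and die at single-linkage merge heights (Kruskal-style union-find)."""
    n = len(points)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    dist = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5
    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(n), 2))
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)        # a component dies at scale d
    return deaths                   # n-1 deaths; one component never dies

# Two made-up pairs of points: two short bars and one long-lived component.
pts = [(0.0, 0.0), (0.2, 0.0), (3.0, 0.0), (3.2, 0.0)]
deaths = beta0_barcode(pts)
print(deaths)   # two merges near 0.2, the final merge near 2.8
```

The long gap between the early merges and the final one is exactly a persistent feature: two clusters.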

4. What Are We Actually Estimating?

Suppose the data ${Y_1,\ldots, Y_n}$ were obtained as follows. First we sample data ${X_1,\ldots, X_n}$ from a distribution ${G}$ which is supported on some manifold ${M}$. We don’t observe ${X_i}$. Instead, we observe ${Y_i = X_i + \delta_i}$ where ${\delta_1,\ldots,\delta_n}$ represent random noise.

The manifold ${M}$ has homology groups defined in a way similar to how we defined the groups for the data. The hope is that the homology from the data estimates the homology of ${M}$. The implicit belief seems to be that persistent features (long bars in the barcode plot) are real features of the underlying manifold.

What’s known more precisely is that, under certain conditions, and for some choices of ${\epsilon}$, the Čech complex has high probability of having the right homology. See, for example, Niyogi, Smale and Weinberger (2011). There are other ways to try to estimate the homology of ${M}$; see, for example, Caillerie et al. (2012).

In fact, the data need to be cleaned first by removing low-density points. This means we need to first do some sort of density estimation. (This relates to my earlier comment about the lack of consistency of single linkage clustering.) So there are in fact many tuning parameters, and practical ways to choose them still seem elusive. This is an example of what I like to call the curse of tuning parameters.
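The density-filtering step can be sketched as follows, using the distance to the ${k}$-th nearest neighbor as a crude density surrogate. The data, the choice of ${k}$, and the threshold below are all hypothetical choices of mine, which is exactly the point: they are yet more tuning parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 200 points near the unit circle, plus 20
# uniform outliers well away from it.
theta = rng.uniform(0, 2 * np.pi, 200)
circle = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.02, (200, 2))
outliers = rng.uniform(3, 5, (20, 2))
pts = np.vstack([circle, outliers])

# Distance to the k-th nearest neighbor: small in dense regions,
# large for isolated low-density points.
k, threshold = 10, 0.3
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
knn = np.sort(dists, axis=1)[:, k]   # column 0 is the point itself
keep = knn < threshold

# Most circle points survive the cut; most outliers do not.
print(keep[:200].mean(), keep[200:].mean())
```

Only after this cleaning step would one build the complex and compute homology; note that ${k}$ and the threshold now sit alongside ${\epsilon}$ in the list of tuning parameters.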

5. What Do We Use It For?

TDA is fascinating. But is it useful? One can find many examples of TDA in genomics, neuroscience and image analysis; see, for example, Dabaghian et al. (2012).

But is TDA crucial here or is it a hammer looking for a nail? After all, the most successful data analysis tools are simple and homology is pretty complicated.

To me, the jury is still out. I find TDA interesting but whether it will prove to be an enduring set of data analysis methods remains to be seen. In the meantime, I hope people will keep thinking about it.

6. References

Edelsbrunner, H. and Harer, J. (2010). Computational Topology: An Introduction.

Zomorodian, Afra. (2005). Topology for Computing.

Balakrishnan, S., Rinaldo, A., Sheehy,D., Singh,A. and Wasserman, L. (2011). Minimax Rates for Homology Inference.

Ghrist, R. (2008). Barcodes: The persistent topology of data. Bulletin of the American Mathematical Society, 45, 61–75.

Dabaghian, Y., Mémoli, F., Frank, L. and Carlsson, G. (2012). A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology.

Niyogi, P., Smale, S. and Weinberger, S. (2011). A topological view of unsupervised learning from noisy data. SIAM Journal on Computing, 40, 646.

Caillerie, Chazal, Dedecker and Michel (2012). Deconvolution for the Wasserstein metric and geometric inference. Electronic Journal of Statistics, 5, 1394–1423.

— Larry Wasserman