## Modern Two-Sample Tests

When you are a student, one of the first problems you learn about is the two-sample test. So you might think that this problem is old news. But it has had a revival: there is a lot of recent research activity on this seemingly simple problem. What makes the problem still interesting and challenging is that, these days, we need effective high dimensional versions.

I’ll discuss three recent innovations: kernel tests, energy tests, and cross-match tests.

This post will have some ML ideas that are mostly ignored in statistics, as well as some statistics ideas that are mostly ignored in ML.

(By the way, much of what I’ll say about two-sample tests also applies to the problem of testing whether two random variables ${X}$ and ${Y}$ are independent. The reason is that testing for independence really amounts to testing whether two distributions are the same, namely, the joint distribution ${P_{X\times Y}}$ for ${(X,Y)}$ and the product distribution ${P_X \times P_Y}$.)

1. The Problem

We observe two independent samples:

$\displaystyle X_1,\ldots, X_n \sim P$

and

$\displaystyle Y_1,\ldots, Y_m \sim Q.$

(These are random vectors.) The problem is to test the null hypothesis

$\displaystyle H_0: P=Q$

versus the alternative hypothesis

$\displaystyle H_1: P\neq Q.$

The various tests that I describe below all define a test statistic ${T}$ which is a function of the data ${X_1,\ldots, X_n ,Y_1,\ldots, Y_m}$. As usual, we reject ${H_0}$ if ${T > t}$ for some critical value ${t}$. We choose ${t}$ such that, if ${H_0}$ is true, then

$\displaystyle W(T>t) \leq \alpha$

where ${W}$ is the distribution of ${T}$ when ${H_0}$ is true.

Some of the test statistics ${T}$ can be pretty complicated. This raises the question of how we choose ${t}$. In other words, how do we find the distribution ${W}$ of ${T}$ under ${H_0}$?

There are four ways:

1. We may be able to find ${W}$ explicitly.
2. We may be able to find the large sample limit of ${W}$ and use this as an approximation.
3. Bootstrapping.
4. Permutation testing.

The coolest, most general and most amazing approach is number 4, the permutation approach. This might surprise you, but I think the permutation approach to testing is one of the greatest contributions of statistics to science. And, as far as I can tell, it is not well known in ML.

Since the permutation approach can be used for any test, I’ll describe the permutation method before describing any of the test statistics.

2. The Amazing Permutation Method

Let ${T}$ be any test statistic. It doesn’t matter what the statistic is or how complicated it is. Let us organize the data as follows:

0 ${X_1}$
0 ${X_2}$
${\vdots}$ ${\vdots}$
0 ${X_n}$
1 ${Y_1}$
1 ${Y_2}$
${\vdots}$ ${\vdots}$
1 ${Y_m}$

We have added group labels to indicate the two samples. The test statistic ${T}$, whatever it is, is a function of the data vectors ${D=(X_1,\ldots,X_n,Y_1,\ldots, Y_m)}$ and the group labels ${G=(0,\ldots, 0,1,\ldots, 1)}$.

If ${H_0}$ is true, then the entire data vector is an iid sample from ${P}$ and the group labels are meaningless. It is as if someone randomly wrote down some 0’s and 1’s.

Now we randomly permute the group labels and recompute the test statistic. This changes the value of ${T}$ but (under ${H_0}$) it does not change the distribution of ${T}$. We then repeat this process ${N}$ times (where ${N}$ is some large number like 10,000). Let us denote the values of the test statistics from the permuted data by ${T_1,\ldots, T_N}$. Since the assignment of the 0’s and 1’s is meaningless under ${H_0}$, the ranks of ${T,T_1,\ldots, T_N}$ are uniformly distributed. That is, if we order these values, then the original statistic ${T}$ is equally likely to be anywhere in this ordered list.

Finally, we define the p-value

$\displaystyle p = \frac{1}{N}\sum_{j=1}^N I(T_j \geq T)$

where ${I}$ is the indicator function. So ${p}$ is just the fraction of times ${T_j}$ is larger than ${T}$. Because the ranks are uniform, ${p}$ is uniformly distributed (except for a bit of discreteness). If we reject ${H_0}$ when ${p < \alpha}$, we see that the type I error, that is, the probability of a false rejection, is no more than ${\alpha}$.

There are no approximations here. There are no appeals to asymptotics. This is exact. No matter what the statistic ${T}$ is and no matter what the distribution ${P}$ is, if we use the permutation methodology, we get correct type I error control.
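As a concrete illustration, here is a minimal sketch of the permutation method in Python with NumPy. The statistic, the number of permutations, and the add-one correction in the p-value (using ${(1+\#\{T_j \geq T\})/(N+1)}$, a standard variant of the formula above that stays valid with a finite number of random permutations) are choices of this sketch, not prescriptions from the post:

```python
import numpy as np

def permutation_test(x, y, statistic, n_perms=10000, seed=None):
    """Permutation p-value for H0: P = Q, valid for any test statistic."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    t_obs = statistic(x, y)
    count = 0
    for _ in range(n_perms):
        # permuting the pooled data is equivalent to permuting the group labels
        z = rng.permutation(pooled)
        if statistic(z[:n], z[n:]) >= t_obs:
            count += 1
    # the +1 terms keep the p-value valid with finitely many random permutations
    return (count + 1) / (n_perms + 1)
```

Any function of the two samples can be plugged in as `statistic`; its choice affects power, but never the type I error control.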

Now we can discuss the test statistics.

3. Kernel Tests

For a number of years, a group of researchers has used kernels and reproducing kernel Hilbert spaces (RKHS) to define very general tests. Some key references are Gretton, Fukumizu, Harchaoui and Sriperumbudur (2012), Gretton, Borgwardt, Rasch, Schölkopf and Smola (2012) and Sejdinovic, Gretton, Sriperumbudur and Fukumizu (2012).

To start with we choose a kernel ${K}$. It will be helpful to have a specific kernel in mind so let’s consider the Gaussian kernel

$\displaystyle K_h(x,y) = \exp\left( - \frac{||x-y||^2}{h^2}\right).$

The test statistic is

$\displaystyle T = \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n K_h(X_i,X_j)- \frac{2}{mn}\sum_{i=1}^n \sum_{j=1}^m K_h(X_i,Y_j)+ \frac{1}{m^2}\sum_{i=1}^m \sum_{j=1}^m K_h(Y_i,Y_j).$

There is some deep theory justifying this statistic (see above references) but the intuition is clear. Think of ${K_h(x,y)}$ as a measure of similarity between ${x}$ and ${y}$. Then ${T}$ compares the typical similarity of a pair of points from the same group versus the similarity of a pair of points from different groups.
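In code, the statistic is a few lines (a sketch with NumPy; the array shapes `(n, d)` and `(m, d)` and the default bandwidth are assumptions of the sketch):

```python
import numpy as np

def mmd_stat(x, y, h=1.0):
    """Kernel two-sample statistic T with the Gaussian kernel
    K_h(u, v) = exp(-||u - v||^2 / h^2); x has shape (n, d), y has shape (m, d)."""
    def k(a, b):
        # pairwise squared distances via broadcasting, then the kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / h ** 2)
    # within-group similarity minus twice the between-group similarity
    return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()
```

With identical samples the statistic is exactly zero; well-separated samples push it up, matching the similarity intuition above.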

The aforementioned authors derive many interesting properties of ${T}$ including the limiting distribution. But, for practical purposes, we don’t need to know the distribution. We simply apply the permutation method described earlier.

If you are paying close attention, you will have noticed that there is a tuning parameter ${h}$ so we should really write the statistic as ${T_h}$. How do we choose this?

We don’t have to. We can define a new statistic

$\displaystyle T = \sup_{h\geq 0} T_h.$

This new statistic has no tuning parameter. The null distribution is a nightmare. But that doesn’t matter. We simply apply the permutation method to the new statistic ${T}$.
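In practice the supremum would be approximated by maximizing over a finite grid of bandwidths (the grid below is an arbitrary choice for illustration, not something the post specifies):

```python
import numpy as np

def mmd_sup_stat(x, y, hs=(0.1, 0.5, 1.0, 2.0, 5.0)):
    """Approximate sup_h T_h by maximizing T_h over a finite grid of bandwidths."""
    def t_h(h):
        d2 = lambda a, b: ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        k = lambda a, b: np.exp(-d2(a, b) / h ** 2)
        return k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()
    return max(t_h(h) for h in hs)
```

As the post notes, the null distribution of this maximized statistic is intractable, but that is irrelevant once we calibrate it by permutation.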

4. Energy Test

Szekely and Rizzo (2004, 2005) defined a test (which later they turned into a test for independence: see Szekely and Rizzo 2009) based on estimating the following distance between ${P}$ and ${Q}$:

$\displaystyle D(P,Q) = 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||$

where ${X,X' \sim P}$ and ${Y,Y' \sim Q}$. The test statistic ${T}$ is a sample estimate of this distance. The resulting test has lots of interesting properties.
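A sample estimate of ${D(P,Q)}$ just replaces each expectation with an average of pairwise distances (a sketch; this is the V-statistic version, whose within-group averages include the zero diagonal):

```python
import numpy as np

def energy_stat(x, y):
    """Plug-in estimate of D(P,Q) = 2 E||X-Y|| - E||X-X'|| - E||Y-Y'||."""
    def d(a, b):
        # matrix of pairwise Euclidean distances between rows of a and b
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return 2 * d(x, y).mean() - d(x, x).mean() - d(y, y).mean()
```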

Now, ${D(P,Q)}$ is based on the Euclidean metric and obviously this can be generalized to other metrics ${\rho}$. Sejdinovic, Gretton, Sriperumbudur and Fukumizu (2012) show that, under certain conditions on ${\rho}$, the resulting test corresponds exactly to some kernel test. Thus we can consider the energy test as a type of kernel test.

5. The Cross-Match Test

There is a long tradition of defining two sample tests based on certain “local coincidences.” I’ll describe just one: the cross-match test due to Paul Rosenbaum (2005).

The test works as follows. Ignore the labels and treat the data as one sample ${Z_1,\ldots, Z_{n+m}}$ of size ${n+m}$. Now form a non-bipartite matching. That is, put the data into non-overlapping pairs. For example, a typical pairing might look like this:

$\displaystyle \{Z_3, Z_8\},\ \{Z_{17}, Z_6\}, \ldots$

(For simplicity, assume there are an even number of observations.) However, we don’t pick just any matching. We choose the matching that minimizes the total distances between the points in each pair. (Any distance can be used.)

Now we look at the group labels. Some pairs will have labels ${(0,0)}$ or ${(1,1)}$ while some will have labels of the form ${(0,1)}$ or ${(1,0)}$. Let ${a_0}$ be the number of pairs of type ${(0,0)}$, let ${a_2}$ be the number of pairs of type ${(1,1)}$, and let ${a_1}$ be the number of pairs of type ${(0,1)}$ or ${(1,0)}$.

Define ${T=a_1}$. Under ${H_0}$, it is as if someone randomly sprinkled 0’s and 1’s onto the data and we can easily compute the exact distribution of ${T}$ under ${H_0}$ which is

$\displaystyle P(T=t) = \frac{2^t N!}{\binom{2N}{m} a_0!\, a_1! \, a_2!}$

where ${N = a_0 + a_1 + a_2}$ is the number of pairs (so that ${2N = n+m}$).

In fact, this distribution can be accurately approximated by a Normal with mean ${nm/(2N-1)}$ and variance ${2 n (n-1)m (m-1)/( (2N-3)(2N-1)^2)}$. (Note that approximating a known, exact distribution which contains no nuisance parameters is quite different than using a test whose null distribution is exact only in the limit.)

So the nice thing about Rosenbaum’s test is that it does not require the permutation method. This might not seem like a big deal but it is useful if you are doing many tests.
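A small illustration of the statistic ${T=a_1}$ (the minimum-distance matching is found here by brute force, which is exponential in the sample size and only usable for tiny examples; a real implementation would use a polynomial-time non-bipartite matching algorithm):

```python
import numpy as np

def min_matching(points):
    """Brute-force minimum total-distance non-bipartite matching of the rows
    of `points` (assumes an even number of points; small samples only)."""
    best = {"pairs": None, "cost": np.inf}
    def rec(remaining, pairs, cost):
        if cost >= best["cost"]:
            return  # prune: already worse than the best matching found
        if not remaining:
            best["pairs"], best["cost"] = pairs, cost
            return
        i, rest = remaining[0], remaining[1:]
        for j in rest:
            d = np.linalg.norm(points[i] - points[j])
            rec([k for k in rest if k != j], pairs + [(i, j)], cost + d)
    rec(list(range(len(points))), [], 0.0)
    return best["pairs"]

def cross_match_stat(x, y):
    """T = a_1, the number of cross-label pairs in the minimal matching."""
    z = np.vstack([x, y])
    labels = [0] * len(x) + [1] * len(y)
    return sum(labels[i] != labels[j] for i, j in min_matching(z))
```

When the two samples form well-separated clusters, the minimal matching pairs points within each sample and ${a_1}$ is small; when the samples are interleaved, many pairs cross.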

6. Which Test Is Best?

There is no best test, of course. More precisely, the power of the test depends on ${P}$ and ${Q}$. No test will uniformly dominate another test.

If you are undecided about which test to use, you can always combine them into one meta-test. Then you can use the permutation method to get the p-value.

Having said that, I think there is room for more research about the merits of various tests under different assumptions, especially in the high-dimensional case.

7. References

Gretton, A., Fukumizu, K., Harchaoui, Z. and Sriperumbudur, B.K. (2012). A fast, consistent kernel two-sample test. Advances in Neural Information Processing Systems, 673–681.

Gretton, A., Borgwardt, K.M., Rasch, M.J., Schölkopf, B. and Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13, 723–773.

Rosenbaum, P.R. (2005). An exact distribution-free test comparing two multivariate distributions based on adjacency. Journal of the Royal Statistical Society: Series B, 67, 515-530.

Sejdinovic, D., Gretton, A., Sriperumbudur, B. and Fukumizu, K. (2012). Hypothesis testing using pairwise distances and associated kernels. arXiv:1207.6076.

Székely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93, 58-80.

Székely, G.J. and Rizzo, M.L. (2004). Testing for equal distributions in high dimension. InterStat, 5.

Székely, G.J. and Rizzo, M.L. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3, 1236-1265.

—Larry Wasserman

1. Absolut
Posted July 14, 2012 at 12:25 pm | Permalink

The Kolmogorov-Smirnov test is worth mentioning as well. In my experience, it performs badly when the distributions $P$ and $Q$ differ in the tails. This is due to the nature of the statistic (the Kolmogorov-Smirnov distance between the corresponding empirical distributions).

• Posted July 14, 2012 at 12:30 pm | Permalink

Good point.
It also does not generalize to high dimensions very well.

—LW

• Absolut
Posted July 14, 2012 at 1:26 pm | Permalink

Indeed, that is a second Achilles’ heel of the KS test.

Do you know how well the permutation method performs in one-dimensional cases where P and Q have different tails but similar “shoulders”?

• Posted July 14, 2012 at 1:32 pm | Permalink

That’s determined by the choice of test statistic.
The permutation method just finds the critical value.
It doesn’t define the test itself.

—LW

2. Anonymous
Posted July 14, 2012 at 5:32 pm | Permalink

Another approach, used with a lot of success by people who work in finance on portfolio risk assessment, is to model the dependency using copulas, usually parametric copulas. In this case, testing for equality reduces to testing independence, that is, whether the copula is a product copula. And I think it is not much used in either Statistics or ML.

3. Anonymous
Posted July 15, 2012 at 12:00 am | Permalink

I wonder what your suggestions are when the sample size is small, which is common in lots of contexts. Permutation or bootstrapping probably will not work well because the best p-value you can get would be large. Also, the KS test will never work in this situation. Not sure what you think of parametric modeling + Bayesian regularization, which usually performs more robustly than general methods? Thanks a lot.
Rank-based nonparametric tests such as Wilcoxon, Mann–Whitney, etc. are also used by many people in statistics, but might be too conservative in many situations.

• Posted July 15, 2012 at 8:27 am | Permalink

I think the large p-value for small samples simply reflects, correctly,
the fact that you don’t have much information. It should be large
in those cases.

—LW

4. Posted July 15, 2012 at 3:15 pm | Permalink

The permutation method is a great idea – do you have any idea how it performs relative to bootstrapping on simulations? i.e. is it more conservative with its test thresholds?

I guess another way to pick thresholds is based on tail bounds for W which are often easy to obtain and more rigorous than plugging in the asymptotic distribution directly without accounting for the errors. At least for the kernel tests, Arthur tells me they aren’t very good though.

There is also this paper by Miles Lopes, Laurent Jacob and Martin Wainwright – for high dimensional two-sample tests when P and Q are Gaussian with different means

http://arxiv.org/abs/1108.2401

• Posted July 15, 2012 at 4:12 pm | Permalink

The permutation test is exact.
The bootstrap is only approximate.

• Keith O'Rourke
Posted July 16, 2012 at 5:03 pm | Permalink

Should it be clear what you mean by exact and approximate here?

Above, for the permutation test, you used a random sample of just 10,000 permutations, while for the naive bootstrap – if the sample size were small enough – one could enumerate all possible sample paths of iid draws from the empirical distribution, taken as the true unknown distribution, and be exact.

I would suggest the assumptions of the permutation test are strong (i.e. the labels were random and nothing else was) but often are ensured to not be too wrong under the (strict Fisher) null.

The combining idea is interesting – given uniform dependent p-values and all the functions that could be entertained to combine them.

• Posted July 16, 2012 at 5:32 pm | Permalink

It is exact no matter how many random permutations you use.
Exact means: Pr(type I error ) <= alpha

The only assumptions are i.i.d.
No stronger than usual

5. Christian Hennig
Posted July 16, 2012 at 10:55 am | Permalink

Nice posting. Regarding this one: “If you are undecided about which test to use, you can always combine them into one meta-test”: is there any reason to believe that the resulting meta-test will be better than a) all b) any or c) a randomly selected one of the three (or more) mentioned approaches? I’m not so sure (which really means that I’m not sure, not that I wanted to say that it’s a bad idea).

• Zach
Posted July 16, 2012 at 11:14 am | Permalink

I second this question

• Posted July 16, 2012 at 11:16 am | Permalink

Good point. It is indeed not really clear which is better

6. Posted July 23, 2012 at 3:40 pm | Permalink

Thanks for this fruitful post.
I am just curious about how to do “two-sample t-tests” when there are covariates? Are there some modern methods?

• Posted July 23, 2012 at 5:15 pm | Permalink

It depends on the setting.
Typically, that corresponds to testing
if a coefficient is 0 in a regression
but a “truly” distribution-free version is not
obvious.

7. Posted July 26, 2012 at 4:56 am | Permalink

Great posting. The extended version of our ICML paper that discusses equivalence of energy tests and kernel tests has now appeared on arXiv:
http://arxiv.org/abs/1207.6076

8. Anonymous
Posted July 29, 2012 at 2:59 pm | Permalink

What you describe as permutation seems to me identical to what I have seen called bootstrapping, where the resampling is done from the “pooled” sample. The label does not concern me, but I do wonder what you mean when you say that the permutation test is exact, while the bootstrap is only approximate.

• Posted July 29, 2012 at 3:19 pm | Permalink

They are similar but not the same.
The bootstrap samples n observations from the empirical distribution.
(Actually, in a testing problem, the empirical distribution has to be corrected to
be consistent with the null hypothesis.)
The type I error goes to 0 as the sample size goes to infinity.
In the permutation test, the type I error is less than alpha.
No large sample approximation needed.
–LW

9. Mike
Posted March 15, 2013 at 4:47 pm | Permalink

I am interested in using the permutation method combined with the energy test to compare the distributions of a training dataset and a dataset generated from a Bayesian belief network that was learned from the training data. I’m a bit of a noob as far as statistical tests go, but how would you calculate the power of such a test, and the minimum sample size? Also, does this testing method require that both datasets come from different sources, as the KS test does? Thanks in advance.