In the olden days, multiple testing meant testing 8 or 9 hypotheses. Today, multiple testing can involve testing thousands or even millions of hypotheses.
A revolution occurred with the publication of Benjamini and Hochberg (1995). The method introduced in that paper has made it feasible to test huge numbers of hypotheses with high power. The Benjamini and Hochberg method is now standard in areas like genomics.
1. Multiple Testing
We want to test a large number of null hypotheses $H_1, \ldots, H_N$. Let $H_j = 0$ if the $j^{\rm th}$ null hypothesis is true and let $H_j = 1$ if the $j^{\rm th}$ null hypothesis is false. For example, $H_j$ might be the hypothesis that there is no difference in mean gene expression level between healthy and diseased tissue, for the $j^{\rm th}$ gene.
For each hypothesis we have a test statistic $T_j$ and a p-value $P_j$ computed from the test statistic. If $H_j$ is true (no difference) then $P_j$ has a uniform distribution on $(0,1)$. If $H_j$ is false (there is a difference) then $P_j$ has some other distribution, typically more concentrated towards 0.
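As a sanity check, here is a quick simulation of this behavior (my own illustration, not from the original post) using one-sided z-tests; the effect size of 2 under the alternative is an arbitrary choice:

```python
# Illustration: p-values are Uniform(0,1) under the null and pile up
# near 0 under the alternative. One-sided z-test: T ~ N(0,1) under H0,
# T ~ N(2,1) under H1 (the effect size 2 is arbitrary).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000

p_null = norm.sf(rng.normal(0.0, 1.0, n))  # sf(t) = P(Z > t), the one-sided p-value
p_alt  = norm.sf(rng.normal(2.0, 1.0, n))

print(f"P(p <= 0.05) under H0: {np.mean(p_null <= 0.05):.3f}")  # roughly 0.05
print(f"P(p <= 0.05) under H1: {np.mean(p_alt  <= 0.05):.3f}")  # much larger
```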
If we were testing one hypothesis, we would reject the null hypothesis if the p-value is less than $\alpha$. The type I error, the probability of a false rejection, is then $\alpha$. But in multiple testing we can't simply reject all hypotheses for which $P_j \leq \alpha$. When $N$ is large, we will make many type I errors.
A common and very simple way to fix the problem is the Bonferroni method: reject $H_j$ when $P_j \leq \alpha/N$. The set of rejected hypotheses is
$$
R = \Bigl\{ j :\ P_j \leq \frac{\alpha}{N} \Bigr\}.
$$
It follows from the union bound that
$$
P({\rm any\ false\ rejections}) \leq \sum_{j \in T} P\Bigl(P_j \leq \frac{\alpha}{N}\Bigr) = \frac{|T|\,\alpha}{N} \leq \alpha,
$$
where $T$ is the set of true null hypotheses. (The equality uses the fact that the null p-values are uniform.)
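In code, Bonferroni is a one-liner. A minimal sketch (the function name and the example p-values are mine):

```python
# Bonferroni: reject H_j exactly when P_j <= alpha / N.
import numpy as np

def bonferroni_reject(pvalues, alpha=0.05):
    """Return a boolean mask of rejected hypotheses."""
    p = np.asarray(pvalues)
    return p <= alpha / p.size

pvals = [0.0001, 0.003, 0.02, 0.4, 0.9]
print(bonferroni_reject(pvals, alpha=0.05))  # only p-values <= 0.05/5 = 0.01 are rejected
```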
The problem with the Bonferroni method is that the power, the probability of rejecting $H_j$ when $H_j$ is false, goes to 0 as $N$ increases.
2. The Benjamini-Hochberg Method

Instead of controlling the probability of any false rejections, the Benjamini-Hochberg (BH) method controls the false discovery rate (FDR), defined to be
$$
{\rm FDR} = E({\rm FDP}), \qquad {\rm FDP} = \frac{F}{\max\{R, 1\}},
$$
where $F$ is the number of false rejections and $R$ is the number of rejections. Here, FDP is the false discovery proportion.
The BH method works as follows. Let
$$
P_{(1)} \leq P_{(2)} \leq \cdots \leq P_{(N)}
$$
be the ordered p-values. The rejection set is
$$
R = \Bigl\{ j :\ P_j \leq T \Bigr\}, \qquad {\rm where}\ \ T = \max\Bigl\{ P_{(i)} :\ P_{(i)} \leq \frac{i\alpha}{N} \Bigr\}
$$
(and no hypotheses are rejected if no such $P_{(i)}$ exists).
(If the p-values are not independent, an adjustment may be required.) Benjamini and Hochberg proved that, if this method is used, then ${\rm FDR} \leq \alpha$.
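A sketch of the BH procedure in NumPy (the helper name and the example p-values are my own, not a standard API):

```python
# BH step-up procedure: find the largest i with P_(i) <= i*alpha/N
# and reject the i smallest p-values.
import numpy as np

def bh_reject(pvalues, alpha=0.05):
    """Boolean mask: True where the hypothesis is rejected at FDR level alpha."""
    p = np.asarray(pvalues)
    N = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, N + 1) / N      # i*alpha/N for i = 1..N
    below = p[order] <= thresholds
    reject = np.zeros(N, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])              # largest i (0-based) meeting the bound
        reject[order[: k + 1]] = True                 # reject the k+1 smallest p-values
    return reject

pvals = np.array([0.001, 0.013, 0.029, 0.032, 0.21, 0.38])
print(bh_reject(pvals, alpha=0.05))                   # first four are rejected
```

Note the step-up character: in this example $0.029$ exceeds its own threshold $3\alpha/N = 0.025$, but it is still rejected because the next p-value, $0.032$, falls below its threshold $4\alpha/N \approx 0.033$.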
3. Why Does It Work?
The original proof that ${\rm FDR} \leq \alpha$ is a bit complicated. A slick martingale proof can be found in Storey, Taylor and Siegmund (2003). Here, I'll give a less than rigorous but very simple proof.
Suppose the fraction of true nulls is $\pi_0 = N_0/N$, where $N_0$ is the number of true nulls. The distribution of the p-values can be written as
$$
F(t) = \pi_0 U(t) + (1 - \pi_0) A(t),
$$
where $U(t) = t$ is the uniform (0,1) distribution (the nulls) and $A$ is some other distribution on (0,1) (the alternatives). Let
$$
\hat F(t) = \frac{1}{N} \sum_{j=1}^N I(P_j \leq t)
$$
be the empirical distribution of the p-values. Suppose we reject all p-values less than $t$. Now
$$
{\rm FDP} = \frac{\#\{{\rm false\ rejections}\}}{\#\{{\rm rejections}\}} \approx \frac{N \pi_0 t}{N \hat F(t)} = \frac{\pi_0 t}{\hat F(t)} \leq \frac{t}{\hat F(t)},
$$
since the null p-values are uniform and so roughly $N \pi_0 t$ of them fall below $t$. Next, let $t$ be equal to one of the ordered p-values, say $t = P_{(i)}$. Thus $\hat F(t) = i/N$, and
$$
{\rm FDP} \lessapprox \frac{t}{\hat F(t)} = \frac{N P_{(i)}}{i}.
$$
Setting the right hand side to be less than or equal to $\alpha$ yields
$$
\frac{N P_{(i)}}{i} \leq \alpha,
$$
or, in other words, choose $i$ to satisfy
$$
P_{(i)} \leq \frac{i \alpha}{N},
$$
which is exactly the BH method.
To summarize: we reject all p-values less than or equal to $T$, where
$$
T = \max\Bigl\{ P_{(i)} :\ P_{(i)} \leq \frac{i\alpha}{N} \Bigr\}.
$$
We then have the guarantee that ${\rm FDR} \leq \alpha$.
The method is simple and, unlike Bonferroni, the power does not go to 0 as $N \to \infty$.
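This is easy to check by simulation. The following Monte Carlo sketch (the 90% null fraction, the effect size of 3, and all other constants are arbitrary choices of mine) compares BH with Bonferroni:

```python
# Monte Carlo check: with N = 1000 hypotheses and 90% true nulls, BH keeps
# the empirical FDR at (or below) alpha while retaining far more power
# than Bonferroni. One-sided z-tests with effect size 3 (arbitrary).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, pi0, alpha, reps = 1000, 0.9, 0.1, 200
is_null = np.arange(N) < int(pi0 * N)

fdp_bh, power_bh, power_bonf = [], [], []
for _ in range(reps):
    means = np.where(is_null, 0.0, 3.0)
    p = norm.sf(rng.normal(means, 1.0))          # one-sided z-test p-values

    # BH: largest i with P_(i) <= i*alpha/N; reject that many smallest p-values
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, N + 1) / N
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    rej_bh = np.zeros(N, dtype=bool)
    rej_bh[order[:k]] = True

    rej_bonf = p <= alpha / N                    # Bonferroni, for comparison

    R = max(rej_bh.sum(), 1)                     # FDP = 0 when nothing is rejected
    fdp_bh.append((rej_bh & is_null).sum() / R)
    power_bh.append(rej_bh[~is_null].mean())
    power_bonf.append(rej_bonf[~is_null].mean())

print(f"BH   empirical FDR: {np.mean(fdp_bh):.3f}  (target <= {alpha})")
print(f"BH   average power: {np.mean(power_bh):.3f}")
print(f"Bonf average power: {np.mean(power_bonf):.3f}")
```

With independent p-values the FDR of BH is actually $\pi_0 \alpha$, so the empirical FDR here should land near $0.09$, comfortably below $\alpha = 0.1$.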
There are now many modifications of the BH method. For example, instead of controlling the mean of the FDP, you can choose the threshold so that
$$
P({\rm FDP} > \gamma) \leq \alpha,
$$
which is called FDP control (Genovese and Wasserman 2006). One can also weight the p-values (Genovese, Roeder, and Wasserman 2006).
FDR methods control the error rate while maintaining high power. But it is important to realize that these methods give weaker control than Bonferroni. FDR controls the (expected) fraction of false rejections. Bonferroni protects you from making any false rejections. Which you should use is very problem dependent.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), 57, 289–300.
Genovese, C.R., Roeder, K. and Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93, 509–524.
Genovese, C.R. and Wasserman, L. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101, 1408–1417.
Storey, J.D., Taylor, J.E. and Siegmund, D. (2003). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 66, 187–205.