Testing Millions of Hypotheses: FDR

In the olden days, multiple testing meant testing 8 or 9 hypotheses. Today, multiple testing can involve testing thousands or even million of hypotheses.

A revolution occurred with the publication of Benjamini and Hochberg (1995). The method introduced in that paper has made it feasible to test huge numbers of hypotheses with high power. The Benjamini and Hochberg method is now standard in areas like genomics.

1. Multiple Testing

We want to test a large number of null hypotheses {H_1,\ldots, H_N}. Let {H_j =0} if the {j^{\rm th}} null hypothesis is true and let {H_j =1} if the {j^{\rm th}} null hypothesis is false. For example, {H_j} might be the hypothesis that there is no difference in mean gene expression level between healthy and diseased tissue, for the {j^{\rm th}} gene.

For each hypothesis {H_j} we have a test statistic {Z_j} and a p-value {P_j} computed from the test statistic. If {H_j} is true (no difference) then {P_j} has a uniform distribution {U} on {(0,1)}. If {H_j} is false (there is a difference) then {P_j} has some other distribution, typically more concentrated towards 0.

If we were testing one hypotheses, we would reject the null hypothesis if the p-value is less than {\alpha}. The type I error — the probability of a false rejection — is then {\alpha}. But in multiple testing we can’t simply reject all hypotheses for which {P_j \leq \alpha}. When {N} is large, we will make many type I errors.

A common and very simple way to fix the problem is the Bonferroni method: reject when {P_j\leq \alpha/N}. The set of rejected hypotheses is

\displaystyle  R = \Bigl\{j:\ P_j \leq \frac{\alpha}{N}\Bigr\}.

It follows from the union bound that

\displaystyle  {\rm Probability\ of\ any\ false\ rejections\ }= P(R \cap {\cal N}) \leq \alpha

where {{\cal N} = \{j:\ H_j =0\}} is the set of true null hypotheses.

The problem with the Bonferroni method is that the power — the probability of rejecting {H_j} when {H_j=1} — goes to 0 as {N} increases.

2. FDR

Instead of controlling the probability of any false rejections, the Benjamini-Hochberg (BH) method controls the false discovery rate (FDR) defined to be

\displaystyle  {\rm FDR} = \mathbb{E}({\rm FDP})


\displaystyle  {\rm FDP} = \frac{F}{R},

{F} is the number of false rejections and {R} is the number of rejections. Here, FDP is the false discovery proportion.

The BH method works as follows. Let

\displaystyle  P_{(1)}\leq P_{(2)}\leq \cdots \leq P_{(N)}

be the ordered p-values. The rejection set is

\displaystyle  R = \Bigl\{j:\ P_j \leq T\Bigr\}

where {T=P_{(j)}} and

\displaystyle  j = \max \left\{ s:\ P_{(s)} \leq \frac{\alpha s}{N} \right\}.

(If the p-values are not independent, an adjustment may be required.) Benjamini and Hochberg proved that, if this method is used then {{\rm FDR} \leq \alpha}.

3. Why Does It Work?

The original proof that {{\rm FDR} \leq \alpha} is a bit complicated. A slick martingale proof can be found in Storey, Taylor and Siegmund (2003). Here, I’ll give a less than rigorous but very simple proof.

Suppose the fraction of true nulls is {\pi}. The distribution of the p-values can be written as

\displaystyle  G = \pi U + (1-\pi)A

where {U(t)=t} is the uniform (0,1) distribution (the nulls) and {A} is some other distribution on (0,1) (the alternatives). Let

\displaystyle  \hat G(t) = \frac{1}{N}\sum_{j=1}^N I(P_j \leq t)

be the empirical distribution of the p-values. Suppose we reject all p-values less than {t}. Now

\displaystyle  F = \sum_{j=1}^n I(P_j\leq t, H_j=0)


\displaystyle  R = \sum_{j=1}^n I(P_j\leq t) = N \hat{G}(t).


\displaystyle  \begin{array}{rcl}  \mathbb{E}\left( \frac{F}{R}\right) &=& \mathbb{E}\left(\frac{\sum_{j=1}^n I(P_j\leq t, H_j=0)}{\sum_{j=1}^n I(P_j\leq t)}\right)\\ &=& \frac{\mathbb{E}\left(\sum_{j=1}^n I(P_j\leq t, H_j=0)\right)} {\mathbb{E}\left(\sum_{j=1}^n I(P_j\leq t)\right)} + O\left(\sqrt{\frac{1}{N}}\right)\\ &=& \frac{N P(P_j \leq t|H_j=0) P(H_j=0)}{N G(t)} + O\left(\sqrt{\frac{1}{N}}\right)\\ &=& \frac{t\pi}{G(t)} + O\left(\sqrt{\frac{1}{N}}\right)= \frac{t\pi}{\hat G(t)} + O_P\left(\sqrt{\frac{1}{N}}\right) \end{array}

and so

\displaystyle  {\rm FDR} \approx \frac{t\pi}{\hat G(t)}.

Now let {t} be equal to one of the ordered p-values, say {P_{(j)}}. Thus {t = P_{(j)}}, {\hat G(t) = j/N} and

\displaystyle  {\rm FDR} \approx \frac{\pi P_{(j)}N}{j} \leq \frac{P_{(j)}N}{j}.

Setting the right hand side to be less than or equal to {\alpha} yields

\displaystyle  \frac{N P_{(j)}}{j} \leq \alpha

or in other words, choose {t=P_{(j)}} to satisfy

\displaystyle  P_{(j)} \leq \frac{\alpha j}{N}

which is exactly the BH method.

To summarize: we reject all p-values less than {T=P_{(j)}} where

\displaystyle  j = \max \left\{ s:\ P_{(s)} \leq \frac{\alpha s}{N} \right\}.

We then have the guarantee that {FDR \leq \alpha}.

The method is simple and, unlike Bonferroni, the power does not go to 0 as {N\rightarrow\infty}.

There are now many modifications to the BH method. For example, instead of controlling the mean of the FDP you can choose {T} so that

\displaystyle  P({\rm FDP} > \alpha) \leq \beta

which is called FDP control. (Genovese and Wasserman 2006). One can also weight the p-values (Genovese, Roeder, and Wasserman 2006).

4. Limitations

FDR methods control the error rate while maintaining high power. But it is important to realize that these methods give weaker control than Bonferroni. FDR controls the (expected) fraction of false rejections. Bonferroni protects you from making any false rejections. Which you should use is very problem dependent.

5. References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological). 289–300.

Genovese, C.R. and Roeder, K. and Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93, 509-524.

Genovese, C.R. and Wasserman, L. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101, 1408-1417.

Storey, J.D. and Taylor, J.E. and Siegmund, D. (2003). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 187–205.


  1. Randall
    Posted October 5, 2012 at 3:23 pm | Permalink

    I think the “limitations” section of this post is a little misleading. It says “Bonferroni protects you from making any false rejections.” That is not quite right: Bonferroni holds the overall (across-test) p-value to the specified alpha (usually 0.05), so that the probability of making one or more false rejections is equal to alpha (not zero).

  2. Christian Hennig
    Posted October 9, 2012 at 11:49 am | Permalink

    If \pi=1, all rejections are false discoveries and FDR=1, isn’t it? So probably \pi<1 needs to be assumed but I don't see how the proof breaks down with \pi=1. I must be missing something. Or is it just that the statement holds for N to infinity and then j will converge to zero if \pi=1?

    • Posted October 10, 2012 at 8:25 am | Permalink

      Well we have:
      FDR(t) <= Upper(t)
      then setting Upper(t)=alpha and solving for t
      gives a valid threshold t.

      • Posted October 11, 2012 at 11:12 am | Permalink

        Probably my simplified derivation is just confusing.
        Check out the original BH paper and I think it will be clearer

      • Christian Hennig
        Posted October 11, 2012 at 9:59 am | Permalink

        Sorry for still being stupid but if \pi=1 and alpha smaller, this has to be wrong because FDR=1 constant, no??

    • Posted October 9, 2012 at 3:56 pm | Permalink

      It’s a conservative upper bound but it allows you to get a threshold
      without estimating pi.

      • Christian Hennig
        Posted October 10, 2012 at 7:00 am | Permalink

        OK, let me ask this again in a different way: Isn’t the FDR constant 1 if \pi=1? And doesn’t this imply that the given upper bound in this case is wrong? (I haven’t looked up literature to check this; the question comes just from the way you wrote it down. In the literature \pi=1 may be prohibited.)

    • wguo
      Posted January 13, 2013 at 3:45 am | Permalink

      The FDR is precisely defined as E( F I( F > 0) /R ) rather than E (F/R), where I is the indicator function. Thus, if \pi = 1, the FDR is equal to FWER = Pr (F > 0) rather than 1. Such relationship between the FDR and FWER was pointed out in the original BH paper.

  3. rfox
    Posted October 14, 2012 at 8:40 pm | Permalink

    You say “FDR methods control the error rate while maintaining high power”. This is highly misleading. The “power” advantage is relative to other error controls, like FWER (as you explain). But there is no guarantee of absolute “power”. For a moderate FDR rate – say 10% – absolute “power” still depends crucially on how many true NULL’s are lurking. For “needle in the NULL haystack” applications (and there are many) FDR is no inferential panacea.

  4. Posted November 13, 2012 at 10:35 am | Permalink

    Embarrassingly dumb question. In section 3, how do you get from E(F/R) = E(F)/E(R) + O(N^(-1/2))?


Get every new post delivered to your Inbox.

Join 1,015 other followers

%d bloggers like this: