How Close Is The Normal Distribution?


One of the first things you learn in probability is that the average {\overline{X}_n} has a distribution that is approximately Normal. More precisely, if {X_1,\ldots, X_n} are iid with mean {\mu} and variance {\sigma^2} then

\displaystyle  Z_n \rightsquigarrow N(0,1)

where

\displaystyle  Z_n = \frac{\sqrt{n}(\overline{X}_n - \mu)}{\sigma}

and {\rightsquigarrow} means “convergence in distribution.”

1. How Close?

But how close is the distribution of {Z_n} to the Normal? The usual answer is given by the Berry-Esseen theorem which says that

\displaystyle  \sup_t |P(Z_n \leq t) - \Phi(t)| \leq \frac{0.4784 \,\beta_3}{\sigma^3 \sqrt{n}}

where {\Phi} is the cdf of a Normal(0,1) and {\beta_3 = \mathbb{E}(|X_i - \mu|^3)} is the centered absolute third moment. This is good news; as long as {\beta_3} is finite, the error of the Normal approximation shrinks at rate {1/\sqrt{n}} and so, for example, confidence intervals based on the Normal approximation can be expected to be accurate too.
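To get a feel for how tight this bound is, here is a minimal Python sketch for the special case where the {X_i} are Bernoulli(p), so that the cdf of {Z_n} can be computed exactly from the Binomial distribution. It assumes {\beta_3} is the centered absolute third moment {\mathbb{E}|X_i-\mu|^3}; the function names are made up for illustration.

```python
import math


def normal_cdf(t):
    """Cdf of a Normal(0,1), written with the error function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def kolmogorov_distance_bernoulli(n, p):
    """Exact sup_t |P(Z_n <= t) - Phi(t)| when X_i ~ Bernoulli(p)."""
    sigma = math.sqrt(p * (1.0 - p))
    cdf = 0.0
    worst = 0.0
    for k in range(n + 1):
        # Z_n jumps at t_k = (k - n p) / (sigma sqrt(n)), k = 0..n.
        t = (k - n * p) / (sigma * math.sqrt(n))
        phi = normal_cdf(t)
        # The sup is attained at a jump: check just before and at the jump.
        worst = max(worst, abs(cdf - phi))
        cdf += math.comb(n, k) * p**k * (1.0 - p) ** (n - k)
        worst = max(worst, abs(cdf - phi))
    return worst


def berry_esseen_bound(n, p, c=0.4784):
    """Right-hand side of the Berry-Esseen bound for Bernoulli(p)."""
    sigma = math.sqrt(p * (1.0 - p))
    # E|X - p|^3 = p (1-p)^3 + (1-p) p^3
    beta3 = p * (1.0 - p) * ((1.0 - p) ** 2 + p**2)
    return c * beta3 / (sigma**3 * math.sqrt(n))


for n in (10, 100, 400):
    print(n, kolmogorov_distance_bernoulli(n, 0.5), berry_esseen_bound(n, 0.5))
```

Both columns shrink like {1/\sqrt{n}}, which is a reminder that for lattice distributions such as the Bernoulli the rate in the theorem cannot be improved.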

But these days we are often interested in high dimensional problems. In that case, we might be interested, not in one mean, but in many means. Is there still a good guarantee for closeness to the Normal limit?

Consider iid random vectors {X_1,\ldots, X_n\in \mathbb{R}^d} with mean vector {\mu} and covariance matrix {\Sigma}. We’d like to say that {\mathbb{P}(Z_n \in A)} is close to {\mathbb{P}(Z \in A)} where {Z_n = \sqrt{n}\,\Sigma^{-1/2}(\overline{X}_n - \mu)} and {Z\sim N(0,I)}. We allow the dimension {d=d_n} to grow with {n}.

One of the best results I know of is due to Bentkus (2003) who proved that

\displaystyle  \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \leq \frac{400\, d^{1/4} \beta}{\sqrt{n}}

where {{\cal A}} is the class of convex sets and {\beta = \mathbb{E} ||X||^3}. Since {||X||^2} is a sum of {d} squared coordinates, we expect {\beta} to be of order {d^{3/2}}, say {\beta = C d^{3/2}}, so the error is of order {O(d^{1/4} d^{3/2}/\sqrt{n}) = O(d^{7/4}/\sqrt{n})}. This means that we must have {d = o(n^{2/7})} to make the error go to 0 as {n\rightarrow\infty}.
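To make the arithmetic concrete, here is a small Python sketch that plugs the heuristic {\beta = C d^{3/2}} into the bound. The constant `c_beta` (defaulting to 1) and the function names are assumptions for illustration only; the real value of {\beta} depends on the distribution of {X}.

```python
import math


def bentkus_bound(n, d, c_beta=1.0):
    """400 d^{1/4} beta / sqrt(n) with the heuristic beta = c_beta * d^{3/2}."""
    beta = c_beta * d**1.5
    return 400.0 * d**0.25 * beta / math.sqrt(n)


def min_n_for_error(d, eps, c_beta=1.0):
    """Smallest n for which the bound above is at most eps."""
    # Solve 400 c_beta d^{7/4} / sqrt(n) <= eps for n.
    return math.ceil((400.0 * c_beta * d**1.75 / eps) ** 2)


for d in (1, 10, 100):
    print(d, min_n_for_error(d, eps=0.1))
```

Because the required {n} scales like {d^{7/2}}, multiplying the dimension by 10 multiplies the sample size needed by about 3,000.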

2. Ramping Up The Dimension

So far we need {d^{7/2}/n \rightarrow 0} to justify the Normal approximation, which is a serious restriction. Most of the current results in high dimensional inference, such as those for the lasso, do not place such a severe restriction on the dimension. Can we do better than this?

Yes. Right now we are witnessing a revolution in Normal approximations thanks to Stein’s method.
This is a method for bounding the error of Normal approximations, invented by Charles Stein in 1972.

Although the method is 40 years old, there has recently been an explosion of interest in it. Two excellent references are the book by Chen, Goldstein and Shao (2010) and the review article by Nathan Ross (2011).

An example of the power of this method is the very recent paper by Victor Chernozhukov, Denis Chetverikov and Kengo Kato. They showed that, if we restrict {{\cal A}} to rectangles rather than convex sets, then

\displaystyle  \sup_{A\in {\cal A}} | \mathbb{P}(Z_n \in A) - \mathbb{P}(Z \in A) | \rightarrow 0

as long as {(\log d)^7/n \rightarrow 0}. (In fact, they use a lot of tricks besides Stein’s method but Stein’s method plays a key role).

This is an astounding improvement. We only need {d} to be smaller than {e^{n^{1/7}}} instead of {n^{2/7}}.

The restriction to rectangles is not so bad; it leads immediately to a confidence rectangle for the mean, for example. The authors show that their results can be used to derive further results for bootstrapping, for high-dimensional regression and for hypothesis testing.

I think we are seeing the beginning of a new wave of results on high dimensional Berry-Esseen theorems. I will do a post in the future on Stein’s method.


Bentkus, Vidmantas. (2003). On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference, 113, 385-402.

Chen, Louis, Goldstein, Larry and Shao, Qi-Man. (2010). Normal approximation by Stein’s method. Springer.

Chernozhukov, Victor, Chetverikov, Denis and Kato, Kengo. (2012). Central Limit Theorems and Multiplier Bootstrap when p is much larger than n.

Ross, Nathan. (2011). Fundamentals of Stein’s method. Probability Surveys, 8, 210-293.

Stein, Charles. (1986). Approximate computation of expectations. Lecture Notes-Monograph Series 7.


  1. Posted February 4, 2013 at 10:22 am

    What still surprises me about statistics, or the way statisticians do their business, is the following: The Berry-Esseen theorem says that a confidence interval chosen based on the CLT is possibly too short by a good amount, of order $c/\sqrt{n}$. Despite this, statisticians keep telling me that they prefer their “shorter” CLT-based confidence intervals to ones derived by using the finite-sample tail inequalities that we “machine learning people” prefer (lies vs. honesty?). I could never understand the logic behind this reasoning and I am wondering if I am missing something. One possible answer is that the Berry-Esseen result could oftentimes be loose. Indeed, if the summands are normally distributed, the difference will be zero. Thus, an interesting question is the behavior of the worst-case deviation under assumptions (or: for what class of distributions is the Berry-Esseen result tight?). It would be interesting to read about this, or in general, about why statisticians prefer to possibly make big mistakes rather than say “I don’t know” more often.

    • Posted February 4, 2013 at 11:57 am

      I prefer finite sample intervals.
      The problem is that there are many, many, many, practical problems where there aren’t
      any known finite sample intervals and we are forced to use Normal approximations.
      It’s by necessity not preference.

    • Keith O'Rourke
      Posted February 4, 2013 at 12:11 pm


      You might find this paper of interest – Stigler, Stephen M. “The changing history of robustness.” The American Statistician 64.4 (2010): 277-281.

      • Posted February 17, 2013 at 12:32 am

        Interesting read, thanks for the pointer!
        So the conclusion, with an analogy from the paper, is: “In the United States many consumers are entranced by the magic of the new iPhone, even though they can only use it with the AT&T system, a system noted for spotty coverage—even no receivable signal at all under some conditions. But the magic available when it does work overwhelms the very real shortcomings.” So classical results are like the iPhone.
        I can understand this:) But then, again, I own an Android phone and tablet;)
