Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein’s paradox.

When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don’t remember. My first memory is watching the Beatles on Ed Sullivan.)

Similarly, statisticians used to say: do you remember where you were when you heard about Stein’s paradox? That’s how surprising it was. (I don’t remember since I wasn’t born yet.)

Here is the paradox. Let ${X \sim N(\theta,1)}$. Define the risk of an estimator ${\hat\theta}$ to be

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta (\hat\theta-\theta)^2 = \int (\hat\theta(x) - \theta)^2 p(x;\theta) dx.$

An estimator ${\hat\theta}$ is inadmissible if there is another estimator ${\theta^*}$ with smaller risk. In other words, if

$\displaystyle R_{\theta^*}(\theta) \leq R_{\hat\theta}(\theta) \ \ {\rm for\ all\ }\theta$

with strict inequality at at least one ${\theta}$.

Question: Is ${\hat \theta \equiv X}$ admissible?

Now suppose that ${X \sim N(\theta,I)}$ where now ${X=(X_1,X_2)^T}$, ${\theta = (\theta_1,\theta_2)^T}$ and

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.$

Question: Is ${\hat \theta \equiv X}$ admissible?

Now suppose that ${X \sim N(\theta,I)}$ where now ${X=(X_1,X_2,X_3)^T}$, ${\theta = (\theta_1,\theta_2,\theta_3)^T}$ and

$\displaystyle R_{\hat\theta}(\theta) = \mathbb{E}_\theta ||\hat\theta - \theta||^2.$

Question: Is ${\hat \theta \equiv X}$ admissible?

If you don’t find this surprising then either you’ve heard this before or you’re not thinking hard enough. Keep in mind that the coordinates of the vector ${X}$ are independent. And the ${\theta_j's}$ could have nothing to do with each other. For example, ${\theta_1 = }$ mass of the moon, ${\theta_2 = }$ price of coffee and ${\theta_3 = }$ temperature in Rome.

In general, ${\hat\theta \equiv X}$ is inadmissible if the dimension ${k}$ of ${\theta}$ satisfies ${k \geq 3}$.

The proof that ${X}$ is inadmissible is based on defining an explicit estimator ${\theta^*}$ that has smaller risk than ${X}$. For example, the James-Stein estimator is

$\displaystyle \theta^* = \left( 1 - \frac{k-2}{||X||^2}\right) X.$

It can be shown that the risk of this estimator is strictly smaller than the risk of ${X}$, for all ${\theta}$. This implies that ${X}$ is inadmissible. If you want to see the detailed calculations, have a look at Iain Johnstone’s book, which he makes freely available on his website.
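The risk comparison can also be checked numerically. Below is a minimal Monte Carlo sketch in plain Python. The function names `james_stein` and `estimate_risks`, the number of trials, the dimension ${k=10}$ and the test values of ${\theta}$ are all illustrative choices, not part of the original calculations.

```python
import random


def james_stein(x):
    """Shrink x toward the origin by the factor 1 - (k - 2) / ||x||^2."""
    k = len(x)
    norm_sq = sum(xi * xi for xi in x)  # zero only with probability 0
    shrink = 1.0 - (k - 2) / norm_sq
    return [shrink * xi for xi in x]


def estimate_risks(theta, n_trials=20000, seed=0):
    """Monte Carlo estimates of E||X - theta||^2 and E||theta* - theta||^2."""
    rng = random.Random(seed)
    loss_x = 0.0
    loss_js = 0.0
    for _ in range(n_trials):
        # Draw X ~ N(theta, I): independent unit-variance coordinates.
        x = [rng.gauss(t, 1.0) for t in theta]
        js = james_stein(x)
        loss_x += sum((xi - t) ** 2 for xi, t in zip(x, theta))
        loss_js += sum((ji - t) ** 2 for ji, t in zip(js, theta))
    return loss_x / n_trials, loss_js / n_trials


if __name__ == "__main__":
    # Hypothetical test points: at the origin, nearby, and far away.
    for theta in ([0.0] * 10, [1.0] * 10, [5.0] * 10):
        risk_x, risk_js = estimate_risks(theta)
        print(theta[0], round(risk_x, 3), round(risk_js, 3))
```

The simulated risk of ${X}$ should sit near ${k}$ at every ${\theta}$, while the simulated risk of ${\theta^*}$ should be smallest near the origin and approach ${k}$ as ${||\theta||}$ grows.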

Note that the James-Stein estimator shrinks ${X}$ towards the origin. (In fact, you can shrink towards any point; there is nothing special about the origin.) This can be viewed as an empirical Bayes estimator where ${\theta}$ has a prior of the form ${\theta \sim N(0,\tau^2)}$ and ${\tau}$ is estimated from the data. The Bayes explanation gives some nice intuition. But it’s also a bit misleading. The Bayes explanation suggests we are shrinking the means together because we expect them a priori to be similar. But the paradox holds even when the means are not related in any way.
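For readers who want the empirical Bayes calculation spelled out, here is a sketch under the assumption that the prior is ${\theta \sim N(0,\tau^2 I)}$, with ${I}$ the ${k \times k}$ identity (the vector form of the prior above). If ${\tau}$ were known, the Bayes estimator under squared error loss would be the posterior mean

$\displaystyle \mathbb{E}(\theta \mid X) = \left(1 - \frac{1}{1+\tau^2}\right) X.$

Marginally ${X \sim N(0,(1+\tau^2)I)}$, so ${||X||^2/(1+\tau^2) \sim \chi^2_k}$, and since ${\mathbb{E}[1/\chi^2_k] = 1/(k-2)}$ for ${k \geq 3}$,

$\displaystyle \mathbb{E}\left[\frac{k-2}{||X||^2}\right] = \frac{1}{1+\tau^2}.$

So ${(k-2)/||X||^2}$ is an unbiased estimate of the unknown shrinkage factor, and plugging it into the posterior mean gives exactly ${\theta^*}$.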

Some intuition can be gained by thinking about function estimation. Consider a smooth function ${f(x)}$. Suppose we have data

$\displaystyle Y_i = f(x_i) + \epsilon_i$

where ${x_i = i/n}$ and ${\epsilon_i \sim N(0,1)}$. Let us expand ${f}$ in an orthonormal basis: ${f(x) = \sum_j \theta_j \psi_j(x)}$. To estimate ${f}$ we need only estimate the coefficients ${\theta_1,\theta_2,\ldots}$. Note that ${\theta_j = \int f(x) \psi_j(x) dx}$. This suggests the estimator

$\displaystyle \hat\theta_j = \frac{1}{n}\sum_{i=1}^n Y_i \psi_j(x_i).$

But the resulting function estimator ${\hat f(x) = \sum_j \hat\theta_j \psi_j(x)}$ is useless because it is too wiggly. The solution is to smooth the estimator; this corresponds to shrinking the raw estimates ${\hat\theta_j}$ towards 0. This adds bias but reduces variance. In other words, the familiar process of smoothing, which we use all the time for function estimation, can be seen as “shrinking estimates towards 0” as with the James-Stein estimator.
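To see the smoothing-as-shrinkage idea in action, here is a small sketch in plain Python. Everything specific in it is an assumption made for illustration: the cosine basis, the test function `f`, the number of coefficients, and the fixed weights ${1/(1+(j/h)^4)}$ with a hypothetical bandwidth ${h=8}$. Note that these are fixed linear shrinkage weights, not the data-driven James-Stein factor.

```python
import math
import random


def psi(j, x):
    """Cosine basis on [0, 1]: psi_0 = 1, psi_j = sqrt(2) * cos(pi * j * x)."""
    return 1.0 if j == 0 else math.sqrt(2.0) * math.cos(math.pi * j * x)


def f(x):
    """A smooth test function (an arbitrary choice for this sketch)."""
    return math.cos(2 * math.pi * x) + 0.5 * math.cos(3 * math.pi * x)


def fit(n=200, n_coef=50, bandwidth=8.0, seed=0):
    """Return the average squared error of the raw and the shrunk fits."""
    rng = random.Random(seed)
    xs = [i / n for i in range(1, n + 1)]
    ys = [f(x) + rng.gauss(0.0, 1.0) for x in xs]

    # Raw estimates: theta_hat_j = (1/n) * sum_i Y_i * psi_j(x_i).
    raw = [sum(y * psi(j, x) for x, y in zip(xs, ys)) / n
           for j in range(n_coef)]
    # Shrink every coefficient toward 0, more strongly at high frequencies.
    shrunk = [t / (1.0 + (j / bandwidth) ** 4) for j, t in enumerate(raw)]

    def avg_sq_error(coefs):
        # Average of (f_hat(x_i) - f(x_i))^2 over the design points.
        total = 0.0
        for x in xs:
            f_hat = sum(c * psi(j, x) for j, c in enumerate(coefs))
            total += (f_hat - f(x)) ** 2
        return total / n

    return avg_sq_error(raw), avg_sq_error(shrunk)


if __name__ == "__main__":
    raw_err, shrunk_err = fit()
    print("raw:", round(raw_err, 4), "shrunk:", round(shrunk_err, 4))
```

The shrunk fit trades a little bias in the low-frequency coefficients for a large drop in variance from the high-frequency ones, which is the same trade the James-Stein estimator makes.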

If you are familiar with minimax theory, you might find the Stein paradox a bit confusing. The estimator ${\hat\theta = X}$ is minimax, that is, its risk achieves the minimax bound

$\displaystyle \inf_{\hat\theta}\sup_\theta R_{\hat\theta}(\theta).$

This suggests that ${X}$ is a good estimator. But Stein’s paradox tells us that ${\hat\theta = X}$ is inadmissible which suggests that it is a bad estimator.

Is there a contradiction? No. The risk ${R_{\hat\theta}(\theta)}$ of ${\hat\theta=X}$ is a constant. In fact, ${R_{\hat\theta}(\theta)=k}$ for all ${\theta}$ where ${k}$ is the dimension of ${\theta}$. The risk ${R_{\theta^*}(\theta)}$ of the James-Stein estimator is less than the risk of ${X}$, but ${R_{\theta^*}(\theta)\rightarrow k}$ as ${||\theta||\rightarrow \infty}$. So they have the same maximum risk.
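A sketch of why, using a standard identity that is not stated above (so treat it as an added fact rather than part of the original argument): for ${k \geq 3}$,

$\displaystyle R_{\theta^*}(\theta) = k - (k-2)^2\, \mathbb{E}_\theta\left[\frac{1}{||X||^2}\right].$

The expectation is strictly positive, so the risk is always below ${k}$. At ${\theta = 0}$ we have ${||X||^2 \sim \chi^2_k}$ and ${\mathbb{E}[1/\chi^2_k] = 1/(k-2)}$, giving ${R_{\theta^*}(0) = k - (k-2) = 2}$. As ${||\theta|| \rightarrow \infty}$ the expectation tends to 0, so the risk climbs towards ${k}$ without reaching it.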

On the one hand, this tells us that a minimax estimator can be inadmissible. On the other hand, in some sense it can’t be “too far” from admissible since they have the same maximum risk.

Stein first reported the paradox in 1956. I suspect that fewer and fewer people include the Stein paradox in their teaching. (I’m guilty.) This is a shame. Paradoxes really grab students’ attention. And, in this case, the paradox is really fundamental to many things including shrinkage estimators, hierarchical Bayes, and function estimation.

1. Posted May 18, 2013 at 5:48 pm | Permalink

Thanks Larry for bringing more attention to this. If your hunch that fewer people are teaching Stein’s paradox is correct, I think that’s awful! I’m lucky to have learned a lot about Stein’s paradox from my colleague Carl Morris, who was a student of Stein and pioneered the empirical Bayes approach/interpretation with Brad Efron. We teach it every year in our first year graduate inference course, relating it to shrinkage estimation, hierarchical models, regression toward the mean, and Stein’s Unbiased Risk Estimate (SURE).

The empirical Bayes explanation gives good intuition, as you mentioned, but I think it’s more than that. The Efron-Morris paper “Stein’s Estimation Rule and Its Competitors — An Empirical Bayes Approach” (JASA 1973, http://faculty.chicagobooth.edu/nicholas.polson/teaching/41900/efron-morris2.pdf ), has a proof that I consider stunningly beautiful. They give a rigorous proof of Stein’s result for the one-level model (no prior imposed on the theta_j), by first assuming the two-level model where the theta_j are themselves Normal with a common mean and variance. That sounds like a paradox in its own right: how can one assume such nice additional structure and then have any hope of obtaining the fully general result that assumes nothing about the theta_j? But they do exactly that, by using the notion of completeness of a statistic to reduce from the Bayes risk to the frequentist risk.

Then their classic “baseball paper” (JASA 1975, http://faculty.chicagobooth.edu/nicholas.polson/teaching/41900/efron-morris1.pdf ) showed that the gains from shrinkage estimation can be very substantial. The hierarchical model perspective, together with thinking carefully about the loss function, helps clarify when it would make sense in practice to combine very different problems into a shrinkage estimator.

Also, I think it’s a bit misleading to suggest that the minimax estimator (which is also the MLE and has various other nice properties) is not “too far” from admissible just based on the supremum of the risk. The risk function increases in the squared norm of theta, starting at 2 and asymptotically approaching (but never reaching) k. If k is even moderately large, there will be a wide range of parameter values where the improvement in risk is dramatic.

This comment is getting long, but I also wanted to mention that Stigler’s paper (Stat Sci 1990, http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1177012274 ) gives a neat connection between Stein’s paradox and regression.

• Posted May 18, 2013 at 5:54 pm | Permalink

Hopefully my hunch is wrong!
Thanks for the references
Larry

• Posted May 19, 2013 at 12:08 am | Permalink

Brad Efron and Carl Morris’s 1977 Scientific American paper is an awesome intro on Stein Paradox for anyone who is uninitiated in statistics like me. http://www-stat.stanford.edu/~ckirby/brad/other/Article1977.pdf

• bayesrules
Posted May 19, 2013 at 10:03 am | Permalink

Indeed, this is where I first learned about Stein’s paradox. To this day I can recall being outraged by the non-intuitive fact that you can decrease the risk of estimating several things even if they have nothing to do with each other by doing this sort of thing. (I should say that at the time I was mostly doing astronomy…I knew only the statistics that astronomers learned in graduate school, and it was a decade before I became interested in Bayesian ideas and in fact it was before I knew that they existed! It was later that Carl Morris came to Texas and I had the opportunity to learn about it from him, and even later that I learned about shrinkage estimators, hierarchical Bayes models and so on).

I do introduce Stein’s paradox to my grad students, and I have a friend over in the medical school who has been using shrinkage estimators in his work on hospital outcomes and who has given guest lectures using the Efron-Morris article in Scientific American to my sophomore honors students in my course on Bayesian inference and decision theory (I talked about this course at Jim Berger’s 60th birthday party in San Antonio). He is particularly enamored of Brad and Carl’s toxoplasmosis example, also in the SA article.

2. Posted May 18, 2013 at 6:26 pm | Permalink

I didn’t learn about it in school, but we learned about something that reminds me a lot of this paradox: the fact that the MLE for the variance of a normal distribution was not the one with LSE (the LSE having the denominator N+1 instead of N). When I asked why people would not use the LSE instead of the unbiased estimator (denominator N-1), my professor kinda waved the question away with a “Not everything that shines is gold”.

So I guess the reason why people still use the MLE instead of Stein’s or other LSE estimators is that the latter might produce big errors occasionally despite having less error on average; in other words, people prefer to be safe rather than sorry, and having small advantages on average might not be worth being horribly wrong sometimes… Nobody wants to be a victim of Murphy’s law.

• Christian Hennig
Posted May 22, 2013 at 9:36 am | Permalink

Fran: I think that the squared loss (and in fact any symmetric loss) is inappropriate for variance estimation. If you use 1/(N+1), you favour small negative errors over slightly larger positive errors, but the former are more problematic in most situations because the variance is bounded by zero from below.

3. rj444
Posted May 18, 2013 at 9:29 pm | Permalink

I didn’t come from a traditional statistics training, so I personally stumbled across it on my own through Efron’s popular science article in Scientific American. It certainly blew my mind when I first tried to wrap my head around it.

If we’re going to start teaching data analysts (whether they’re machine learners, computer scientists, or statisticians) how to work with high dimensional data (which seems to be paying the bills these days), Stein’s paradox should really be foundational and not obscure. Perhaps then we’d stop seeing entire fields doing analyses with 100,000 independent MLE estimates / hypothesis tests.

4. Corey
Posted May 19, 2013 at 1:02 pm | Permalink

I’d be curious to hear your views on the relevance (or lack thereof) of Wald’s complete class theorem to statistical inference.

• Posted May 19, 2013 at 1:42 pm | Permalink

I don’t think about complete class theorems at all

• Corey
Posted May 19, 2013 at 6:11 pm | Permalink

Huh. Stein was working on necessary and sufficient conditions for admissibility (i.e., generic conditions for complete classes) around the same time that he found his inadmissibility result…

5. bayesrules
Posted May 19, 2013 at 2:56 pm | Permalink

Larry, you wrote: “The proof that $X$ is inadmissible is based on defining an explicit estimator $\theta^*$ that has smaller risk than $X$.”

I had thought that Stein’s original proof was nonconstructive, and only later did he and James come up with the James-Stein estimator that you discuss here. (I admit that I haven’t read Stein’s original paper so my impression may be incorrect).

• Posted May 19, 2013 at 2:58 pm | Permalink

I should have said “a proof” rather than
“the proof”

• Posted May 26, 2013 at 5:27 pm | Permalink

Another interesting thing to mention about the paradox is that the James-Stein estimator is also inadmissible, since it’s dominated by the positive-part James-Stein estimator, which is also inadmissible — as far as I know, no-one has come up with an admissible estimator that dominates the sample average.

6. george
Posted May 19, 2013 at 3:32 pm | Permalink

Have you considered viewing the Stein result as a criticism of the quadratic loss, or of admissibility?

In your example with the mass of the moon etc, is it obvious we should be using the specified loss, if it rewards shrinking together totally unrelated quantities? Similarly, does insisting on admissibility, instead of not “too far” from admissible, really reflect the class of estimators we’re prepared to use in practice?

Of course, these concerns don’t rule out using shrinkage estimators sometimes.

• Posted May 19, 2013 at 3:37 pm | Permalink

True. That loss assumes you are interested in the overall error.
You might end up estimating some components poorly.
It is, nonetheless, still a rather surprising phenomenon.

• Posted May 21, 2013 at 8:38 am | Permalink

True, but somehow averaging over errors though not assuming similarity assumes not terribly different (or exchangeable?)

7. Posted May 19, 2013 at 11:31 pm | Permalink

“Similarly, statisticians used to say: do you remember where you were when you heard about Stein’s paradox? That’s how surprising it was. (I don’t remember since I wasn’t born yet.)”

Um, you weren’t born yet when you first heard about Stein’s Paradox?

8. Z
Posted May 20, 2013 at 12:52 am | Permalink

Are there alternative loss functions that take care of Stein’s paradox and don’t introduce new paradoxes of their own?

9. Peyman Milanfar
Posted May 20, 2013 at 11:49 pm | Permalink

Lamentably in (statistical) signal processing applications, we do not teach this at all. This is all the more surprising given that shrinkage estimators are used routinely.

10. Zen
Posted May 21, 2013 at 1:21 am | Permalink

I would add that if $||X||^2 < k - 2$, then it "shrinks" past the origin.

• Posted May 21, 2013 at 3:05 am | Permalink

True. In practice, one uses the
“positive part” shrinkage estimator which avoids this
problem.

11. Zen
Posted May 21, 2013 at 1:22 pm | Permalink

Just one more quick comment: if we use a prior $N_n(0,\tau^2 I)$ for $\theta$, a simple computation shows that the Bayes estimator with quadratic loss is $X \frac{\tau^2}{\tau^2 + 1}$. The complete class theorem of Wald tells us that this Bayes estimator is admissible. Now, since we can take $\tau$ to be a huge number (say a googol), and that makes this admissible estimator almost surely as close to $X$ as we may want, should we consider the James-Stein estimator as a real improvement over just $X$?

• Posted May 21, 2013 at 1:33 pm | Permalink

I’m not sure what you mean by saying they are “almost surely close.”
I guess you mean that X is not Bayes but is the limit of Bayes rules
(which is how one proves it is minimax).
Nonetheless, the risk function
of X and the James-Stein estimator are very different near the origin.
But, with a very flat prior, we would put low prior probability near the origin
so we would not be impressed by improvement in that region.
In other words, ‘shrinking towards the origin’ and ‘using a very flat prior’
are at odds with each other.
If we’re interested in shrinkage estimators one might say that we are implicitly
interested in a prior with mass near the origin.

• george
Posted May 21, 2013 at 8:32 pm | Permalink

“one might say that we are implicitly interested in a prior with mass near the origin”

Not really. Such a prior is one way, but not the only way, to justify shrinkage estimators. We may use e.g. lasso estimators, or their close relatives, not because we have prior belief that several coefficients are truly zero or close to zero, but because we have a loss that rewards estimates that contain several zero or near-zero terms.

12. Zen
Posted May 21, 2013 at 2:28 pm | Permalink

Thank you for your reply, Larry. What I meant by “almost surely close” is that we would be certain that for any given realization of the experiment, the numbers $X$ and $X \frac{\tau^2}{\tau^2 + 1}$ are not much different. I mean, in practice, both the admissible $X \frac{\tau^2}{\tau^2 + 1}$ and the inadmissible $X$ give us essentially the same estimate. I’m thinking about a huge $\tau$, and I won’t consider the limit, because that would take us “outside of the complete class”, in a sense. Just to be completely clear, the estimator $0.9999999999999999999999 X$ is admissible. Your contrast between “shrinking towards the origin” and “using a very flat prior” is interesting.

13. Zen
Posted May 21, 2013 at 2:31 pm | Permalink

Sorry to bother you again. You know that the James-Stein estimator can be constructed shrinking towards an arbitrary point, and not just the origin. If you wear an applied statistician’s hat, how do you interpret the particular chosen point? How should we choose it? Thanks.

• Posted May 21, 2013 at 3:17 pm | Permalink

By empirical Bayes usually

14. Matías
Posted May 21, 2013 at 6:35 pm | Permalink

I’m not so sure about shrinking together estimates of the moon’s mass, Rome’s temperature, etc. For Stein’s result the means need not be related in any way, but the centered distributions must be equal (having, then, homogeneous variances). Personally, besides needing to be interested in overall error, I would not shrink estimates either if there is no clue that they have a similar distribution (up to position).
Anyway, as you say, regularized regression like the Lasso and every smoothing technique can be thought of as shrinking. So the Stein paradox remains useful, surprising and even slightly unbelievable (sometimes I still battle with it before being convinced again of its truth; for that use, the first comment’s link from Stigler is really very clear).
Thanks for the post

15. Christian Hennig
Posted May 22, 2013 at 9:55 am | Permalink

My intuition is that the first thing to think very carefully about here has to be the squared loss function. Obviously robustniks don’t like it because it is too much dominated by large deviations, or in other words by “how bad an estimator exactly is given that it’s useless anyway” instead of focussing on where the estimator can be of some use. Don’t know how strongly the robustness (or overweighting large deviations) issue is related to the issue here.
I mean, this works for shrinking toward *any* point, so the X is basically biased at random to get its variance down (which has nothing to do with the true value we want to estimate). So the variance seems to be overrated by this loss function.
Does anything like this happen for L1-loss?

Not that I think that shrinking never helps, but…

16. Posted June 10, 2013 at 1:07 am | Permalink

What’s special about 1 and 2 dimensional space that prevents this technique from working there? I know Brown has a giant paper about it which I never read…

17. Mike A
Posted September 9, 2013 at 9:29 pm | Permalink

How true! I was in college, sitting in my room on 11th St in Boulder, CO and read a Sci Am piece on Stein’s Paradox!

(I am also old enough to remember where I was when JFK was shot.) 😦
