<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Normal Deviate</title>
	<atom:link href="http://normaldeviate.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://normaldeviate.wordpress.com</link>
	<description>Thoughts on Statistics and Machine Learning</description>
	<lastBuildDate>Thu, 20 Jun 2013 07:13:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='normaldeviate.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Normal Deviate</title>
		<link>http://normaldeviate.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://normaldeviate.wordpress.com/osd.xml" title="Normal Deviate" />
	<atom:link rel='hub' href='http://normaldeviate.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Stephen Ziliak Rejects Significance Testing</title>
		<link>http://normaldeviate.wordpress.com/2013/06/14/stephen-ziliak-rejects-significance-testing/</link>
		<comments>http://normaldeviate.wordpress.com/2013/06/14/stephen-ziliak-rejects-significance-testing/#comments</comments>
		<pubDate>Fri, 14 Jun 2013 08:23:54 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=429</guid>
		<description><![CDATA[In an opinion piece in the Financial Post, Stephen Ziliak goes into the land of hyperbole, declaring that all significance testing is junk science. It starts like this: I want to believe as much as the next person that particle physicists have discovered a Higgs boson, the so-called &#8220;God particle,&#8221; one with a mass of [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>
In an opinion piece in the <a class="snap_noshots" href="http://opinion.financialpost.com/2013/06/10/junk-science-week-unsignificant-statistics/">Financial Post</a>, Stephen Ziliak goes into the land of hyperbole, declaring that all significance testing is junk science. It starts like this:</p>
<p>
<em>I want to believe as much as the next person that particle physicists have discovered a Higgs boson, the so-called &#8220;God particle,&#8221; one with a mass of 125 gigaelectronic volts (GeV). But so far I do not buy the statistical claims being made about the discovery. Since the claims about the evidence are based on &#8220;statistical significance&#8221; &#8211; that is, on the number of standard deviations by which the observed signal departs from a null hypothesis of &#8220;no difference&#8221; &#8211; the physicists&#8217; claims are not believable. Statistical significance is junk science, and its big piles of nonsense are spoiling the research of more than particle physicists.</em></p>
<p>
He goes on to say:</p>
<p>
<em>Statistical significance stinks. In statistical sciences from economics to medicine, including some parts of physics and chemistry, the ubiquitous &#8220;test&#8221; for &#8220;statistical significance&#8221; cannot, and will not, prove that a Higgs boson exists, any more than it can prove the reality of God, the existence of a good pain pill, or the validity of loose monetary policy.</em></p>
<p>
While I have said many times in this blog that I, too, think significance testing is misused, it is ridiculous to jump to the conclusion that &#8220;Statistical significance is junk science.&#8221; Ironically, Mr. Ziliak is engaging in exactly the same all-or-nothing thinking that he is criticizing.</p>
<p>
Name any statistical method (confidence intervals, Bayesian inference, etc.) and it is easy to find people misusing it. The fact that people misuse or misunderstand a statistical method does not render the method itself dangerous. The blind and misinformed use of <em>any</em> statistical method is dangerous. Statistical ignorance is the enemy. Mr. Ziliak&#8217;s singular focus on the evils of testing seems more cultish than scientific.</p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/06/14/stephen-ziliak-rejects-significance-testing/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>Happy Birthday Normal Deviate</title>
		<link>http://normaldeviate.wordpress.com/2013/06/12/happy-birthday-normal-deviate/</link>
		<comments>http://normaldeviate.wordpress.com/2013/06/12/happy-birthday-normal-deviate/#comments</comments>
		<pubDate>Wed, 12 Jun 2013 11:22:53 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=426</guid>
		<description><![CDATA[Today is the one-year anniversary of this blog. First of all, thanks to all the readers. And special thanks to commenters and guest posters. This seems like a good time to assess whether I have achieved my goals for the blog and to get suggestions on how I might proceed in year two. GOALS. [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>
Today is the one-year anniversary of this blog. First of all, thanks to all the readers. And special thanks to commenters and guest posters. This seems like a good time to assess whether I have achieved my goals for the blog and to get suggestions on how I might proceed in year two.</p>
<p><a href="http://normaldeviate.files.wordpress.com/2013/06/cake.png"><img src="http://normaldeviate.files.wordpress.com/2013/06/cake.png?w=232&#038;h=217" alt="cake" width="232" height="217" class="aligncenter size-full wp-image-427" /></a></p>
<p><b>GOALS.</b> My goals in starting the blog were:</p>
<p>
(1) To discuss random things that I happen to find interesting.</p>
<p>
(2) To discuss ideas at the interface of statistics and machine learning.</p>
<p>
(3) To post every other day.</p>
<p>
Goal 1: Achieved.</p>
<p>
Goal 2: Partially achieved.</p>
<p>
Goal 3: Failed miserably. I was clearly too ambitious. I am lucky if I post once per week.</p>
<p>
<b>THE BEST AND WORST.</b> Favorite post: <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/12/08/flat-priors-in-flatland-stones-paradox/">flatland</a>. I still think this is one of the coolest and deepest paradoxes in statistics.</p>
<p>
Least Favorite Post: <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2013/05/05/aaronson-colt-bayesians-and-frequentists/">This post</a> where I was dismissive of PAC learning. I think I was just in a bad mood.</p>
<p>
<b>LESSON LEARNED.</b> Put &#8220;Bayes&#8221;, &#8220;Frequentist&#8221; or &#8220;p-value&#8221; in the title of a blog post and you get zillions of hits. Put some combination of them and get even more. If I really wanted to get a big readership I would just post exclusively about this stuff. But it would get boring pretty fast.</p>
<p>
<b>GOING FORWARD.</b> I hope to keep posting about once per week. But I don&#8217;t have any plans to make any specific changes to the blog. I am, however, open to suggestions.</p>
<p>
Any suggestions for making the blog more interesting or more fun?</p>
<p>
Any suggestions for inducing more people to write comments?</p>
<p>
Any topics you would like me to cover? (I already promised to do one on Simpson&#8217;s paradox).</p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/06/12/happy-birthday-normal-deviate/feed/</wfw:commentRss>
		<slash:comments>23</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>

		<media:content url="http://normaldeviate.files.wordpress.com/2013/06/cake.png" medium="image">
			<media:title type="html">cake</media:title>
		</media:content>
	</item>
		<item>
		<title>The Value of Adding Randomness</title>
		<link>http://normaldeviate.wordpress.com/2013/06/09/the-value-of-adding-randomness/</link>
		<comments>http://normaldeviate.wordpress.com/2013/06/09/the-value-of-adding-randomness/#comments</comments>
		<pubDate>Sun, 09 Jun 2013 14:23:11 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=424</guid>
		<description><![CDATA[In computer science it is common to use randomized algorithms. The same is true in statistics: there are many ways that adding randomness can make things easier. But the way that randomness enters varies quite a bit across methods. I thought it might be interesting to collect some specific examples of statistical procedures where [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>
In computer science it is common to use <a class="snap_noshots" href="http://en.wikipedia.org/wiki/Randomized_algorithm">randomized algorithms</a>. The same is true in statistics: there are many ways that adding randomness can make things easier. But the way that randomness enters varies quite a bit across methods. I thought it might be interesting to collect some specific examples of statistical procedures where added randomness plays some role. (I am not referring to the randomness inherent in the original data but, rather, to randomness in the statistical method itself.)</p>
<p>
<b>(1) Randomization in causal inference.</b> The mean difference <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> between a treated group and untreated group is not, in general, equal to the causal effect <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' />. (Correlation is not causation.) Moreover, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' /> is not identifiable. But if we randomly assign people to the two groups then, magically, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D%5Calpha%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta =&#92;alpha}' title='{&#92;theta =&#92;alpha}' class='latex' />. This is easily proved using either the directed graph approach to causation or the counterfactual approach: see <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/06/18/48/">here</a> for example. This fact is so elementary that we tend to forget how amazing it is. Of course, this is the reason we spend millions of dollars doing randomized trials.</p>
<p>
(As an aside, some people say that there is no role for randomization in Bayesian inference. In other words, the randomization mechanism plays no role in Bayes&#8217; theorem. But this is not really true. Without randomization, we can indeed derive a posterior for <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' /> but it is highly sensitive to the prior. This is just a restatement of the non-identifiability of <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' />. With randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior.)</p>
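<p>A small simulation makes the point in (1) concrete. This is only a sketch under made-up assumptions: the hidden confounder <code>u</code>, the constant treatment effect of 1, and the function name <code>mean_difference</code> are all hypothetical. When people select their own treatment, the mean difference is badly biased; when a coin flip decides, it lands near the causal effect.</p>

```python
import random
import statistics


def mean_difference(n, randomize, seed=0):
    """Return (alpha_hat, theta) for a toy population of size n.

    alpha_hat is the observed treated-minus-untreated mean difference;
    theta is the average causal effect, known here because we simulate
    both potential outcomes for every person.
    """
    rng = random.Random(seed)
    treated, untreated, effects = [], [], []
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)        # hidden confounder (hypothetical)
        y0 = u + rng.gauss(0.0, 1.0)   # outcome if untreated
        y1 = y0 + 1.0                  # outcome if treated: true effect is 1
        effects.append(y1 - y0)
        if randomize:
            takes_treatment = rng.random() > 0.5   # coin flip
        else:
            takes_treatment = u > 0.0              # self-selection on u
        if takes_treatment:
            treated.append(y1)
        else:
            untreated.append(y0)
    alpha_hat = statistics.fmean(treated) - statistics.fmean(untreated)
    theta = statistics.fmean(effects)
    return alpha_hat, theta


alpha_obs, theta_obs = mean_difference(20000, randomize=False)
alpha_rand, theta_rand = mean_difference(20000, randomize=True)
print("self-selected:", alpha_obs, "causal effect:", theta_obs)
print("randomized:   ", alpha_rand, "causal effect:", theta_rand)
```

<p>In this setup the self-selected mean difference is about 2.6 in expectation, far from the causal effect of 1, while the randomized one should be close to 1.</p>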
<p>
<b>(2) Permutation Tests.</b> If <img src='http://s0.wp.com/latex.php?latex=%7BX_1%2C%5Cldots%2C+X_n+%5Csim+P%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X_1,&#92;ldots, X_n &#92;sim P}' title='{X_1,&#92;ldots, X_n &#92;sim P}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7BY_1%2C%5Cldots%2C+Y_m+%5Csim+Q%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{Y_1,&#92;ldots, Y_m &#92;sim Q}' title='{Y_1,&#92;ldots, Y_m &#92;sim Q}' class='latex' /> and you want to test <img src='http://s0.wp.com/latex.php?latex=%7BH_0%3A+P%3DQ%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{H_0: P=Q}' title='{H_0: P=Q}' class='latex' /> versus <img src='http://s0.wp.com/latex.php?latex=%7BH_1%3A+P%5Cneq+Q%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{H_1: P&#92;neq Q}' title='{H_1: P&#92;neq Q}' class='latex' />, you can get an exact, distribution-free test by using the permutation method. See <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/07/14/modern-two-sample-tests/">here</a>. We rarely search over all permutations. Instead, we randomly select a large number of permutations. The result is still exact (i.e. the p-value is sub-uniform under <img src='http://s0.wp.com/latex.php?latex=%7BH_0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{H_0}' title='{H_0}' class='latex' />.)</p>
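<p>Here is a minimal sketch of the randomly sampled permutation test in (2). The test statistic (absolute difference of means), the number of permutations, and the function name <code>permutation_p_value</code> are assumptions chosen for illustration. Counting the observed statistic as one of the permutations, which gives the extra 1 in the numerator and denominator, is a standard device for keeping the Monte Carlo p-value valid.</p>

```python
import random
import statistics


def permutation_p_value(x, y, num_permutations=2000, seed=0):
    """Monte Carlo permutation p-value for H0: both samples share one distribution."""
    rng = random.Random(seed)

    def stat(a, b):
        # Illustrative choice of statistic: absolute difference of means.
        return abs(statistics.fmean(a) - statistics.fmean(b))

    observed = stat(x, y)
    pooled = list(x) + list(y)
    n = len(x)
    count = 0
    for _ in range(num_permutations):
        rng.shuffle(pooled)   # a random relabelling of the pooled data
        if stat(pooled[:n], pooled[n:]) >= observed:
            count += 1
    # The observed labelling counts as one permutation.
    return (1 + count) / (1 + num_permutations)


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(30)]
y = [rng.gauss(2.0, 1.0) for _ in range(30)]
print(permutation_p_value(x, y))
```

<p>With the two made-up samples above, whose means differ by two standard deviations, the p-value should come out very small.</p>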
<p>
<b>(3) The Bootstrap.</b> I discussed the bootstrap <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2013/01/19/bootstrapping-and-subsampling-part-i/">here</a>. Basically, to compute a confidence interval, we approximate the distribution
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++L_n%3D%5Cmathbb%7BP%7D%28+%5Csqrt%7Bn%7D%28%5Chat%5Ctheta+-+%5Ctheta%29+%5Cleq+t%29+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  L_n=&#92;mathbb{P}( &#92;sqrt{n}(&#92;hat&#92;theta - &#92;theta) &#92;leq t) ' title='&#92;displaystyle  L_n=&#92;mathbb{P}( &#92;sqrt{n}(&#92;hat&#92;theta - &#92;theta) &#92;leq t) ' class='latex' /></p>
<p> with the conditional distribution
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Chat+L_n%3D%5Cmathbb%7BP%7D%28+%5Csqrt%7Bn%7D%28%5Chat%5Ctheta%5E%2A+-+%5Chat%5Ctheta%29+%5Cleq+t%5C+%7C+X_1%2C%5Cldots%2C+X_n%29+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;hat L_n=&#92;mathbb{P}( &#92;sqrt{n}(&#92;hat&#92;theta^* - &#92;hat&#92;theta) &#92;leq t&#92; | X_1,&#92;ldots, X_n) ' title='&#92;displaystyle  &#92;hat L_n=&#92;mathbb{P}( &#92;sqrt{n}(&#92;hat&#92;theta^* - &#92;hat&#92;theta) &#92;leq t&#92; | X_1,&#92;ldots, X_n) ' class='latex' /></p>
<p> where <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta+%3D+g%28X_1%2C%5Cldots%2C+X_n%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta = g(X_1,&#92;ldots, X_n)}' title='{&#92;hat&#92;theta = g(X_1,&#92;ldots, X_n)}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta%5E%2A+%3D+g%28X_1%5E%2A%2C%5Cldots%2C+X_n%5E%2A%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta^* = g(X_1^*,&#92;ldots, X_n^*)}' title='{&#92;hat&#92;theta^* = g(X_1^*,&#92;ldots, X_n^*)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7BX_1%5E%2A%2C%5Cldots%2C+X_n%5E%2A%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X_1^*,&#92;ldots, X_n^*}' title='{X_1^*,&#92;ldots, X_n^*}' class='latex' /> is a sample from the empirical distribution. But the distribution <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+L_n%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat L_n}' title='{&#92;hat L_n}' class='latex' /> is intractable. Instead, we approximate it by repeatedly sampling from the empirical distribution function. This makes otherwise intractable confidence intervals trivial to compute.</p>
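<p>The resampling step in (3) can be sketched as follows. This is one common way of turning the simulated approximation to the conditional distribution into an interval, not necessarily the exact recipe from the linked post. The function name <code>bootstrap_interval</code>, the number of resamples, and the crude quantile rule are assumptions.</p>

```python
import math
import random
import statistics


def bootstrap_interval(data, estimator, level=0.95, num_resamples=2000, seed=0):
    """Approximate confidence interval from resampled roots.

    Each root is sqrt(n) * (theta_star - theta_hat), where theta_star is the
    estimator applied to a sample of size n drawn with replacement from data.
    """
    rng = random.Random(seed)
    n = len(data)
    theta_hat = estimator(data)
    roots = []
    for _ in range(num_resamples):
        resample = rng.choices(data, k=n)   # draw from the empirical distribution
        roots.append(math.sqrt(n) * (estimator(resample) - theta_hat))
    roots.sort()
    tail = (1.0 - level) / 2.0
    lo_index = int(math.floor(tail * (num_resamples - 1)))
    hi_index = int(math.ceil((1.0 - tail) * (num_resamples - 1)))
    q_lo, q_hi = roots[lo_index], roots[hi_index]
    # Invert q_lo, q_hi bounds on sqrt(n) * (theta_hat - theta) to bound theta.
    return theta_hat - q_hi / math.sqrt(n), theta_hat - q_lo / math.sqrt(n)


rng = random.Random(1)
sample = [rng.gauss(5.0, 1.0) for _ in range(200)]
lower, upper = bootstrap_interval(sample, statistics.median)
print(lower, upper)
```

<p>The median is used here because its sampling distribution is awkward to write down, yet the interval costs nothing more than a loop.</p>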
<p>
<b>(4) <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' />-means++</b>. Minimizing the objective function in <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' />-means clustering is NP-hard. Remarkably, as I discussed <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/09/30/the-remarkable-k-means/">here</a>, the <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' />-means++ algorithm uses a careful randomization method for choosing starting values and gets close to the minimum with high probability.</p>
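<p>The randomized choice of starting values in (4) can be sketched in a few lines. The first center is drawn uniformly, and each later center is drawn with probability proportional to its squared distance from the nearest center chosen so far. The function names and the toy data are hypothetical, and the sketch assumes the data contain at least <code>k</code> distinct points.</p>

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_seeds(points, k, seed=0):
    """Pick k starting centers by squared-distance-weighted sampling."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]   # first center: uniform at random
    while k > len(centers):
        # Weight each point by squared distance to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


# Toy data: three tight, well-separated groups.
points = [(0, 0)] * 5 + [(10, 0)] * 5 + [(0, 10)] * 5
print(kmeans_pp_seeds(points, 3))
```

<p>Points that coincide with an existing center get weight zero, so in this toy example the three seeds always fall in three different groups. The seeds are then handed to the usual iterations as starting values.</p>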
<p>
<b>(5) Cross-Validation.</b> Some forms of cross-validation involve splitting the data randomly into two or more groups. We use some of the groups for fitting and the rest for testing. Some people seem bothered by the randomness this introduces. But it makes risk estimation easy and accurate.</p>
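<p>A minimal sketch of the random splitting in (5), assuming a generic <code>fit</code> function and a per-observation <code>loss</code>. The names, the choice of five folds, and the toy predictor (the training mean under squared error) are all illustrative.</p>

```python
import random
import statistics


def cross_validated_risk(data, fit, loss, num_folds=5, seed=0):
    """Estimate prediction risk by random K-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)   # this shuffle is the added randomness
    folds = [shuffled[i::num_folds] for i in range(num_folds)]
    losses = []
    for i, held_out in enumerate(folds):
        # Fit on every fold except the one held out for testing.
        training = [z for j, fold in enumerate(folds) if j != i for z in fold]
        model = fit(training)
        losses.extend(loss(model, z) for z in held_out)
    return statistics.fmean(losses)


def squared_error(prediction, z):
    return (prediction - z) ** 2


rng = random.Random(1)
sample = [rng.gauss(3.0, 2.0) for _ in range(500)]
print(cross_validated_risk(sample, statistics.fmean, squared_error))
```

<p>For this toy predictor the estimated risk should be close to the variance of the simulated data, which is 4.</p>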
<p>
<b>(6) MCMC.</b> An obvious and common use of randomness is random sampling from a posterior distribution, usually by way of Markov Chain Monte Carlo. This can dramatically simplify Bayesian inference.</p>
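<p>As an illustration of (6), here is a sketch of a random-walk Metropolis sampler, one of the simplest MCMC methods. The data, the normal prior with standard deviation 10, the step size, and the burn-in length are all made up. The model is chosen so that the posterior mean is known in closed form and the output can be checked.</p>

```python
import math
import random
import statistics


def metropolis(log_density, start, num_samples, step=1.0, seed=0):
    """Random-walk Metropolis chain targeting exp(log_density)."""
    rng = random.Random(seed)
    current = start
    current_log = log_density(current)
    chain = []
    for _ in range(num_samples):
        proposal = current + rng.gauss(0.0, step)
        proposal_log = log_density(proposal)
        # Accept with probability min(1, density(proposal) / density(current)).
        if proposal_log - current_log > math.log(1.0 - rng.random()):
            current, current_log = proposal, proposal_log
        chain.append(current)   # a rejected proposal repeats the current state
    return chain


data = [1.2, 0.7, 1.9, 1.1, 0.4]   # made-up observations, modelled as N(theta, 1)


def log_posterior(theta):
    log_prior = -0.5 * theta ** 2 / 100.0   # N(0, 10^2) prior, up to a constant
    log_likelihood = -0.5 * sum((x - theta) ** 2 for x in data)
    return log_prior + log_likelihood


chain = metropolis(log_posterior, start=0.0, num_samples=20000)
posterior_mean = statistics.fmean(chain[2000:])   # drop burn-in (arbitrary length)
exact_mean = sum(data) / (len(data) + 0.01)       # conjugate closed form
print(posterior_mean, exact_mean)
```

<p>The two printed numbers should agree to about two decimal places. The same sampler works unchanged when no closed form is available, which is the whole attraction.</p>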
<p>
These are the first few things that came to my mind. Are there others I should add to the list?</p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/06/09/the-value-of-adding-randomness/feed/</wfw:commentRss>
		<slash:comments>19</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>Steve Marron on &#8220;Big Data&#8221;</title>
		<link>http://normaldeviate.wordpress.com/2013/05/28/steve-marron-on-big-data/</link>
		<comments>http://normaldeviate.wordpress.com/2013/05/28/steve-marron-on-big-data/#comments</comments>
		<pubDate>Tue, 28 May 2013 18:14:08 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=422</guid>
		<description><![CDATA[Steve Marron is a statistician at UNC. In his younger days he was well known for his work on nonparametric theory. These days he works on a number of interesting things including analysis of structured objects (like tree-structured data) and high dimensional theory. Steve sent me a thoughtful email the other day about &#8220;Big Data&#8221; [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>
<a class="snap_noshots" href="http://www.unc.edu/~marron/marron.html">Steve Marron</a> is a statistician at UNC. In his younger days he was well known for his work on nonparametric theory. These days he works on a number of interesting things including analysis of structured objects (like tree-structured data) and high dimensional theory.</p>
<p>
Steve sent me a thoughtful email the other day about &#8220;Big Data&#8221; and, with his permission, I am posting it here.</p>
<p>
I agree with pretty much everything he says. I especially like these two gems: First, &#8220;a better funded statistical community would be a more efficient way to get such things done without all this highly funded re-discovery.&#8221; And second: &#8220;I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas.&#8221;</p>
<p>
Enough from me. Here is Steve&#8217;s comment:</p>
<p align="center"> Guest Post, by Steve Marron </p>
<p>
My colleagues and I have lately been discussing &#8220;Big Data&#8221;, and <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/">your blog</a> was mentioned.</p>
<p>
Not surprisingly you&#8217;ve got some interesting ideas there. Here come some of my own thoughts on the matter.</p>
<p>
First, should one be pessimistic? I am not so sure. For me exhibit A is my own colleagues. When such things came up in the past (and I believe that this HAS happened, see the discussion below) my (at that time senior) colleagues were rather arrogantly ignorant. Issues such as you are raising were blatantly pooh-poohed, if they were ever considered at all. However, this time around, I am seeing a far different picture. My now mostly junior colleagues are taking this very seriously, and we are currently engaged in major discussion as to what we are going to do about this in very concrete terms such as course offerings, etc. In addition, while some of my colleagues think in terms of labels such as &#8220;applied statistician&#8221;, &#8220;theoretical statistician&#8221; and &#8220;probabilist&#8221;, everybody across the board is jumping in. Perhaps this is largely driven by an understanding that universities themselves are in a massive state of flux, and that one had better be a player, or else be totally left behind. But it sure looks better than some of the attitudes I saw earlier on in my career.</p>
<p>
Now about the bigger picture. I think there is an important history here that you are totally ignoring. In particular, I view &#8220;Big Data&#8221; as just the latest manifestation of a cycle that has been rolling along for quite a long time. Actually I have been predicting the advent of something of this type for quite a while (although I could not predict the name, nor the central idea).</p>
<p>
Here comes a personally slanted (certainly over-simplified) view of what I mean here. Think back on the following set of &#8220;exciting breakthroughs&#8221;:</p>
<p>
&#8211; Statistical Pattern Recognition<br />
 &#8211; Artificial Intelligence<br />
 &#8211; Neural Nets<br />
 &#8211; Data Mining<br />
 &#8211; Machine Learning</p>
<p>
Each of these was started up in EE/CS. Each was the fashionable hot topic (considered very sexy and fresh by funding agencies) of its day. Each was initially based on usually one really cool new idea, which was usually far outside of what folks working in conventional statistics had any hope (well certainly no encouragement from the statistical community) of dreaming up. I think each attracted much more NSF funding than all of statistics ever did, at any given time. A large share of the funding was used for re-invention of ideas that already existed in statistics (but would get a sexy new name). As each new field matured, there came a recognition that in fact much was to be gained by studying connections to statistics, so there was then lots of work &#8220;creating connections&#8221;.</p>
<p>
Now given the timing of these, and how they each have played out, over time, it had been clear to me for some time that we were ripe for the next one. So the current advent of Big Data is no surprise at all. Frankly I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets or the kernel embedding idea that drove machine learning). But I suspect that the need for such a thing to happen to keep this community properly funded has overcome the need for an exciting new idea. Instead of new methodology, this seems to be more driven by parallelism and cloud computing. Also I seem to see larger applied math buy-in than there ever was in the past. Maybe this is the new parallel to how optimization has appeared in a major way in machine learning.</p>
<p>
Next, what should we do about it? Number one of course is to get engaged, and as noted above, I am heartened by what I see at my own local level.</p>
<p>
I generally agree with your comment about funding, and I can think of ways to sell statistics. For example, we should make the above history clear to funding agencies, and point out that in each case there has been a huge waste of resources on people doing a large amount of rediscovery. In most of those areas, by the time the big funding hits, the main ideas are already developed so the funding really just keeps lots of journeymen doing lots of very low impact work, with large amounts of rediscovery of things already known in the statistical community. The sell could be that a better funded statistical community would be a more efficient way to get such things done without all this highly funded re-discovery.</p>
<p>
But before making such a case, I suggest that it is important to face up to our own shortcomings, from the perspective of funding agencies. I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas. While I love working with statistical concepts, and have a personal love of new ideas, it has not escaped my notice that I have always been in something of a minority in that regard. We not only do not choose to reward creativity, we often tend to squelch it. I still remember the first time I applied for an NSF grant. I was ambitious, and the reviews I got back said the problem was interesting, but I had no track record, the reviewers were skeptical of me, and I did not get funded. This was especially frustrating as by the time I got those reviews I had solved the stated problem. It would be great if that could be regarded as an anomaly of the past when folks may have been less enlightened than now. However, I have direct evidence that this is not true. Unfortunately exactly that cycle repeated itself for one of my former students on this very last NSF cycle.</p>
<p>
What should we do to deserve more funding? Somehow we need a bigger tent, which is big enough to include the creative folks who will be coming up with the next really big ideas (big enough to include the folks who are going to spawn the next new community, such as those listed above). This is where research funding should really be going to be most effective.</p>
<p>
Maybe more important, we need to find a way to create a statistical culture that reveres new ideas, instead of fearing and shunning them.</p>
<p>
Best, Steve</p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/05/28/steve-marron-on-big-data/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>Brad Efron, Tornadoes, and Diane Sawyer</title>
		<link>http://normaldeviate.wordpress.com/2013/05/25/brad-efron-tornadoes-and-diane-sawyer/</link>
		<comments>http://normaldeviate.wordpress.com/2013/05/25/brad-efron-tornadoes-and-diane-sawyer/#comments</comments>
		<pubDate>Sat, 25 May 2013 17:03:30 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=420</guid>
		<description><![CDATA[Brad Efron wrote to me and posed an interesting statistical question: &#8220;Last Wednesday Diane Sawyer interviewed an Oklahoma woman who twice had had her home destroyed by a force-4 tornado. &#8220;A one in a hundred-trillion chance!&#8221; said Diane. ABC showed a nice map with the current storm&#8217;s track of destruction shaded in, about 18 miles [&#8230;]]]></description>
				<content:encoded><![CDATA[<p>Brad Efron wrote to me and posed an interesting statistical question:</p>
<p>     &#8220;Last Wednesday Diane Sawyer interviewed an Oklahoma woman who twice<br />
     had had her home destroyed by a force-4 tornado. &#8220;A one in a<br />
     hundred-trillion chance!&#8221; said Diane. ABC showed a nice map with the<br />
     current storm&#8217;s track of destruction shaded in, about 18 miles long<br />
     and 1 mile wide. Then the track of the 1999 storm was superimposed,<br />
     about the same dimensions, the two intersecting in a roughly 1 square<br />
     mile lozenge. Diane added that the woman &#8220;lives right in the center of<br />
     Tornado alley.&#8221;</p>
<p>     Question: what odds should Diane have quoted? (and for that matter,<br />
     what is the right event to consider?)</p>
<p>     Regards, Brad&#8221;</p>
<p>Anyone have a good answer?</p>
<p>By the way, I should add that Diane Sawyer has a history of<br />
broadcasting stories filled with numerical illiteracy.  She did a long<br />
series opposing the use of lean finely textured beef (LFTB), also<br />
known as &#8220;pink slime.&#8221;  In fact, LFTB is perfectly healthy; its use<br />
requires slaughtering many fewer cows each year and makes meat cheaper<br />
for poor people.  The series was denounced by many scientists and even<br />
environmentalists.  ABC is being sued for over one billion dollars.</p>
<p>She also did a long series on &#8220;Buy America&#8221; encouraging people to<br />
shun cheap goods from abroad. This is like telling people who live in<br />
Cleveland to shun buying any products and services not produced in<br />
Cleveland (including not watching ABC news which is produced in New<br />
York, or reading statistics papers not written in Cleveland.)  This<br />
high-school level mistake in economics is another example of Ms.<br />
Sawyer&#8217;s numerical illiteracy.</p>
<p>But I digress.</p>
<p>Let&#8217;s return to Brad&#8217;s question: </p>
<p>What is a good way to compute the odds that someone has their house<br />
destroyed by a tornado twice?</p>
<p>I open it up for discussion.</p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/05/25/brad-efron-tornadoes-and-diane-sawyer/feed/</wfw:commentRss>
		<slash:comments>13</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>STEIN&#8217;S PARADOX</title>
		<link>http://normaldeviate.wordpress.com/2013/05/18/steins-paradox/</link>
		<comments>http://normaldeviate.wordpress.com/2013/05/18/steins-paradox/#comments</comments>
		<pubDate>Sat, 18 May 2013 21:00:16 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=417</guid>
		<description><![CDATA[STEIN&#8217;S PARADOX Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein&#8217;s paradox. When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don&#8217;t remember. My first memory [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=normaldeviate.wordpress.com&#038;blog=36942929&#038;post=417&#038;subd=normaldeviate&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><p align="center"> STEIN&#8217;S PARADOX </p>
<p>
Something that is well known in the statistics world but perhaps less well known in the machine learning world is Stein&#8217;s paradox.</p>
<p>
When I was growing up, people used to say: do you remember where you were when you heard that JFK died? (I was three, so I don&#8217;t remember. My first memory is watching the Beatles on Ed Sullivan.)</p>
<p>
Similarly, statisticians used to say: do you remember where you were when you heard about Stein&#8217;s paradox? That&#8217;s how surprising it was. (I don&#8217;t remember since I wasn&#8217;t born yet.)</p>
<p>
Here is the paradox. Let <img src='http://s0.wp.com/latex.php?latex=%7BX+%5Csim+N%28%5Ctheta%2C1%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X &#92;sim N(&#92;theta,1)}' title='{X &#92;sim N(&#92;theta,1)}' class='latex' />. Define the risk of an estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta}' title='{&#92;hat&#92;theta}' class='latex' /> to be
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++R_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29+%3D+%5Cmathbb%7BE%7D_%5Ctheta+%28%5Chat%5Ctheta-%5Ctheta%29%5E2+%3D+%5Cint+%28%5Chat%5Ctheta%28x%29+-+%5Ctheta%29%5E2+p%28x%3B%5Ctheta%29+dx.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta (&#92;hat&#92;theta-&#92;theta)^2 = &#92;int (&#92;hat&#92;theta(x) - &#92;theta)^2 p(x;&#92;theta) dx. ' title='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta (&#92;hat&#92;theta-&#92;theta)^2 = &#92;int (&#92;hat&#92;theta(x) - &#92;theta)^2 p(x;&#92;theta) dx. ' class='latex' /></p>
<p> An estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta}' title='{&#92;hat&#92;theta}' class='latex' /> is <em>inadmissible</em> if there is another estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%5E%2A%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta^*}' title='{&#92;theta^*}' class='latex' /> with smaller risk. In other words, if
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++R_%7B%5Ctheta%5E%2A%7D%28%5Ctheta%29+%5Cleq+R_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29+%5C+%5C+%7B%5Crm+for%5C+all%5C+%7D%5Ctheta+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  R_{&#92;theta^*}(&#92;theta) &#92;leq R_{&#92;hat&#92;theta}(&#92;theta) &#92; &#92; {&#92;rm for&#92; all&#92; }&#92;theta ' title='&#92;displaystyle  R_{&#92;theta^*}(&#92;theta) &#92;leq R_{&#92;hat&#92;theta}(&#92;theta) &#92; &#92; {&#92;rm for&#92; all&#92; }&#92;theta ' class='latex' /></p>
<p> with strict inequality for at least one <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' />.</p>
<p>
Question: Is <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+%5Ctheta+%5Cequiv+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat &#92;theta &#92;equiv X}' title='{&#92;hat &#92;theta &#92;equiv X}' class='latex' /> admissible?<br />
 Answer: Yes.</p>
<p>
Now suppose that <img src='http://s0.wp.com/latex.php?latex=%7BX+%5Csim+N%28%5Ctheta%2CI%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X &#92;sim N(&#92;theta,I)}' title='{X &#92;sim N(&#92;theta,I)}' class='latex' /> where now <img src='http://s0.wp.com/latex.php?latex=%7BX%3D%28X_1%2CX_2%29%5ET%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X=(X_1,X_2)^T}' title='{X=(X_1,X_2)^T}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D+%28%5Ctheta_1%2C%5Ctheta_2%29%5ET%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta = (&#92;theta_1,&#92;theta_2)^T}' title='{&#92;theta = (&#92;theta_1,&#92;theta_2)^T}' class='latex' /> and
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++R_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29+%3D+%5Cmathbb%7BE%7D_%5Ctheta+%7C%7C%5Chat%5Ctheta+-+%5Ctheta%7C%7C%5E2.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta ||&#92;hat&#92;theta - &#92;theta||^2. ' title='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta ||&#92;hat&#92;theta - &#92;theta||^2. ' class='latex' /></p>
<p> Question: Is <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+%5Ctheta+%5Cequiv+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat &#92;theta &#92;equiv X}' title='{&#92;hat &#92;theta &#92;equiv X}' class='latex' /> admissible?<br />
 Answer: Yes.</p>
<p>
Now suppose that <img src='http://s0.wp.com/latex.php?latex=%7BX+%5Csim+N%28%5Ctheta%2CI%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X &#92;sim N(&#92;theta,I)}' title='{X &#92;sim N(&#92;theta,I)}' class='latex' /> where now <img src='http://s0.wp.com/latex.php?latex=%7BX%3D%28X_1%2CX_2%2CX_3%29%5ET%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X=(X_1,X_2,X_3)^T}' title='{X=(X_1,X_2,X_3)^T}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D+%28%5Ctheta_1%2C%5Ctheta_2%2C%5Ctheta_3%29%5ET%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta = (&#92;theta_1,&#92;theta_2,&#92;theta_3)^T}' title='{&#92;theta = (&#92;theta_1,&#92;theta_2,&#92;theta_3)^T}' class='latex' /> and
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++R_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29+%3D+%5Cmathbb%7BE%7D_%5Ctheta+%7C%7C%5Chat%5Ctheta+-+%5Ctheta%7C%7C%5E2.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta ||&#92;hat&#92;theta - &#92;theta||^2. ' title='&#92;displaystyle  R_{&#92;hat&#92;theta}(&#92;theta) = &#92;mathbb{E}_&#92;theta ||&#92;hat&#92;theta - &#92;theta||^2. ' class='latex' /></p>
<p> Question: Is <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+%5Ctheta+%5Cequiv+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat &#92;theta &#92;equiv X}' title='{&#92;hat &#92;theta &#92;equiv X}' class='latex' /> admissible?<br />
 Answer: No!</p>
<p>
If you don&#8217;t find this surprising then either you&#8217;ve heard this before or you&#8217;re not thinking hard enough. Keep in mind that the coordinates of the vector <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> are independent. And the <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_j%27s%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_j&#039;s}' title='{&#92;theta_j&#039;s}' class='latex' /> could have nothing to do with each other. For example, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_1+%3D+%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_1 = }' title='{&#92;theta_1 = }' class='latex' /> mass of the moon, <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_2+%3D+%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_2 = }' title='{&#92;theta_2 = }' class='latex' /> price of coffee and <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_3+%3D+%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_3 = }' title='{&#92;theta_3 = }' class='latex' /> temperature in Rome.</p>
<p>
In general, <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta+%5Cequiv+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta &#92;equiv X}' title='{&#92;hat&#92;theta &#92;equiv X}' class='latex' /> is inadmissible if the dimension <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' /> of <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' /> satisfies <img src='http://s0.wp.com/latex.php?latex=%7Bk+%5Cgeq+3%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k &#92;geq 3}' title='{k &#92;geq 3}' class='latex' />.</p>
<p>
The proof that <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is inadmissible is based on defining an explicit estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%5E%2A%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta^*}' title='{&#92;theta^*}' class='latex' /> that has smaller risk than <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />. For example, the <a class="snap_noshots" href="http://en.wikipedia.org/wiki/James-Stein_estimator">James-Stein estimator</a> is
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Ctheta%5E%2A+%3D+%5Cleft%28+1+-+%5Cfrac%7Bk-2%7D%7B%7C%7CX%7C%7C%5E2%7D%5Cright%29+X.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;theta^* = &#92;left( 1 - &#92;frac{k-2}{||X||^2}&#92;right) X. ' title='&#92;displaystyle  &#92;theta^* = &#92;left( 1 - &#92;frac{k-2}{||X||^2}&#92;right) X. ' class='latex' /></p>
<p> It can be shown that the risk of this estimator is strictly smaller than the risk of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />, for all <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' />. This implies that <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is inadmissible. If you want to see the detailed calculations, have a look at Iain Johnstone&#8217;s material, which he makes freely available at <a class="snap_noshots" href="http://www-stat.stanford.edu">this site</a>.</p>
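<p>
The claim can be checked numerically. The following is only a minimal Monte Carlo sketch in Python with NumPy, not a proof: the value of theta, the number of draws and the seed are arbitrary illustrative choices, and the function names are hypothetical.</p>
<pre>
```python
import numpy as np

def james_stein(x):
    """Apply the James-Stein shrinkage factor 1 - (k-2)/||x||^2 to each row of x."""
    k = x.shape[1]
    norm_sq = np.sum(x ** 2, axis=1, keepdims=True)
    return (1.0 - (k - 2) / norm_sq) * x

def monte_carlo_risk(theta, n_draws=200_000, seed=0):
    """Estimate the squared-error risks of X and of James-Stein at a fixed theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    # Each row is one draw of X ~ N(theta, I).
    x = theta + rng.standard_normal((n_draws, theta.size))
    risk_x = np.mean(np.sum((x - theta) ** 2, axis=1))
    risk_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
    return risk_x, risk_js

# Three unrelated means (arbitrary values chosen for illustration).
risk_x, risk_js = monte_carlo_risk([1.0, -2.0, 0.5])
print(risk_x, risk_js)
```
</pre>
<p>
With k = 3 the estimated risk of X should come out close to 3, and the estimated risk of the James-Stein estimator should come out below it, up to simulation error.</p>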
<p>
Note that the James-Stein estimator shrinks <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> towards the origin. (In fact, you can shrink towards any point; there is nothing special about the origin.) This can be viewed as an empirical Bayes estimator where <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' /> has a prior of the form <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Csim+N%280%2C%5Ctau%5E2%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;sim N(0,&#92;tau^2)}' title='{&#92;theta &#92;sim N(0,&#92;tau^2)}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctau%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;tau}' title='{&#92;tau}' class='latex' /> is estimated from the data. The Bayes explanation gives some nice intuition. But it&#8217;s also a bit misleading. The Bayes explanation suggests we are shrinking the means together because we expect them <em>a priori</em> to be similar. But the paradox holds even when the means are not related in any way.</p>
<p>
Some intuition can be gained by thinking about function estimation. Consider a smooth function <img src='http://s0.wp.com/latex.php?latex=%7Bf%28x%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{f(x)}' title='{f(x)}' class='latex' />. Suppose we have data
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++Y_i+%3D+f%28x_i%29+%2B+%5Cepsilon_i+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  Y_i = f(x_i) + &#92;epsilon_i ' title='&#92;displaystyle  Y_i = f(x_i) + &#92;epsilon_i ' class='latex' /></p>
<p> where <img src='http://s0.wp.com/latex.php?latex=%7Bx_i+%3D+i%2Fn%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{x_i = i/n}' title='{x_i = i/n}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7B%5Cepsilon_i+%5Csim+N%280%2C1%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;epsilon_i &#92;sim N(0,1)}' title='{&#92;epsilon_i &#92;sim N(0,1)}' class='latex' />. Let us expand <img src='http://s0.wp.com/latex.php?latex=%7Bf%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{f}' title='{f}' class='latex' /> in an orthonormal basis: <img src='http://s0.wp.com/latex.php?latex=%7Bf%28x%29+%3D+%5Csum_j+%5Ctheta_j+%5Cpsi_j%28x%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{f(x) = &#92;sum_j &#92;theta_j &#92;psi_j(x)}' title='{f(x) = &#92;sum_j &#92;theta_j &#92;psi_j(x)}' class='latex' />. To estimate <img src='http://s0.wp.com/latex.php?latex=%7Bf%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{f}' title='{f}' class='latex' /> we need only estimate the coefficients <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_1%2C%5Ctheta_2%2C%5Cldots%2C%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_1,&#92;theta_2,&#92;ldots,}' title='{&#92;theta_1,&#92;theta_2,&#92;ldots,}' class='latex' />. Note that <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta_j+%3D+%5Cint+f%28x%29+%5Cpsi_j%28x%29+dx%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta_j = &#92;int f(x) &#92;psi_j(x) dx}' title='{&#92;theta_j = &#92;int f(x) &#92;psi_j(x) dx}' class='latex' />. This suggests the estimator
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Chat%5Ctheta_j+%3D+%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En+Y_i+%5Cpsi_j%28x_i%29.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;hat&#92;theta_j = &#92;frac{1}{n}&#92;sum_{i=1}^n Y_i &#92;psi_j(x_i). ' title='&#92;displaystyle  &#92;hat&#92;theta_j = &#92;frac{1}{n}&#92;sum_{i=1}^n Y_i &#92;psi_j(x_i). ' class='latex' /></p>
<p> But the resulting function estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+f%28x%29+%3D+%5Csum_j+%5Chat%5Ctheta_j+%5Cpsi_j%28x%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat f(x) = &#92;sum_j &#92;hat&#92;theta_j &#92;psi_j(x)}' title='{&#92;hat f(x) = &#92;sum_j &#92;hat&#92;theta_j &#92;psi_j(x)}' class='latex' /> is useless because it is too wiggly. The solution is to smooth the estimator; this corresponds to shrinking the raw estimates <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta_j%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta_j}' title='{&#92;hat&#92;theta_j}' class='latex' /> towards 0. This adds bias but reduces variance. In other words, the familiar process of smoothing, which we use all the time for function estimation, can be seen as &#8220;shrinking estimates towards 0&#8221; as with the James-Stein estimator.</p>
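<p>
Here is a small Python sketch of this idea. Everything specific in it is an assumption made for illustration: the cosine basis, the test function f, the number of coefficients, and the shrinkage weights (which shrink high-frequency coefficients more strongly) are hypothetical choices, not the only way to smooth.</p>
<pre>
```python
import numpy as np

def cosine_basis(j, x):
    """Orthonormal cosine basis on [0, 1]: psi_0 = 1, psi_j = sqrt(2) cos(pi j x)."""
    return np.ones_like(x) if j == 0 else np.sqrt(2.0) * np.cos(np.pi * j * x)

def f(t):
    """A hypothetical smooth function with only two nonzero coefficients."""
    return 2.0 * cosine_basis(1, t) - cosine_basis(3, t)

rng = np.random.default_rng(0)
n, n_coef = 200, 50
x = np.arange(1, n + 1) / n                  # design points x_i = i/n
y = f(x) + rng.standard_normal(n)            # Y_i = f(x_i) + eps_i

# Raw estimates: theta_hat_j = (1/n) sum_i Y_i psi_j(x_i)
theta_hat = np.array([np.mean(y * cosine_basis(j, x)) for j in range(n_coef)])

# Assumed shrinkage weights in (0, 1]; larger j is pulled harder towards 0.
weights = 1.0 / (1.0 + (np.arange(n_coef) / 5.0) ** 4)

# Compare both function estimates with the truth on a fine grid.
grid = np.linspace(0.0, 1.0, 1001)
basis_on_grid = np.array([cosine_basis(j, grid) for j in range(n_coef)])
raw_fit = theta_hat @ basis_on_grid
shrunk_fit = (weights * theta_hat) @ basis_on_grid

raw_error = np.mean((raw_fit - f(grid)) ** 2)
shrunk_error = np.mean((shrunk_fit - f(grid)) ** 2)
print(raw_error, shrunk_error)
```
</pre>
<p>
The raw fit pays a variance of roughly 1/n for every coefficient it keeps, so its error grows with the number of coefficients. The shrunken fit accepts a little bias in exchange for discarding most of that variance.</p>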
<p>
If you are familiar with minimax theory, you might find the Stein paradox a bit confusing. The estimator <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta+%3D+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta = X}' title='{&#92;hat&#92;theta = X}' class='latex' /> is minimax, that is, its risk achieves the minimax bound
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Cinf_%7B%5Chat%5Ctheta%7D%5Csup_%5Ctheta+R_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;inf_{&#92;hat&#92;theta}&#92;sup_&#92;theta R_{&#92;hat&#92;theta}(&#92;theta). ' title='&#92;displaystyle  &#92;inf_{&#92;hat&#92;theta}&#92;sup_&#92;theta R_{&#92;hat&#92;theta}(&#92;theta). ' class='latex' /></p>
<p> This suggests that <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' /> is a good estimator. But Stein&#8217;s paradox tells us that <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta+%3D+X%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta = X}' title='{&#92;hat&#92;theta = X}' class='latex' /> is inadmissible which suggests that it is a bad estimator.</p>
<p>
Is there a contradiction here?</p>
<p>
No. The risk <img src='http://s0.wp.com/latex.php?latex=%7BR_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{R_{&#92;hat&#92;theta}(&#92;theta)}' title='{R_{&#92;hat&#92;theta}(&#92;theta)}' class='latex' /> of <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat%5Ctheta%3DX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat&#92;theta=X}' title='{&#92;hat&#92;theta=X}' class='latex' /> is a constant. In fact, <img src='http://s0.wp.com/latex.php?latex=%7BR_%7B%5Chat%5Ctheta%7D%28%5Ctheta%29%3Dk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{R_{&#92;hat&#92;theta}(&#92;theta)=k}' title='{R_{&#92;hat&#92;theta}(&#92;theta)=k}' class='latex' /> for all <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' /> where <img src='http://s0.wp.com/latex.php?latex=%7Bk%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{k}' title='{k}' class='latex' /> is the dimension of <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta}' title='{&#92;theta}' class='latex' />. 
The risk <img src='http://s0.wp.com/latex.php?latex=%7BR_%7B%5Ctheta%5E%2A%7D%28%5Ctheta%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{R_{&#92;theta^*}(&#92;theta)}' title='{R_{&#92;theta^*}(&#92;theta)}' class='latex' /> of the James-Stein estimator is less than the risk of <img src='http://s0.wp.com/latex.php?latex=%7BX%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X}' title='{X}' class='latex' />, but, <img src='http://s0.wp.com/latex.php?latex=%7BR_%7B%5Ctheta%5E%2A%7D%28%5Ctheta%29%5Crightarrow+k%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{R_{&#92;theta^*}(&#92;theta)&#92;rightarrow k}' title='{R_{&#92;theta^*}(&#92;theta)&#92;rightarrow k}' class='latex' /> as <img src='http://s0.wp.com/latex.php?latex=%7B%7C%7C%5Ctheta%7C%7C%5Crightarrow+%5Cinfty%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{||&#92;theta||&#92;rightarrow &#92;infty}' title='{||&#92;theta||&#92;rightarrow &#92;infty}' class='latex' />. So they have the same <em>maximum risk</em>.</p>
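<p>
To see the risk climb towards k, one can evaluate it directly. The sketch below relies on two standard identities that are not derived in this post, so treat them as assumptions here: the James-Stein risk equals k - (k-2)^2 E[1/||X||^2], and E[1/||X||^2] = E[1/(k-2+2N)] where N is Poisson with mean ||theta||^2/2. The function name and the truncation rule for the Poisson sum are hypothetical choices.</p>
<pre>
```python
import math

def james_stein_risk(theta_norm_sq, k):
    """Risk of James-Stein at ||theta||^2 = theta_norm_sq, via a Poisson mixture."""
    mean = theta_norm_sq / 2.0
    if mean == 0.0:
        # At theta = 0, E[1/||X||^2] = 1/(k-2), so the risk is k - (k-2) = 2.
        return k - (k - 2) ** 2 / (k - 2)
    # Truncate the Poisson sum well beyond its mean (assumed cutoff).
    n_terms = int(mean + 12.0 * math.sqrt(mean) + 50)
    expected_inverse = 0.0
    for n in range(n_terms):
        # Poisson weight computed on the log scale to avoid overflow.
        log_weight = -mean + n * math.log(mean) - math.lgamma(n + 1)
        expected_inverse += math.exp(log_weight) / (k - 2 + 2 * n)
    return k - (k - 2) ** 2 * expected_inverse

k = 3
norms = [0.0, 1.0, 10.0, 100.0, 10000.0]
risks = [james_stein_risk(v, k) for v in norms]
for v, r in zip(norms, risks):
    print(v, r)
```
</pre>
<p>
The risk starts at 2 when theta is the origin and increases towards k = 3 as the norm of theta grows, without ever reaching it.</p>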
<p>
On the one hand, this tells us that a minimax estimator can be inadmissible. On the other hand, in some sense it can&#8217;t be &#8220;too far&#8221; from admissible since they have the same maximum risk.</p>
<p>
Stein first reported the paradox in 1956. I suspect that fewer and fewer people include the Stein paradox in their teaching. (I&#8217;m guilty.) This is a shame. Paradoxes really grab students&#8217; attention. And, in this case, the paradox is really fundamental to many things including shrinkage estimators, hierarchical Bayes, and function estimation.</p>
<p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/05/18/steins-paradox/feed/</wfw:commentRss>
		<slash:comments>31</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>Aaronson, COLT, Bayesians and Frequentists</title>
		<link>http://normaldeviate.wordpress.com/2013/05/05/aaronson-colt-bayesians-and-frequentists/</link>
		<comments>http://normaldeviate.wordpress.com/2013/05/05/aaronson-colt-bayesians-and-frequentists/#comments</comments>
		<pubDate>Mon, 06 May 2013 01:05:02 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=412</guid>
		<description><![CDATA[Aaronson, COLT, Bayesians and Frequentists I am reading Scott Aaronson&#8217;s book &#8220;Quantum Computing Since Democritus&#8221; which can be found here. The book is about computational complexity, quantum mechanics, quantum computing and many other things. It&#8217;s a great book and I highly recommend it. Much of the material on complexity classes is tough going but you [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=normaldeviate.wordpress.com&#038;blog=36942929&#038;post=412&#038;subd=normaldeviate&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><p align="center"> Aaronson, COLT, Bayesians and Frequentists </p>
<p>
I am reading Scott Aaronson&#8217;s book &#8220;Quantum Computing Since Democritus&#8221; which can be found <a class="snap_noshots" href="http://www.amazon.com/Quantum-Computing-since-Democritus-Aaronson/dp/0521199565/ref=sr_1_1?s=books&amp;ie=UTF8&amp;qid=1367788194&amp;sr=1-1&amp;keywords=scott+aaronson">here</a>.</p>
<p>
The book is about computational complexity, quantum mechanics, quantum computing and many other things. It&#8217;s a great book and I highly recommend it. Much of the material on complexity classes is tough going but you can skim over some of the details and still enjoy the book. (That&#8217;s what I am doing.) There are at least 495 different complexity classes: see the <a class="snap_noshots" href="https://complexityzoo.uwaterloo.ca/Complexity_Zoo">complexity zoo</a>. I don&#8217;t know how anyone can keep track of this.</p>
<p>
Anyway, there is a chapter on computational learning theory that I wanted to comment on. (There is another chapter about probabilistic reasoning and the anthropic principle which I&#8217;ll comment on in a future post.) Scott gives a clear introduction to learning theory and he correctly traces the birth of the theory to Leslie Valiant&#8217;s 1984 paper that introduced PAC (probably approximately correct) learning. He also contrasts PAC learning with Bayesian learning.</p>
<p>
Now I want to put on my statistical curmudgeon hat and complain about computational learning theory. My complaint is this: the discovery of computational learning theory is nothing but the re-discovery of a 100-year-old statistical idea called a &#8220;confidence interval.&#8221;</p>
<p>
Let&#8217;s review some basic learning theory. Let <img src='http://s0.wp.com/latex.php?latex=%7B%7B%5Ccal+R%7D%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{{&#92;cal R}}' title='{{&#92;cal R}}' class='latex' /> denote all axis aligned rectangles in the plane. We can think of each rectangle <img src='http://s0.wp.com/latex.php?latex=%7BR%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{R}' title='{R}' class='latex' /> as a classifier: predict <img src='http://s0.wp.com/latex.php?latex=%7BY%3D1%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{Y=1}' title='{Y=1}' class='latex' /> if <img src='http://s0.wp.com/latex.php?latex=%7BX%5Cin+R%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X&#92;in R}' title='{X&#92;in R}' class='latex' /> and predict <img src='http://s0.wp.com/latex.php?latex=%7BY%3D0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{Y=0}' title='{Y=0}' class='latex' /> if <img src='http://s0.wp.com/latex.php?latex=%7BX%5Cnotin+R%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{X&#92;notin R}' title='{X&#92;notin R}' class='latex' />. Suppose we have data <img src='http://s0.wp.com/latex.php?latex=%7B%28X_1%2CY_1%29%2C%5Cldots%2C+%28X_n%2CY_n%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{(X_1,Y_1),&#92;ldots, (X_n,Y_n)}' title='{(X_1,Y_1),&#92;ldots, (X_n,Y_n)}' class='latex' />. If we pick the rectangle <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+R%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat R}' title='{&#92;hat R}' class='latex' /> that makes the fewest classification errors on the data, will we predict well on a new observation? More formally: is the empirical risk estimator a good predictor?</p>
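<p>
To make the setup concrete, here is a brute-force Python sketch of empirical risk minimization over axis-aligned rectangles. It is an illustration under assumptions, not an efficient algorithm: it only tries rectangles whose edges pass through data coordinates (which loses nothing on the sample), and the simulated data, the true rectangle and the function names are hypothetical.</p>
<pre>
```python
import itertools
import numpy as np

def predict(lo, hi, points):
    """Predict 1 for points inside the rectangle [lo, hi], 0 otherwise."""
    inside = np.logical_and(points >= lo, hi >= points)
    return np.all(inside, axis=1).astype(int)

def erm_rectangle(points, labels):
    """Search for a rectangle with the fewest classification errors on the data."""
    xs = np.unique(points[:, 0])
    ys = np.unique(points[:, 1])
    # Start from the empty rectangle, which predicts 0 everywhere.
    best_rect, best_risk = None, np.mean(labels != 0)
    for x_lo, x_hi in itertools.combinations_with_replacement(xs, 2):
        for y_lo, y_hi in itertools.combinations_with_replacement(ys, 2):
            lo, hi = np.array([x_lo, y_lo]), np.array([x_hi, y_hi])
            risk = np.mean(predict(lo, hi, points) != labels)
            if best_risk > risk:
                best_rect, best_risk = (lo, hi), risk
    return best_rect, best_risk

# Noise-free labels from a hypothetical true rectangle.
rng = np.random.default_rng(0)
points = rng.uniform(size=(15, 2))
true_lo, true_hi = np.array([0.2, 0.3]), np.array([0.7, 0.8])
labels = predict(true_lo, true_hi, points)

best_rect, best_risk = erm_rectangle(points, labels)
print(best_rect, best_risk)
```
</pre>
<p>
Because the labels here come from a rectangle with no noise, the search finds a rectangle with zero training error. The interesting question, taken up next, is how far the training error can be from the error on a new observation.</p>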
<p>
Yes. The reason is simple. Let <img src='http://s0.wp.com/latex.php?latex=%7BL%28R%29+%3D+%5Cmathbb%7BP%7D%28Y+%5Cneq+I%28X%5Cin+R%29%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{L(R) = &#92;mathbb{P}(Y &#92;neq I(X&#92;in R))}' title='{L(R) = &#92;mathbb{P}(Y &#92;neq I(X&#92;in R))}' class='latex' /> be the prediction risk and let
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++L_n%28R%29+%3D+%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En+I%28Y_i+%5Cneq+I%28X_i%5Cin+R%29%29+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  L_n(R) = &#92;frac{1}{n}&#92;sum_{i=1}^n I(Y_i &#92;neq I(X_i&#92;in R)) ' title='&#92;displaystyle  L_n(R) = &#92;frac{1}{n}&#92;sum_{i=1}^n I(Y_i &#92;neq I(X_i&#92;in R)) ' class='latex' /></p>
<p> be the empirical estimate of the risk. We would like to claim that <img src='http://s0.wp.com/latex.php?latex=%7B%5Chat+R%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;hat R}' title='{&#92;hat R}' class='latex' /> is close to the best classifier in <img src='http://s0.wp.com/latex.php?latex=%7B%7B%5Ccal+R%7D%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{{&#92;cal R}}' title='{{&#92;cal R}}' class='latex' />. That is, we would like to show that <img src='http://s0.wp.com/latex.php?latex=%7BL%28%5Chat+R%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{L(&#92;hat R)}' title='{L(&#92;hat R)}' class='latex' /> is close to <img src='http://s0.wp.com/latex.php?latex=%7B%5Cinf_%7BR%5Cin+%7B%5Ccal+R%7D%7D+L%28R%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;inf_{R&#92;in {&#92;cal R}} L(R)}' title='{&#92;inf_{R&#92;in {&#92;cal R}} L(R)}' class='latex' />, with high probability. This fact follows easily if we can show that
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Csup_%7BR%5Cin+%7B%5Ccal+R%7D%7D+%7C+L_n%28R%29+-+L%28R%29%7C+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;sup_{R&#92;in {&#92;cal R}} | L_n(R) - L(R)| ' title='&#92;displaystyle  &#92;sup_{R&#92;in {&#92;cal R}} | L_n(R) - L(R)| ' class='latex' /></p>
<p> is small with high probability. And this does hold since the VC dimension of <img src='http://s0.wp.com/latex.php?latex=%7B%7B%5Ccal+R%7D%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{{&#92;cal R}}' title='{{&#92;cal R}}' class='latex' /> is finite. Specifically, the key fact is that, for any distribution of the data <img src='http://s0.wp.com/latex.php?latex=%7BP%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{P}' title='{P}' class='latex' />, we have
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++P%5En%5CBigl%28%5Csup_%7BR%5Cin+%7B%5Ccal+R%7D%7D+%7C+L_n%28R%29+-+L%28R%29%7C+%3E+%5Cepsilon%5CBigr%29+%5Cleq+c_1+%5Cexp%5Cleft%28+-+c_2+n+%5Cepsilon%5E2+%5Cright%29+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  P^n&#92;Bigl(&#92;sup_{R&#92;in {&#92;cal R}} | L_n(R) - L(R)| &gt; &#92;epsilon&#92;Bigr) &#92;leq c_1 &#92;exp&#92;left( - c_2 n &#92;epsilon^2 &#92;right) ' title='&#92;displaystyle  P^n&#92;Bigl(&#92;sup_{R&#92;in {&#92;cal R}} | L_n(R) - L(R)| &gt; &#92;epsilon&#92;Bigr) &#92;leq c_1 &#92;exp&#92;left( - c_2 n &#92;epsilon^2 &#92;right) ' class='latex' /></p>
<p> for known constants <img src='http://s0.wp.com/latex.php?latex=%7Bc_1%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{c_1}' title='{c_1}' class='latex' /> and <img src='http://s0.wp.com/latex.php?latex=%7Bc_2%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{c_2}' title='{c_2}' class='latex' />.</p>
<p>
But this is equivalent to saying that a <img src='http://s0.wp.com/latex.php?latex=%7B1-%5Calpha%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{1-&#92;alpha}' title='{1-&#92;alpha}' class='latex' /> confidence interval for <img src='http://s0.wp.com/latex.php?latex=%7BL%28%5Chat+R%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{L(&#92;hat R)}' title='{L(&#92;hat R)}' class='latex' /> is
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++C_n+%3D+%5BL_n%28%5Chat+R%29+-+%5Cepsilon_n%2C%5C+L_n%28%5Chat+R%29+%2B+%5Cepsilon_n%5D+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  C_n = [L_n(&#92;hat R) - &#92;epsilon_n,&#92; L_n(&#92;hat R) + &#92;epsilon_n] ' title='&#92;displaystyle  C_n = [L_n(&#92;hat R) - &#92;epsilon_n,&#92; L_n(&#92;hat R) + &#92;epsilon_n] ' class='latex' /></p>
<p> where
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Cepsilon_n+%3D+%5Csqrt%7B%5Cfrac%7B1%7D%7Bn+c_2%7D%5Clog%5Cleft%28%5Cfrac%7Bc_1%7D%7B%5Calpha%7D%5Cright%29%7D.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;epsilon_n = &#92;sqrt{&#92;frac{1}{n c_2}&#92;log&#92;left(&#92;frac{c_1}{&#92;alpha}&#92;right)}. ' title='&#92;displaystyle  &#92;epsilon_n = &#92;sqrt{&#92;frac{1}{n c_2}&#92;log&#92;left(&#92;frac{c_1}{&#92;alpha}&#92;right)}. ' class='latex' /></p>
<p> That is, for any distribution <img src='http://s0.wp.com/latex.php?latex=%7BP%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{P}' title='{P}' class='latex' />,
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++P%5En%28+L%28%5Chat+R%29+%5Cin+C_n%29%5Cgeq+1-%5Calpha.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  P^n( L(&#92;hat R) &#92;in C_n)&#92;geq 1-&#92;alpha. ' title='&#92;displaystyle  P^n( L(&#92;hat R) &#92;in C_n)&#92;geq 1-&#92;alpha. ' class='latex' /></p>
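<p>
The half-width comes from setting the right-hand side of the exponential bound equal to alpha and solving for epsilon. A minimal Python sketch follows. The post does not give values for c_1 and c_2, so the numbers used below are hypothetical placeholders, as is the empirical risk of 0.10.</p>
<pre>
```python
import math

def epsilon_n(n, alpha, c1, c2):
    """Half-width solving c1 * exp(-c2 * n * eps^2) = alpha (requires c1 > alpha)."""
    return math.sqrt(math.log(c1 / alpha) / (n * c2))

def confidence_interval(empirical_risk, n, alpha, c1, c2):
    """Interval [L_n - eps_n, L_n + eps_n] around the empirical risk."""
    eps = epsilon_n(n, alpha, c1, c2)
    return empirical_risk - eps, empirical_risk + eps

# Placeholder constants, chosen only for illustration.
c1, c2 = 8.0, 1.0 / 32.0
for n in [1_000, 10_000, 100_000]:
    print(n, confidence_interval(0.10, n, alpha=0.05, c1=c1, c2=c2))
```
</pre>
<p>
The interval shrinks at the rate one over the square root of n, and it depends on alpha only through a logarithm. With conservative constants it can be wider than [0, 1] for small n, in which case it says nothing useful.</p>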
<p>
As Scott points out, what distinguishes this type of reasoning from Bayesian reasoning, is that we require this to hold uniformly, and that there are no priors involved. To quote from his book:</p>
<p>
<em>This goes against a belief in the Bayesian religion, that if your priors are different then you come to an entirely different conclusion. The Bayesian starts out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes&#8217;s rule.</p>
<p>
That&#8217;s one way to do it, but computational learning theory tells us that it&#8217;s not the only way. You don&#8217;t need to start out with any assumptions about a probability distribution over the hypotheses &#8230; you&#8217;d like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians&#8217; probability distribution over hypotheses for a probability distribution over sample data.</em></p>
<p>
(Note: &#8220;hypothesis&#8221; = classifier and &#8220;concept class&#8221; = set of classifiers, &#8220;learn&#8221; = estimate).</p>
<p>
Now, I agree completely with the above quote. But as I said, it is basically the definition of a frequentist confidence interval.</p>
<p>
So my claim is that computational learning theory is just the application of frequentist confidence intervals to classification.</p>
<p>
There is nothing bad about that. The people who first developed learning theory were probably not aware of existing statistical theory so they re-developed it themselves and they did it right.</p>
<p>
But it&#8217;s my sense &#8212; and correct me if I&#8217;m wrong &#8212; that many people in computational learning theory are still woefully ignorant about the field of statistics. It would be nice if someone in the field read the statistics literature and said: &#8220;Hey, these statistics guys did this 50 years ago!&#8221;</p>
<p>
Am I being too harsh?</p>
<p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/05/05/aaronson-colt-bayesians-and-frequentists/feed/</wfw:commentRss>
		<slash:comments>14</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>The Perils of Hypothesis Testing &#8230; Again</title>
		<link>http://normaldeviate.wordpress.com/2013/04/27/the-perils-of-hypothesis-testing-again/</link>
		<comments>http://normaldeviate.wordpress.com/2013/04/27/the-perils-of-hypothesis-testing-again/#comments</comments>
		<pubDate>Sun, 28 Apr 2013 02:16:51 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=408</guid>
		<description><![CDATA[A few months ago I posted about John Ioannidis&#8217; article called &#8220;Why Most Published Research Findings Are False.&#8221; Ioannidis is once again making news by publishing a similar article aimed at neuroscientists. This paper is called &#8220;Power failure: why small sample size undermines the reliability of neuroscience.&#8221; The paper is written by Button, Ioannidis, Mokrysz, [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=normaldeviate.wordpress.com&#038;blog=36942929&#038;post=408&#038;subd=normaldeviate&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>
A few months ago I posted about <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/12/27/most-findings-are-false/">John Ioannidis&#8217;</a> article called &#8220;Why Most Published Research Findings Are False.&#8221;</p>
<p>
Ioannidis is once again making news by publishing a similar article aimed at neuroscientists. This <a class="snap_noshots" href="http://www.nature.com/nrn/journal/v14/n5/full/nrn3475.html">paper</a> is called &#8220;Power failure: why small sample size undermines the reliability of neuroscience.&#8221; The paper is written by Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson and Munafo.</p>
<p>
When I discussed the first article, I said that his points were correct but hardly surprising. I thought it was fairly obvious that <img src='http://s0.wp.com/latex.php?latex=%7BP%28A%7CH_0%29+%5Cneq+P%28H_0%7CA%29%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{P(A|H_0) &#92;neq P(H_0|A)}' title='{P(A|H_0) &#92;neq P(H_0|A)}' class='latex' /> where <img src='http://s0.wp.com/latex.php?latex=%7BA%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{A}' title='{A}' class='latex' /> is the event that a result is declared significant and <img src='http://s0.wp.com/latex.php?latex=%7BH_0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{H_0}' title='{H_0}' class='latex' /> is the event that the null hypothesis is true. But the fact that the paper had such a big impact made me realize that perhaps I was too optimistic. Apparently, this fact does need to be pointed out.</p>
<p>
The new paper has basically the same message although the emphasis is on the dangers of low power. Let us assume that for a fraction of studies <img src='http://s0.wp.com/latex.php?latex=%7B%5Cpi%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;pi}' title='{&#92;pi}' class='latex' />, the null is actually false. That is <img src='http://s0.wp.com/latex.php?latex=%7BP%28H_0%29+%3D+1-%5Cpi%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{P(H_0) = 1-&#92;pi}' title='{P(H_0) = 1-&#92;pi}' class='latex' />. Let <img src='http://s0.wp.com/latex.php?latex=%7B%5Cgamma%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;gamma}' title='{&#92;gamma}' class='latex' /> be the power. Then the probability of a false discovery, assuming we reject <img src='http://s0.wp.com/latex.php?latex=%7BH_0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{H_0}' title='{H_0}' class='latex' /> when the p-value is less than <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' />, is
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++P%28H_0%7CA%29+%3D+%5Cfrac%7B+P%28A%7CH_0%29+P%28H_0%29%7D%7B+P%28A%7CH_0%29+P%28H_0%29%2B+P%28A%7CH_1%29+P%28H_1%29%7D+%3D+%5Cfrac%7B%5Calpha+%281-%5Cpi%29%7D%7B%5Calpha+%281-%5Cpi%29%2B+%5Cgamma+%5Cpi%7D.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  P(H_0|A) = &#92;frac{ P(A|H_0) P(H_0)}{ P(A|H_0) P(H_0)+ P(A|H_1) P(H_1)} = &#92;frac{&#92;alpha (1-&#92;pi)}{&#92;alpha (1-&#92;pi)+ &#92;gamma &#92;pi}. ' title='&#92;displaystyle  P(H_0|A) = &#92;frac{ P(A|H_0) P(H_0)}{ P(A|H_0) P(H_0)+ P(A|H_1) P(H_1)} = &#92;frac{&#92;alpha (1-&#92;pi)}{&#92;alpha (1-&#92;pi)+ &#92;gamma &#92;pi}. ' class='latex' /></p>
<p> Let us suppose, for the sake of illustration, that <img src='http://s0.wp.com/latex.php?latex=%7B%5Cpi+%3D+0.1%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;pi = 0.1}' title='{&#92;pi = 0.1}' class='latex' /> (most nulls are true). Then the probability of a false discovery (using <img src='http://s0.wp.com/latex.php?latex=%7B%5Calpha%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;alpha}' title='{&#92;alpha}' class='latex' /> = 0.05) looks like this as a function of power:</p>
<p><a href="http://normaldeviate.files.wordpress.com/2013/04/false.png"><img src="http://normaldeviate.files.wordpress.com/2013/04/false.png?w=300&#038;h=300" alt="False" width="300" height="300" class="aligncenter size-medium wp-image-409" /></a></p>
<p>
So indeed, if the power is low, the chance of a false discovery is high. (And things are worse if we include the effects of bias.)</p>
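<p>
For readers who want the numbers and not just the curve, here is a small sketch that evaluates the formula above. The function name and the grid of power values are mine, chosen only for illustration. The defaults match the illustrative values used above (fraction of false nulls 0.1, level 0.05).</p>

```python
def false_discovery_probability(power, pi=0.1, alpha=0.05):
    """P(H0 | A) = alpha (1 - pi) / (alpha (1 - pi) + power * pi)."""
    false_positives = alpha * (1.0 - pi)  # P(A | H0) P(H0)
    true_positives = power * pi           # P(A | H1) P(H1)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Arbitrary grid of power values, from badly underpowered to perfect power.
    for power in (0.1, 0.2, 0.5, 0.8, 1.0):
        prob = false_discovery_probability(power)
        print(f"power = {power:.1f}   P(H0 | A) = {prob:.3f}")
```

<p>
With these values, power 0.8 gives 0.045/0.125 = 0.36, while power 0.2 gives 0.045/0.065, or about 0.69. Even with perfect power the probability is 0.045/0.145, about 0.31, because most nulls are true.</p>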
<p>
The authors go on to estimate the power of typical neuroscience studies. They conclude that the typical power is between .08 and .31. I applaud them for trying to come up with some estimate of the typical power but I doubt that the estimate is very reliable.</p>
<p>
The paper concludes with a number of sensible recommendations such as: performing power calculations before doing a study, disclosing methods transparently and so on. I wish they had included one more recommendation: focus less on testing and more on estimation.</p>
<p>
So, as with the first paper, I am left with the feeling that this message, too, is correct but not surprising. But I guess that these points are not so obvious to many users of statistics. In that case, papers like these serve an important function.</p>
<p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/04/27/the-perils-of-hypothesis-testing-again/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>

		<media:content url="http://normaldeviate.files.wordpress.com/2013/04/false.png?w=300" medium="image">
			<media:title type="html">False</media:title>
		</media:content>
	</item>
		<item>
		<title>Data Science: The End of Statistics?</title>
		<link>http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/</link>
		<comments>http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/#comments</comments>
		<pubDate>Sat, 13 Apr 2013 12:29:32 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=406</guid>
		<description><![CDATA[Data Science: The End of Statistics? As I see newspapers and blogs filled with talk of &#8220;Data Science&#8221; and &#8220;Big Data&#8221; I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines. The very fact that [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=normaldeviate.wordpress.com&#038;blog=36942929&#038;post=406&#038;subd=normaldeviate&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>
Data Science: The End of Statistics?</p>
<p>
As I see newspapers and blogs filled with talk of &#8220;Data Science&#8221; and &#8220;Big Data&#8221; I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.</p>
<p>
The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data &#8212; a field called statistics &#8212; is alarming. I like what <a class="snap_noshots" href="http://kbroman.wordpress.com/2013/04/05/data-science-is-statistics/">Karl Broman</a> says:</p>
<p>
<em>When physicists do mathematics, they don&#8217;t say they&#8217;re doing &#8220;number science&#8221;. They&#8217;re doing math.</p>
<p>
If you&#8217;re analyzing data, you&#8217;re doing statistics. You can call it data science or informatics or analytics or whatever, but it&#8217;s still statistics.</em></p>
<p>
Well put.</p>
<p>
Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don&#8217;t think so. It&#8217;s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.</p>
<p>
Two questions come to mind:</p>
<p>
1. Why do statisticians find themselves left out?</p>
<p>
2. What can we do about it?</p>
<p>
I&#8217;d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.</p>
<p><ol>
<li> Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like.</p>
<p>
The scientist goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.</p>
<p>
Who would you go to?</p>
<p>
I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.</p>
<li> Generally speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of size 10 billion. And each data point had dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.
</ol>
<p>
On to question 2. What do we do about it?</p>
<p>
Whining won&#8217;t help. We can complain that &#8220;data scientists&#8221; are ignoring biases, not computing standard errors, not stating and checking assumptions and so on. No one is listening.</p>
<p>
First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.</p>
<p>
Second, we need to hire CS people to be on the faculty in statistics departments. This won&#8217;t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?</p>
<p>
Third, statistics needs a separate division at NSF. Simply renaming DMS (Division of Mathematical Sciences), as has been debated, isn&#8217;t enough. We need our own pot of money. (I realize this isn&#8217;t going to happen.)</p>
<p>
To summarize, I don&#8217;t really have any ideas. Does anyone?</p>
<p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/feed/</wfw:commentRss>
		<slash:comments>77</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>
	</item>
		<item>
		<title>Super-efficiency: &#8220;The Nasty, Ugly Little Fact&#8221;</title>
		<link>http://normaldeviate.wordpress.com/2013/04/05/super-efficiency-the-nasty-ugly-little-fact/</link>
		<comments>http://normaldeviate.wordpress.com/2013/04/05/super-efficiency-the-nasty-ugly-little-fact/#comments</comments>
		<pubDate>Fri, 05 Apr 2013 17:39:53 +0000</pubDate>
		<dc:creator>normaldeviate</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://normaldeviate.wordpress.com/?p=402</guid>
		<description><![CDATA[Super-efficiency: The Nasty, Ugly Little Fact I just read Steve Stigler&#8217;s wonderful article entitled: &#8220;The Epic Story of Maximum Likelihood.&#8221; I don&#8217;t know why I didn&#8217;t read this paper earlier. Like all of Steve&#8217;s papers, it is at once entertaining and scholarly. I highly recommend it to everyone. As the title suggests, the paper discusses [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=normaldeviate.wordpress.com&#038;blog=36942929&#038;post=402&#038;subd=normaldeviate&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>
Super-efficiency: The Nasty, Ugly Little Fact</p>
<p>
I just read Steve Stigler&#8217;s wonderful <a class="snap_noshots" href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.ss/1207580174">article</a> entitled: &#8220;The Epic Story of Maximum Likelihood.&#8221; I don&#8217;t know why I didn&#8217;t read this paper earlier. Like all of Steve&#8217;s papers, it is at once entertaining and scholarly. I highly recommend it to everyone.</p>
<p>
As the title suggests, the paper discusses the history of maximum likelihood with a focus on Fisher&#8217;s &#8220;proof&#8221; that the maximum likelihood estimator is optimal. The &#8220;nasty, ugly little fact&#8221; is the problem of super-efficiency.</p>
<p>
<p><b>1. Hodges&#8217; Example </b></p>
<p><p>
Suppose that
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++X_1%2C+%5Cldots%2C+X_n+%5Csim+N%28%5Ctheta%2C1%29.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  X_1, &#92;ldots, X_n &#92;sim N(&#92;theta,1). ' title='&#92;displaystyle  X_1, &#92;ldots, X_n &#92;sim N(&#92;theta,1). ' class='latex' /></p>
<p> The maximum likelihood estimator (mle) is
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Chat%5Ctheta+%3D+%5Coverline%7BX%7D_n+%3D+%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi%3D1%7D%5En+X_i.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;hat&#92;theta = &#92;overline{X}_n = &#92;frac{1}{n}&#92;sum_{i=1}^n X_i. ' title='&#92;displaystyle  &#92;hat&#92;theta = &#92;overline{X}_n = &#92;frac{1}{n}&#92;sum_{i=1}^n X_i. ' class='latex' /></p>
<p> We&#8217;d like to be able to say that the mle is, in some sense, optimal.</p>
<p>
The usual way we teach this is to point out that <img src='http://s0.wp.com/latex.php?latex=%7BVar%28%5Chat%5Ctheta%29+%3D+1%2Fn%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{Var(&#92;hat&#92;theta) = 1/n}' title='{Var(&#92;hat&#92;theta) = 1/n}' class='latex' /> and that any other consistent estimator must have a variance which is at least this large (asymptotically).</p>
<p>
Hodges&#8217; famous example shows that this is not quite right. Hodges&#8217; estimator is:
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++T_n+%3D+%5Cbegin%7Bcases%7D+%5Coverline%7BX%7D_n+%26+%5Cmbox%7Bif+%7D+%7C%5Coverline%7BX%7D_n%7C+%5Cgeq+%5Cfrac%7B1%7D%7Bn%5E%7B1%2F4%7D%7D%5C%5C+0+%26+%5Cmbox%7Bif+%7D+%7C%5Coverline%7BX%7D_n%7C+%3C+%5Cfrac%7B1%7D%7Bn%5E%7B1%2F4%7D%7D.+%5Cend%7Bcases%7D+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  T_n = &#92;begin{cases} &#92;overline{X}_n &amp; &#92;mbox{if } |&#92;overline{X}_n| &#92;geq &#92;frac{1}{n^{1/4}}&#92;&#92; 0 &amp; &#92;mbox{if } |&#92;overline{X}_n| &lt; &#92;frac{1}{n^{1/4}}. &#92;end{cases} ' title='&#92;displaystyle  T_n = &#92;begin{cases} &#92;overline{X}_n &amp; &#92;mbox{if } |&#92;overline{X}_n| &#92;geq &#92;frac{1}{n^{1/4}}&#92;&#92; 0 &amp; &#92;mbox{if } |&#92;overline{X}_n| &lt; &#92;frac{1}{n^{1/4}}. &#92;end{cases} ' class='latex' /></p>
<p>
If <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%5Cneq+0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta&#92;neq 0}' title='{&#92;theta&#92;neq 0}' class='latex' /> then eventually <img src='http://s0.wp.com/latex.php?latex=%7BT_n+%3D+%5Coverline%7BX%7D_n%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{T_n = &#92;overline{X}_n}' title='{T_n = &#92;overline{X}_n}' class='latex' /> and hence
<p align="center"><img src='http://s0.wp.com/latex.php?latex=%5Cdisplaystyle++%5Csqrt%7Bn%7D%28T_n+-+%5Ctheta%29+%5Crightsquigarrow+N%280%2C1%29.+&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='&#92;displaystyle  &#92;sqrt{n}(T_n - &#92;theta) &#92;rightsquigarrow N(0,1). ' title='&#92;displaystyle  &#92;sqrt{n}(T_n - &#92;theta) &#92;rightsquigarrow N(0,1). ' class='latex' /></p>
<p> But if <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%3D0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta =0}' title='{&#92;theta =0}' class='latex' />, then eventually <img src='http://s0.wp.com/latex.php?latex=%7B%5Coverline%7BX%7D_n%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;overline{X}_n}' title='{&#92;overline{X}_n}' class='latex' /> is in the window <img src='http://s0.wp.com/latex.php?latex=%7B%5B-n%5E%7B-1%2F4%7D%2Cn%5E%7B-1%2F4%7D%5D%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{[-n^{-1/4},n^{-1/4}]}' title='{[-n^{-1/4},n^{-1/4}]}' class='latex' /> and hence <img src='http://s0.wp.com/latex.php?latex=%7BT_n+%3D+0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{T_n = 0}' title='{T_n = 0}' class='latex' />; that is, it is exactly equal to the true value. Thus, when <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta+%5Cneq+0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta &#92;neq 0}' title='{&#92;theta &#92;neq 0}' class='latex' />, <img src='http://s0.wp.com/latex.php?latex=%7BT_n%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{T_n}' title='{T_n}' class='latex' /> behaves like the mle. But when <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%3D0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta=0}' title='{&#92;theta=0}' class='latex' />, it is better than the mle.</p>
<p>
Hence, the mle is not optimal, at least, not in the sense Fisher claimed.</p>
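<p>
The behaviour of the Hodges estimator is easy to check by simulation. Below is a minimal sketch. The function names, the number of replications and the seed are my own choices, not part of the example. It estimates the risk scaled by n, a quantity that equals 1 for the mle at every value of the parameter.</p>

```python
import math
import random


def hodges(xbar, n):
    """Hodges estimator: keep the sample mean unless it is within n^(-1/4) of 0."""
    return xbar if abs(xbar) >= n ** -0.25 else 0.0


def scaled_risk(theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of n * E[(T_n - theta)^2].

    The mean of n draws from N(theta, 1) is exactly N(theta, 1/n), so it is
    drawn directly instead of simulating the n observations one by one.
    """
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)
        total += (hodges(xbar, n) - theta) ** 2
    return n * total / reps


if __name__ == "__main__":
    n = 10000  # the threshold n^(-1/4) is 0.1 for this n
    for theta in (0.0, 0.05, 0.1, 1.0):
        print(f"theta = {theta:<5} n * risk = {scaled_risk(theta, n):.3f}")
```

<p>
At 0 the scaled risk is essentially zero and at 1 it is close to 1, the same as the mle. At values inside or on the edge of the window, the estimator is frequently truncated to 0 and the scaled risk is far larger than 1.</p>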
<p>
<p><b>2. Rescuing the mle </b></p>
<p><p>
Does this mean that the claim that the mle is optimal is doomed? Not quite. Here is a picture (from Wikipedia) of the risk of the Hodges estimator for various values of <img src='http://s0.wp.com/latex.php?latex=%7Bn%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{n}' title='{n}' class='latex' />:</p>
<p><a href="http://normaldeviate.files.wordpress.com/2013/04/hodges2.png"><img src="http://normaldeviate.files.wordpress.com/2013/04/hodges2.png?w=300&#038;h=300" alt="hodges2" width="300" height="300" class="aligncenter size-medium wp-image-403" /></a></p>
<p>
There is a price to pay for the small risk at <img src='http://s0.wp.com/latex.php?latex=%7B%5Ctheta%3D0%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;theta=0}' title='{&#92;theta=0}' class='latex' />: the risk for values near 0 is huge. Can we leverage the picture above into a precise statement about optimality?</p>
<p>
First, if we look at the maximum risk rather than the pointwise risk then we see that the mle is optimal. Indeed, <img src='http://s0.wp.com/latex.php?latex=%7B%5Coverline%7BX%7D_n%7D&amp;bg=ffffff&amp;fg=000000&amp;s=0' alt='{&#92;overline{X}_n}' title='{&#92;overline{X}_n}' class='latex' /> is the unique estimator that is minimax for all bowl-shaped loss functions. See <a class="snap_noshots" href="http://normaldeviate.wordpress.com/2012/07/17/minimax-theory-saves-the-world/">my earlier post on this</a>.</p>
<p>
Second, Le Cam showed that the mle is optimal among all <em>regular</em> estimators. These are estimators whose distribution is not affected by small changes in the parameter. This is known as Le Cam&#8217;s convolution theorem because he showed that the limiting distribution of any regular estimator is equal to the distribution of the mle plus (convolved with) another distribution. (There are, of course, regularity assumptions involved.)</p>
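<p>
In symbols, the convolution theorem can be sketched as follows. The notation is my assumption, not taken from the references: I(&#92;theta) is the Fisher information of a single observation and M_&#92;theta is some probability distribution that depends on the estimator.</p>

```latex
% Assumed notation: I(\theta) = Fisher information of one observation,
% M_\theta = a probability distribution depending on the estimator T_n,
% * = convolution.
\sqrt{n}\,(T_n - \theta) \rightsquigarrow N\bigl(0,\, I(\theta)^{-1}\bigr) * M_\theta
\qquad \text{for every regular estimator } T_n .
```

<p>
Equivalently, the limit is distributed as Z + W where Z is the normal limit of the mle and W is independent noise, so no regular estimator can have a more concentrated limit than the mle. In the Normal example above, I(&#92;theta) = 1.</p>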
<p>
Chapter 8 of van der Vaart (1998) is a good reference for these results.</p>
<p>
<p><b>3. Why Do We Care? </b></p>
<p><p>
The idea of all of this was not to rescue the claim that &#8220;the mle is optimal&#8221; at any cost. Rather, we had a situation where it was intuitively clear that something was true in some sense but it was difficult to make it precise.</p>
<p>
Making the sense in which the mle is optimal precise represents an intellectual breakthrough in statistics. The deep mathematical tools that Le Cam developed have been used in many aspects of statistical theory. Two reviews of Le Cam theory can be found <a class="snap_noshots" href="http://projecteuclid.org/DPubS?service=UI&amp;version=1.0&amp;verb=Display&amp;handle=euclid.aos/1028674836">here</a> and <a class="snap_noshots" href="http://arxiv.org/abs/1107.3811">here</a>.</p>
<p>
That the mle is optimal seemed intuitively clear and yet turned out to be a subtle and deep fact. Are there other examples of this in Statistics and Machine Learning?</p>
<p>
<p><b> References </b></p>
<p><p>
Stigler, S. (2007). The epic story of maximum likelihood. <em>Statistical Science</em>, 22, 598-620.</p>
<p>
van der Vaart, A. (1998). <em>Asymptotic Statistics</em>. Cambridge University Press.</p>
<p>
van der Vaart, A. (2002). The statistical work of Lucien Le Cam. <em>Annals of Statistics</em>, 30, 631-682.</p>
<p>
]]></content:encoded>
			<wfw:commentRss>http://normaldeviate.wordpress.com/2013/04/05/super-efficiency-the-nasty-ugly-little-fact/feed/</wfw:commentRss>
		<slash:comments>17</slash:comments>
	
		<media:content url="http://0.gravatar.com/avatar/37312c618a28c7d016d4bbe4060f23b1?s=96&#38;d=identicon&#38;r=G" medium="image">
			<media:title type="html">normaldeviate</media:title>
		</media:content>

		<media:content url="http://normaldeviate.files.wordpress.com/2013/04/hodges2.png?w=300" medium="image">
			<media:title type="html">hodges2</media:title>
		</media:content>
	</item>
	</channel>
</rss>
