Most Findings Are False

Many of you may know this paper by John Ioannidis called “Why Most Published Research Findings Are False.” Some people seem to think that the paper proves that there is something wrong with significance testing. This is not the correct conclusion to draw, as I’ll explain.

I will also mention a series of papers on a related topic by David Madigan; the papers are referenced at the end of this post. Madigan’s papers are more important than Ioannidis’ papers. Mathbabe has an excellent post about Madigan’s work.

Let’s start with Ioannidis. As the title suggests, the paper claims that many published results are false. This is not surprising to most statisticians and epidemiologists. Nevertheless, the paper has received much attention. Let’s suppose, as Ioannidis does, that “publishing a finding” is synonymous with “doing a test and finding that it is significant.” There are many reasons why published papers might have false findings. Among them are:

  1. From elementary probability

    \displaystyle  P(false\ positive|paper\ published) \neq P(false\ positive|null\ hypothesis\ true).

    In fact, the left hand side can be much larger than the right hand side but it is the quantity on the right hand side that we control with hypothesis testing.

  2. Bias. There are many biases in studies so even if the null hypothesis is true, the p-value will not have a Uniform (0,1) distribution. This leads to extra false rejections. There are too many sources of potential bias to list but common ones include: unobserved confounding variables and the tendency to only report studies with small p-values.

These facts are well-known, thus I was surprised that the paper received so much attention. All good epidemiologists know these things and they regard published findings with suitable caution. So, to me, this seems like much ado about nothing. Published findings are considered “suggestions of things to look into,” not “definitive final results.” Nor is this a condemnation of significance testing which is just a tool and, like all tools, should be properly understood. If a fool smashes his finger with a hammer we don’t condemn hammers. (The problem, if there is one, is not testing, but the press, who do report every study as if some definitive truth has been uncovered. But that’s a different story.)

Let me be clear about this: I am not suggesting we should treat every scientific problem as if it is a hypothesis testing problem. And if you have reason to include prior information into an analysis then by all means do so. But unless you have magic powers, simply doing a Bayesian analysis isn’t going to solve the problems above.

Let’s compute the probability of a false finding given that a paper is published. To do so, we will make numerous simplifying assumptions. Imagine we have a stream of studies. In each study, there are only two hypotheses, the null {H_0} and the alternative {H_1}. In some fraction {\pi} of the studies, {H_0} is true. Let {A} be the event that a study gets published. We do hypothesis testing and we publish just when we reject {H_0} at level {\alpha}. Assume further that every test has the same power {1-\beta}. Then the fraction of published studies with false findings is

\displaystyle  P(H_0|A) = \frac{P(A|H_0)P(H_0)}{P(A|H_0)P(H_0) + P(A|H_1)P(H_1)} = \frac{ \alpha \pi}{ \alpha \pi + (1-\beta)(1-\pi)}.

It’s clear that {P(H_0|A)} can be quite different from {\alpha}. We could recover {P(H_0|A)} if we knew {\pi}; but we don’t know {\pi} and just inserting your own subjective guess isn’t much help. And once we remove all the simplifying assumptions, it becomes much more complicated. But this is beside the point because the bigger issue is bias.
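To get a feel for how far {P(H_0|A)} can drift from {\alpha}, here is a minimal sketch that just evaluates the formula above. The function name and the particular values of {\alpha}, the power and {\pi} are illustrative assumptions, not numbers from Ioannidis' paper.

```python
def prob_false_finding(alpha, power, pi):
    """P(H0 | published), assuming a study is published exactly when
    H0 is rejected at level alpha and every test has the same power."""
    false_pos = alpha * pi            # P(A | H0) P(H0)
    true_pos = power * (1.0 - pi)     # P(A | H1) P(H1)
    return false_pos / (false_pos + true_pos)

# Illustrative values only: alpha = 0.05, power = 0.8, several guesses for pi.
for pi in (0.5, 0.9, 0.99):
    print(pi, round(prob_false_finding(alpha=0.05, power=0.8, pi=pi), 3))
```

With these made-up inputs, when 90 percent of the tested nulls are true, about 36 percent of the published findings are false even though every test was run at level 0.05.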

The bias problem is indeed serious. It infects any analysis you might do: tests, confidence intervals, Bayesian inference, or whatever your favorite method is. Bias transcends arguments about the choice of statistical methods.

Which brings me to Madigan. David Madigan and his co-workers have spent years doing sensitivity analyses on observational studies. This has been a huge effort involving many people and a lot of work.

They considered numerous studies and asked: what happens if we tweak the database, the study design, etc.? The results, although not surprising, are disturbing. The estimates of the effects vary wildly. And this only accounts for a small amount of the biases that can enter a study.

I do not have links to David’s papers (most are still in review) so I can’t show you all the pictures but here is one screenshot:

[Figure: Madigan screenshot]

Each horizontal line is one study; the dots show how the estimates change as one design variable is tweaked. This picture is just the tip of the iceberg. (It would be interesting to see if the type of sensitivity analysis proposed by Paul Rosenbaum is able to reveal the sensitivity of studies but it’s not clear if that will do the job.)

To summarize: many published findings are indeed false. But don’t blame this on significance testing, frequentist inference or incompetent epidemiologists. If anything, it is bias. But really, it is simply a fact. The cure is to educate people (and especially the press) that just because a finding is published doesn’t mean it’s true. And I think that the sensitivity analysis being developed by David Madigan and his colleagues will turn out to be essential.

References

Ryan, P.B., Madigan, D., Stang, P.E., Overhage, J.M., Racoosin, J.A., Hartzema, A.G. (2012). Empirical Assessment of Analytic Methods for Risk Identification in Observational Healthcare Data: Results from the Experiments of the Observational Medical Outcomes Partnership. Statistics in Medicine, to appear.

Ryan, P., Suchard, M.A., and Madigan, D. (2012). Learning from epidemiology: Interpreting observational studies for the effects of medical products. Submitted.

Schuemie, M.J., Ryan, P., DuMouchel, W., Suchard, M.A., and Madigan, D. (2012). Significantly misleading: Why p-values in observational studies are wrong and how to correct them. Submitted.

Madigan, D., Ryan, P., Schuemie, M., Stang, P., Overhage, M., Hartzema, A., Suchard, M.A., DuMouchel, W., and Berlin, J. (2012). Evaluating the impact of database heterogeneity on observational studies.

Guest Post: ROB TIBSHIRANI

Today we have a guest post by my good friend Rob Tibshirani. Rob has a list of nine great statistics papers. (He is too modest to include his own papers.) Have a look and let us know what papers you would add to the list. And what machine learning papers would you add? Enjoy.

9 Great Statistics papers published after 1970
Rob Tibshirani

I was thinking about influential and awe-inspiring papers in Statistics and thought it would be fun to make a list. This list will show my bias in favor of practical work, and by its omissions, my ignorance of many important subfields of Statistics. I hope that others will express their own opinions.

  1. Regression models and life tables (with discussion) (Cox 1972). A beautiful and elegant solution to an extremely important practical problem. Has had an enormous impact in medical science. David Cox deserves the Nobel Prize in Medicine for this work.
  2. Generalized linear models (Nelder and Wedderburn 1972). Formulated the class of generalized regression models for exponential family distributions. Provided the framework for the GLIM package and the S and R modelling languages.
  3. Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion) (Dempster, Laird, and Rubin 1977). Brought together many related ideas for dealing with missing or messy data, in one conceptually simple and powerful framework.
  4. Bootstrap methods: another look at the jackknife (Efron 1979). Introduced one of the first computer-intensive statistical tools. Widely used in many scientific fields.
  5. Classification and regression trees (Breiman, Friedman, Olshen and Stone 1984). Not a paper, but a book. Among the first proposals for data mining to demonstrate the power of a detailed practical implementation of a method, including cross-validation for model selection.
  6. How biased is the apparent error rate of a prediction rule? (Efron 1986). Greatly advanced our understanding of training and test error rates, and overfitting and ways to deal with them.
  7. Sampling based approaches to calculating marginal densities (Gelfand and Smith 1990). Building on earlier work by Geman and Geman, Tanner and Wong, and others, this paper developed a simple and elegant sampling-based method for estimating marginal densities. Huge impact on Bayesian work.
  8. Controlling the false discovery rate: a practical and powerful approach to multiple testing (Benjamini and Hochberg 1995). Introduced the FDR and a selection procedure whose FDR is controlled at a given level. Enormously influential in the modern age of high-dimensional data.
  9. A decision-theoretic generalization of online learning and an application to boosting (Freund and Schapire 1997). Not a statistics paper per se, but one that introduced one of the most powerful supervised learning methods and changed the way that many of us thought about the prediction problem.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society Series B, 57, 289-300.

Breiman, L. and Friedman, J. and Olshen, R. and Stone, C. (1984). Classification and Regression Trees, Wadsworth, New York.

Cox, D.R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society Series B, 34, 187-220.

Dempster, A., Laird, N. and Rubin, D. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm (with discussion). Journal of the Royal Statistical Society Series B, 39, 1-38.

Efron, B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics, 7, 1-26.

Efron, B. (1986). How biased is the apparent error rate of a prediction rule? Journal of the American Statistical Association, 81, 461-470.

Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of online learning and an application to boosting. Journal of Computer and System Sciences, 55, 119-139.

Gelfand, A. and Smith, A. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Nelder, J.A. and Wedderburn, R.W. (1972). Generalized linear models. Journal of the Royal Statistical Society Series A, 135, 370-384.

New Names For Statistical Methods

Statisticians and Computer Scientists have done a pretty poor job of thinking of names for procedures. Names are important. No one is going to use a method called “the Stalin-Mussolini Matrix Completion Algorithm.” But who would pass up the opportunity to use the “Schwarzenegger-Shatner Statistic”? So, I have decided to offer some suggestions for re-naming some of our procedures. I am open to further suggestions.

Bayesian Inference. Bayes did use his famous theorem to do a calculation. But it was really Laplace who systematically used Bayes’ theorem for inference.
New Name: Laplacian Inference.

Bayesian Nets. A Bayes net is just a directed acyclic graph endowed with a probability distribution. This has nothing to do with Bayesian (oops, I mean Laplacian) inference. According to Wikipedia, it was Judea Pearl who came up with the name.
New Name: Pearl Graph.

The Bayes Classification Rule. Given {(X,Y)}, with {Y\in \{0,1\}}, the optimal classifier is to guess that {Y=1} when {P(Y=1|X=x)\geq 1/2} and to guess that {Y=0} when {P(Y=1|X=x)< 1/2}. This is often called the Bayes rule. This is confusing for many reasons. Since this rule is a sort of gold standard, how about:
New Name: The Golden Rule.

Unbiased Estimator. Talk about a name that promises more than it delivers.
New Name: Mean Centered Estimator.

Credible Set. This is a set with a specified posterior probability content such as: here is a 95 percent credible set. Might as well make it sound more exciting.
New Name: Incredible Set.

Confidence Interval. I am tempted to suggest “Uniform Frequency Coverage Set” but that’s clumsy. However, it does yield a good acronym if you permute the letters a bit.
New Name: Coverage Set.

The Bootstrap. If I remember correctly, Brad Efron considered several names and John Tukey suggested “the shotgun.” Brad, you should have listened to Tukey.
New Name: The Shotgun.

Causal Inference. For some reason, whenever I try to type “causal” I end up typing “casual.” Anyway, the mere mention of causation upsets some people. Some people call causal inference “the analysis of treatment effects” but that’s boring. I suggest we go with the opposite of casual:
New Name: Formal Inference.

The Central Limit Theorem. Boring! For historical reasons I suggest:
New Name: de Moivre’s Theorem.

The Law of Large Numbers. Another boring name. Again, to respect history I suggest:
New Name: Bernoulli’s Theorem.

Minimum Variance Unbiased Estimator. Let’s just eliminate this one.

The lasso. Nice try Rob, but most people don’t even know what it stands for. How about this:
New Name: the Taser. (Tibshirani’s Awesome Sparse Estimator for regression).

Stigler’s law of eponymy. If you don’t know what this is, check it out on Wikipedia. Then you’ll understand why its name should be:
New Name: Stigler’s law of eponymy.

Neural nets. Let’s call them what they are.
(Not so) New name: Nonlinear regression.

p-values. I hope you’ll agree that this is a less than inspiring name. The best I can come up with is:
New Name: Fisher Statistic.

Support Vector Machines. This might get the award for the worst name ever. Sounds like some industrial device in a factory. Since we already like the acronym VC, I suggest:
New Name: Vapnik Classifier.

U-statistic. I think this one is obvious.
New Name: iStatistic.

Kernels. In statistics, this refers to a type of local smoothing, such as kernel density estimation and Nadaraya-Watson kernel regression. Some people use “Parzen Window” which sounds like something you buy when remodeling your house. But in Machine Learning it is used to refer to Mercer kernels, which play a part in Reproducing Kernel Hilbert Spaces. We don’t really need new names; we just need to clarify how we use the terms:
New Usage: Smoothing Kernels for density estimators etc. Mercer kernels for kernels that generate an RKHS.

Reproducing Kernel Hilbert Space. Saying this phrase is exhausting. The acronym RKHS is not much better. If we used history as a guide we’d say Aronszajn-Bergman space but that’s just as clumsy. How about:
New Name: Mercer Space.

0. No constant is used more than 0. Since no one else has ever named it, this is my chance for a place in history.
New Name: Wasserman’s Constant.

FLAT PRIORS IN FLATLAND: STONE’S PARADOX

Mervyn Stone is Emeritus Professor at University College London. He is famous for his work on Bayesian inference and for pioneering work on cross-validation, coordinate-free multivariate analysis, and many other topics.

Today I want to discuss a famous example of his, described in Stone (1970, 1976, 1982). In technical jargon, he shows that “a finitely additive measure on the free group with two generators is nonconglomerable.” In English: even for a simple problem with a discrete parameter space, flat priors can lead to surprises. Fortunately, you don’t need to know anything about free groups to understand this example.

1. Hunting For a Treasure In Flatland

I wander randomly in a two-dimensional grid-world. I drag an elastic string with me. The string is taut: if I back up, the string leaves no slack. I can only move in four directions: North, South, West, East.

I wander around for a while then I stop and bury a treasure. Call the path {\theta}. Here is an example:

[Figure flatland1: an example path {\theta}]

Now I take one more random step. Each direction has equal probability. Call the final path {x}. So it might look like this:

[Figure flatland2: the final path {x} after one more step]

Two people, Bob (a Bayesian) and Carla (a classical statistician) want to find the treasure. There are only four possible paths that could have yielded {x}, namely:

[Figure flatland3: the four possible paths that could have yielded {x}]

Let us call these four paths N, S, W, E. The likelihood is the same for each of these. That is, {p(x|\theta) = 1/4} for {\theta\in \{N , S, W , E\}}. Suppose Bob uses a flat prior. Since the likelihood is also flat, his posterior is

\displaystyle  P(\theta = N|x) = P(\theta = S|x) = P(\theta = W|x) = P(\theta = E|x) = \frac{1}{4}.

Let {B} be the three paths that extend {x}. In this example, {B = \{N,W,E\}}. Then {P(\theta\in B|x) = 3/4}.

Now Carla is very confident and selects a confidence set with only one path, namely, the path that shortens {x}. In other words, Carla’s confidence set is {C=B^c}.

Notice the following strange thing: no matter what {\theta} is, Carla gets the treasure with probability 3/4 while Bob gets the treasure with probability 1/4. That is, {P(\theta\in B|x) = 3/4} but the coverage of {B} is 1/4. In other words, {P(\theta\in B|\theta) =1/4} for every {\theta}. On the other hand, the coverage of {C} is 3/4: {P(\theta\in C|\theta) = 3/4} for every {\theta}.
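The two coverage claims are easy to check by simulation. Here is a minimal sketch, assuming the wandering phase is a fixed number of uniformly random steps (the post does not say how long I wander, so `n_wander` is an assumption) and representing a taut path as a list of moves in which a step backwards retracts the string.

```python
import random

STEPS = "NSEW"
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def step(path, move):
    """Apply one move to a taut-string path (a list of moves)."""
    if path and path[-1] == OPPOSITE[move]:
        return path[:-1]      # backing up: the string retracts
    return path + [move]      # otherwise the string gets one step longer

def play(rng, n_wander=20):
    """One treasure hunt; returns (Bob wins, Carla wins)."""
    theta = []
    for _ in range(n_wander):
        theta = step(theta, rng.choice(STEPS))
    x = step(theta, rng.choice(STEPS))
    bob_wins = len(theta) == len(x) + 1   # theta is one of the three paths extending x
    carla_wins = x[:-1] == theta          # theta is the one path that shortens x
    return bob_wins, carla_wins

def coverage(n_games=20000, seed=0):
    """Fraction of games in which Bob's set B and Carla's set C contain theta."""
    rng = random.Random(seed)
    results = [play(rng) for _ in range(n_games)]
    bob = sum(b for b, _ in results) / n_games
    carla = sum(c for _, c in results) / n_games
    return bob, carla

print(coverage())
```

The two frequencies should come out near 1/4 for Bob and 3/4 for Carla, even though Bob's posterior assigns probability 3/4 to his set in every single game.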

Here is a quote from Stone (1976) (except that I changed his B and C to Bob and Carla):

“ … it is clear that when Bob and Carla repeatedly engage in this treasure hunt, Bob will find that his posterior probability assignment becomes increasingly discrepant with his proportion of wins and that Carla is, somehow, doing better than [s]he ought. However, there is no message … that will allow Bob to escape from his Promethean situation; he cannot learn from his experience because each hunt is independent of the other.”

2. More Trouble For Bob

Let {A} be the event that the final step reduces the length of the string. Using his posterior distribution, Bob finds that {P(A|x) = 3/4} for each {x}. Since this holds for each {x}, Bob deduces that {P(A)=3/4}.

On the other hand, Bob notes that {P(A|\theta)=1/4} for every {\theta}. Hence, {P(A) = 1/4}.

Bob has just proved that {3/4 = 1/4}.

3. The Source of The Problem

The apparent contradiction stems from the fact that the prior is improper. Technically this is an example of the non-conglomerability of finitely additive measures. For a rigorous explanation of why this happens you should read Stone’s papers. Here is an abbreviated explanation, from Kass and Wasserman (1996, Section 4.2.1).

Let {\pi} denote Bob’s improper flat prior and let {\pi(\theta|x)} denote his posterior distribution. Let {\pi_p} denote the prior that is uniform on the set of all paths of length at most {p}. This is of course a proper prior. For any fixed {x}, {\pi_p(A|x) \rightarrow 3/4} as {p\rightarrow \infty}. So Bob can claim that his posterior distribution is a limit of well-defined posterior distributions. However, we need to look at this more closely. Let {m_p(x) = \sum_\theta f(x|\theta)\pi_p(\theta)} be the marginal of {x} induced by {\pi_p}. Let {X_p} denote all {x}’s of length {p} or {p+1}. When {x\in X_p}, {\pi_p(\theta|x)} is a poor approximation to {\pi(\theta|x)} since the former is concentrated on a single point while the latter is concentrated on four points. In fact, the total variation distance between {\pi_p(\theta|x)} and {\pi(\theta|x)} is 3/4 for {x\in X_p}. (Recall that the total variation distance between two probability measures {P} and {Q} is {d(P,Q) = \sup_A |P(A)-Q(A)|}.) Furthermore, {X_p} is a set with high probability: {m_p(X_p)\rightarrow 2/3} as {p\rightarrow \infty}.

While {\pi_p(\theta|x)} converges to {\pi(\theta|x)} as {p\rightarrow\infty} for any fixed {x}, they are not close with high probability.
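The limit {m_p(X_p)\rightarrow 2/3} can be checked with exact arithmetic. This is a sketch under one stated assumption: {\pi_p} is taken to be uniform over all taut paths of length at most {p} (the reading under which the 2/3 limit comes out). It uses the fact that there is one path of length 0 and {4\cdot 3^{k-1}} taut paths of length {k\geq 1}. The function names are mine.

```python
from fractions import Fraction

def n_paths(k):
    """Number of taut paths of length k (one empty path, then 4 * 3^(k-1))."""
    return 1 if k == 0 else 4 * 3 ** (k - 1)

def prob_extend(k):
    """Probability that the extra step lengthens a path of length k."""
    return Fraction(1) if k == 0 else Fraction(3, 4)

def marginal_mass_Xp(p):
    """m_p(X_p): probability that x has length p or p+1 when theta ~ pi_p."""
    total = sum(n_paths(k) for k in range(p + 1))
    mass = Fraction(0)
    # |x| = p+1 requires |theta| = p followed by an extending step.
    # |x| = p   requires |theta| = p-1 followed by an extending step,
    # because |theta| = p+1 lies outside the support of pi_p.
    for k in (p - 1, p):
        if k >= 0:
            mass += Fraction(n_paths(k), total) * prob_extend(k)
    return mass

for p in (1, 2, 5, 10, 20):
    print(p, float(marginal_mass_Xp(p)))
```

For every such {x} the only candidate {\theta} inside the support of {\pi_p} is the path that shortens {x}, which is why {\pi_p(\theta|x)} is a point mass there.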

This problem disappears if you use a proper prior.

4. The Four Sided Die

Here is another description of the problem. Consider a four sided die whose sides are labeled with the symbols {\{a,b,a^{-1},b^{-1}\}}. We roll the die several times and we record the label on the lowermost face (there is no uppermost face on a four-sided die). A typical outcome might look like this string of symbols:

\displaystyle  a\ \ a\ b\ a^{-1}\ b\ b^{-1}\ b\ a\ a^{-1}\ b

Now we apply an annihilation rule. If {a} and {a^{-1}} appear next to each other, we eliminate these two symbols. Similarly, if {b} and {b^{-1}} appear next to each other, we eliminate those two symbols. So the sequence above gets reduced to:

\displaystyle  a\ \ a\ b\ a^{-1}\ b\ b
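The annihilation rule is a one-pass stack operation. Here is a small sketch; writing the faces as the strings `a`, `b`, `a^-1`, `b^-1` is just a convenient encoding I chose for illustration.

```python
def inverse(symbol):
    """Inverse of a die face: 'a' <-> 'a^-1', 'b' <-> 'b^-1'."""
    return symbol[0] if symbol.endswith("^-1") else symbol + "^-1"

def reduce_word(symbols):
    """Apply the annihilation rule until no adjacent inverse pair remains."""
    out = []
    for s in symbols:
        if out and out[-1] == inverse(s):
            out.pop()          # s cancels the symbol just before it
        else:
            out.append(s)
    return out

rolls = ["a", "a", "b", "a^-1", "b", "b^-1", "b", "a", "a^-1", "b"]
print(reduce_word(rolls))
```

Run on the ten rolls above, this returns the six-symbol reduced sequence shown in the text.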

Let us denote the resulting string of symbols, after removing annihilations, by {\theta}. Now we toss the die one more time. We add this last symbol to {\theta} and we apply the annihilation rule once more. This results in a string which we will denote by {x}.

You get to see {x} and you want to infer {\theta}.

Having observed {x}, there are four possible values of {\theta} and each has the same likelihood. For example, suppose {x =(a,a)}. Then {\theta} has to be one of the following:

\displaystyle  (a),\ \ (a\,a\,a),\ \ (a\,a\,b^{-1}),\ \ (a\,a\,b)

The likelihood function is constant over these four values.
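To see where the four candidates come from: if the last roll was {s}, then {\theta} is {x} with {s^{-1}} appended and reduced. As {s} ranges over the four faces, so does {s^{-1}}, so it is enough to append each face to {x} in turn. A minimal sketch, using the same illustrative string encoding of the faces as an assumption:

```python
FACES = ["a", "b", "a^-1", "b^-1"]

def inverse(symbol):
    """Inverse of a die face: 'a' <-> 'a^-1', 'b' <-> 'b^-1'."""
    return symbol[0] if symbol.endswith("^-1") else symbol + "^-1"

def append_and_reduce(word, symbol):
    """Append one symbol to an already reduced word and annihilate if needed."""
    if word and word[-1] == inverse(symbol):
        return word[:-1]
    return word + [symbol]

def candidate_thetas(x):
    """All values of theta that could have produced the observed x."""
    return [append_and_reduce(x, face) for face in FACES]

for theta in candidate_thetas(["a", "a"]):
    print(theta)
```

Each candidate reaches {x} through exactly one face of the die, which is why all four have likelihood 1/4.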

Suppose we use a flat prior on {\theta}. Then the posterior is uniform on these four possibilities. Let {B = B(x)} denote the three values of {\theta} that are longer than {x}. Then the posterior satisfies

\displaystyle  P(\theta\in B|x) = 3/4.

Thus {B(x)} is a 75 percent posterior confidence set.

However, the frequentist coverage of {B(x)} is 1/4. To see this, fix any {\theta}. Now note that {B(x)} contains {\theta} if and only if {x} is shorter than {\theta}, that is, if and only if the last symbol rolled annihilates the final symbol of {\theta}. This occurs with probability 1/4.

5. Likelihood

Another consequence of Stone’s example is that, in my opinion, it shows that the Likelihood Principle is bogus. According to the likelihood principle, the observed likelihood function contains all the useful information in the data. In this example, the likelihood does not distinguish the four possible parameter values.

But the direction of the string from the current position — which does not affect the likelihood — clearly has lots of information.

6. Proper Priors

If you want to have some fun, try coming up with proper priors on the set of paths. Then simulate the example, find the posterior and try to find the treasure.

Better yet, have a friend simulate a path. Then you choose a prior, compute the posterior and guess where the treasure is. Repeat the game many times. Your friend generates a different path every time. If you try this, I’d be interested to hear about the simulation results.

Another question this example raises is: should we ever use improper priors? Flat priors that do not have mass can be interpreted as finitely additive priors. The father of Bayesian inference, Bruno de Finetti, was adamant in rejecting the axiom of countable additivity. He thought flat priors like Bob’s were fine.

It seems to me that in modern Bayesian inference, there is not universal agreement on whether flat priors are evil or not. In some cases they work fine; in others they don’t. For example, poorly chosen improper priors in random effects models can lead to improper (non-integrable) posteriors. But other improper priors don’t cause this problem.
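As one possible starting point for this game, here is a sketch of a posterior calculation under a proper prior. The prior is purely an assumption chosen for illustration: the path length is geometric with parameter `r`, and all taut paths of a given length are equally likely. Since the likelihood is 1/4 for each of the four candidates, the posterior is just the prior renormalized over them.

```python
STEPS = "NSEW"
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def step(path, move):
    """Apply one move to a taut path stored as a tuple of moves."""
    if path and path[-1] == OPPOSITE[move]:
        return path[:-1]
    return path + (move,)

def n_paths(k):
    """Number of taut paths of length k."""
    return 1 if k == 0 else 4 * 3 ** (k - 1)

def prior(theta, r):
    """Hypothetical proper prior: geometric in length, uniform within a length."""
    k = len(theta)
    return (1 - r) * r ** k / n_paths(k)

def posterior(x, r):
    """Posterior over the four candidate paths; the flat likelihood cancels."""
    candidates = [step(x, move) for move in STEPS]
    weights = [prior(theta, r) for theta in candidates]
    total = sum(weights)
    return {theta: w / total for theta, w in zip(candidates, weights)}

post = posterior(("N", "E"), r=0.9)
for theta, prob in post.items():
    print(theta, round(prob, 4))
```

Under this particular prior, for any {x} of length at least 2 the path that shortens {x} gets posterior probability {1/(1+r^2/3)}, which is always above 3/4, so the Bayesian would dig where Carla digs. Other proper priors will behave differently.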

In Stone’s example I think that most statisticians would reject Bob’s flat prior-based Bayesian inference.

7. Conclusion

I have always found this example to be interesting because it seems very simple and, at least at first, one doesn’t expect there to be a problem with using a flat prior. Technically the problems arise because there is group structure and the group is not amenable. Hidden beneath this seemingly simple example is some rather deep group theory.

Many of Stone’s papers are gems. They are not easy reading (with the exception of the 1976 paper) but they are worth the effort.

8. References

Stone, M. (1970). Necessary and sufficient condition for convergence in probability to invariant posterior distributions. The Annals of Mathematical Statistics, 41, 1349-1353.

Stone, M. (1976). Strong inconsistency from uniform priors. Journal of the American Statistical Association, 71, 114-116.

Stone, M. (1982). Review and analysis of some inconsistencies related to improper priors and finite additivity. Studies in Logic and the Foundations of Mathematics, 104, 413-426.

Kass, R.E. and Wasserman, L. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343-1370.

Nate Silver is a Frequentist: Review of “the signal and the noise”

Review of “the signal and the noise” by Nate Silver

There are not very many self-made statisticians, let alone self-made statisticians who become famous and get hired by the New York Times. Nate Silver is a fascinating person. And his book, “the signal and the noise,” is a must-read for anyone interested in statistics.

The book is about prediction. Silver chronicles successes and failures in the art of prediction and he does so with clear prose and a knack for good storytelling.

Along the way, we learn about his unusual life path. He began as an economic consultant for KPMG. But his real passion was predicting player performance in baseball. He developed PECOTA, a statistical baseball analysis system which earned him a reputation as a crack forecaster. He quit his day job and made a living playing online poker. Then he turned to political forecasting, first at the Daily Kos and later at his own website, FiveThirtyEight.com. His accurate predictions drew media attention and in 2010 he became a blogger and writer for the New York Times.

The book catalogues notable successes and failures in prediction. The first topic is the failure of ratings agencies to predict the bursting of the housing bubble. Actually, the bursting of the bubble was predicted, as Silver points out. The problem was that Moody’s and Standard and Poor’s either ignored or downplayed the predictions. He attributes the failure to having too much confidence in their models and not allowing for outliers. Basically, he claims, they mistook good “in-sample prediction error” for good “out-of-sample prediction error.”

Next comes a welcome criticism of bogus predictions from loud-mouthed pundits on news shows. Then, a fun chapter on how he used relatively simple statistical techniques to become a crackerjack baseball predictor. This is a theme that Silver touches on several times. If you can find a field that doesn’t rely on statistical techniques, you can become a star just by using some simple, common sense methods. He attributes his success at online poker, not to his own acumen, but to the plethora of statistical dolts who were playing online poker at the time.

He describes weather forecasting as a great success, detailing the incremental, painstaking improvements that have taken place over many years.

One of the striking facts about the book is the emphasis that Silver places on frequency calibration. (I’ll have more to say on this shortly.) He draws a plot of observed frequency versus forecast probability for the National Weather Service. The plot is nearly a straight line. In other words, of the days that the Weather Service said there was a 60 percent chance of raining, it rained 60 percent of the time.

Interestingly, the calibration plot for the Weather Channel shows a bias at the lower frequencies. Apparently, this is intentional. The loss function for the Weather Channel is different from the loss function for the National Weather Service. The latter wants accurate (calibrated) forecasts. The Weather Channel wants accuracy too, but they also want to avoid making people annoyed. It is in their best interests to over-predict rain slightly for obvious reasons: if they predict rain and it turns out to be sunny, no big deal. But if they predict sunshine and it rains, people get mad.

Next come earthquake predictions and economic predictions. He rates both as duds. He goes on to discuss epidemics, chess, gambling, the stock market, terrorism, and climatology. When discussing the accuracy of climatology forecasts he is way too forgiving (a bit of political bias?). More importantly, he ignores the fact that developing good climate policy inevitably involves economic prediction, to which he already gave a failing grade. (Is it better to spend a trillion dollars helping Micronesia develop a stronger economy so they don’t rely so much on farming close to the shore, or to spend the money on reducing carbon output and hence delay rising sea levels by two years? Climate policy is inextricably tied to economics.)

Every chapter has interesting nuggets. I especially liked the chapter on computer chess. I knew that Deep Blue beat Garry Kasparov but beyond that, I didn’t know much. The book gives lots of juicy details.

As you can see, I liked the book very much and I highly recommend it.

But …

I have one complaint. Silver is a big fan of Bayesian inference, which is fine. Unfortunately, he falls into that category I referred to a few posts ago. He confuses “Bayesian inference” with “using Bayes’ theorem.” His description of frequentist inference is terrible. He seems to equate frequentist inference with Fisherian significance testing, mostly using Normal distributions. Either he learned statistics from a bad book or he hangs out with statisticians with a significant anti-frequentist bias.

Have no doubt about it: Nate Silver is a frequentist. For example, he says:

“One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated.”

It does not get much more frequentist than that. And if using Bayes’ theorem helps you achieve long run frequency calibration, great. If it didn’t, I have no doubt he would have used something else. But his goal is clearly to have good long run frequency behavior.

This theme continues throughout the book. Here is another quote from Chapter 6:

“A 90 percent prediction interval, for instance, is supposed to cover 90 percent of the possible real-world outcomes, … If the economists’ forecasts were as accurate as they claimed, we’d expect the actual value for GDP to fall within their prediction interval nine times out of ten …”

That’s the definition of frequentist coverage. In Chapter 10 he does some data analysis on poker. He uses regression analysis with some data-splitting. No Bayesian stuff here.

I don’t know if any statisticians proof-read this book but if they did, it’s too bad they didn’t clarify for Silver what Bayesian inference and frequentist inference really are.

But perhaps I am belaboring this point too much. This is meant to be a popular book, after all, and if it helps to make statistics seem cool and important, then it will have served an important function.

So try not to be as pedantic as me when reading the book. Just enjoy it. I used to tell people at parties that I am an oil-fire fighter. Now I’ll say: “I’m a statistician. You know. Like that guy Nate Silver.” And perhaps people won’t walk away.

Screening and False Discovery Rates

Today we have another guest post. This one is by Ryan Tibshirani, an Assistant Professor in my department. (You might also want to check out the course on Optimization that Ryan teaches jointly with Geoff Gordon.)

Screening and False Discovery Rates
by Ryan Tibshirani

Two years ago, as a TA for Emmanuel Candes’ Theory of Statistics course at Stanford University, I posed a question about screening and false discovery rates on a class homework. Last year, for my undergraduate Data Mining course at CMU, I re-used the same question. The question generated some interesting discussion among the students, so I thought it would be fun to share the idea here. Depending on just how popular Larry’s blog is (or becomes), I may not be able to use it again for this year’s Data Mining course! The question is inspired by conversations at the Hastie-Tibs-Taylor group meetings at Stanford.

Consider a two class problem, genetics inspired, with {n} people and {m} gene expression measurements. The people are divided into two groups: {n/2} are healthy, and {n/2} are sick. We have {m_0} null genes (in which there is actually no underlying difference between healthy and sick patients), and {m-m_0} non-nulls (in which there actually is a difference). All measurements are independent.

Suppose that we compute a two sample {t}-statistic {t_j} for each gene {j=1,\ldots m}. We want to call some genes significant, by thresholding these {t}-statistics (in absolute value); we then want to estimate the false discovery rate (FDR) of this thresholding rule, which is

\displaystyle  \mathrm{FDR} = \mathop{\mathbb E}\left[\frac{number\  of\  null\  genes\  called\  significant}{number\ of\  genes\  called\  significant}\right].

The Benjamini-Hochberg (BH) procedure provides a way to do this, which is best explained using the {p}-values {p_1,\ldots p_m} from {t_1,\ldots t_m}, respectively. We first sort the {p}-values {p_{(1)} \leq \ldots \leq p_{(m)}}; then, given a level {q}, we find the largest {k} such that

\displaystyle  p_{(k)} \leq q \frac{k}{m},

and call the {p}-values {p_{(1)},\ldots p_{(k)}} (and the corresponding genes) significant. It helps to think of this as a rule which rejects all {p}-values {p_1,\ldots p_m} satisfying {p_j \leq c}, for the cutoff {c= p_{(k)}}. The BH estimate for the FDR of this rule is simply {q}.
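For readers who like to see the rule as code, here is a minimal sketch of the BH step-up rule described above, in plain Python. The function name `benjamini_hochberg` and the toy {p}-values are my own illustration, not part of the original homework question.

```python
def benjamini_hochberg(p_values, q):
    """Return (cutoff, rejected_indices) for the BH rule at level q.

    The cutoff is p_(k) for the largest k with p_(k) <= q*k/m.
    If no such k exists, nothing is rejected and the cutoff is None.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda j: p_values[j])
    k = 0
    for rank, j in enumerate(order, start=1):
        if p_values[j] <= q * rank / m:
            k = rank  # remember the largest rank that satisfies the condition
    if k == 0:
        return None, []
    cutoff = p_values[order[k - 1]]
    # Reject every p-value at or below the cutoff c = p_(k).
    rejected = sorted(j for j in range(m) if p_values[j] <= cutoff)
    return cutoff, rejected


# Toy example (hypothetical p-values). The smallest p-value fails its own
# comparison (0.03 > 0.1 * 1/4) but is still rejected, because the rule
# looks for the largest k rather than stopping at the first failure.
p = [0.03, 0.035, 0.04, 0.9]
cutoff, rejected = benjamini_hochberg(p, q=0.1)
print(cutoff, rejected)
```

Note that the search runs over all ranks; stopping at the first rank that fails the comparison would give a different (step-down) rule.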

An alternative procedure to estimate the FDR uses a null distribution generated by permutations. This means scrambling all of the group labels (healthy/sick) uniformly at random, and then recomputing the {t}-statistics. Having done this {B} times, we let {t^{(i)}_1,\ldots t^{(i)}_m} denote the {t}-statistics computed on the {i}th permuted data set. Now consider the rule that rejects all {t}-statistics {t_1,\ldots t_m} satisfying {|t_j|>c} for some cutoff {c}. The permutation estimate for the FDR of this rule is

\displaystyle  \frac{\frac{1}{B}\sum_{i=1}^B\sum_{j=1}^m 1\{|t^{(i)}_j|>c\}} {\sum_{j=1}^m 1\{|t_j| > c\}} =   \frac{\ average\ number\   of\  null\  genes\  called\  significant\  over\  permutations}  {number\ of\  genes\  called\  significant\   in\  original\  data\  set}
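Here is a rough sketch of the permutation estimate above, using only the Python standard library. The pooled-variance form of the {t}-statistic, the function names, the convention of returning 0 when nothing is called significant, and the small toy data set are all assumptions made for illustration; they are not the code used for the plots in this post.

```python
import random
import statistics


def t_statistic(x, y):
    """Two-sample t-statistic with pooled variance (equal-variance form)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / (pooled * (1 / nx + 1 / ny)) ** 0.5


def gene_t_statistics(data, labels):
    """data[j][i] is gene j on person i; labels[i] is 0 (healthy) or 1 (sick)."""
    result = []
    for gene in data:
        healthy = [v for v, g in zip(gene, labels) if g == 0]
        sick = [v for v, g in zip(gene, labels) if g == 1]
        result.append(t_statistic(sick, healthy))
    return result


def count_calls(t_stats, c):
    """Number of genes called significant by the rule |t_j| > c."""
    return sum(abs(t) > c for t in t_stats)


def permutation_fdr(data, labels, c, B=20, seed=0):
    """Average number of calls over B label permutations, divided by observed calls."""
    rng = random.Random(seed)
    called = count_calls(gene_t_statistics(data, labels), c)
    if called == 0:
        return 0.0  # assumption: report 0 when nothing is called significant
    permuted = list(labels)
    total_null_calls = 0
    for _ in range(B):
        rng.shuffle(permuted)  # scramble the healthy/sick labels
        total_null_calls += count_calls(gene_t_statistics(data, permuted), c)
    return (total_null_calls / B) / called


# Small toy data set (much smaller than the simulation in this post).
rng = random.Random(1)
n, m, m0 = 40, 200, 180
labels = [0] * (n // 2) + [1] * (n // 2)
data = []
for j in range(m):
    shift = 0.0 if j < m0 else rng.choice([-1.0, 1.0])  # non-null genes shift the sick group
    data.append([rng.gauss(shift * g, 1.0) for g in labels])

fdr_hat = permutation_fdr(data, labels, c=3.0)
print(round(fdr_hat, 3))
```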

How good are these estimates? To answer this, we’ll look at a simulated example, in which we know the true FDR. Here we have {n=200} patients and {m=2000} genes, {m_0=1900} of which are null. The gene expression measurements are all drawn independently from a standard normal distribution with mean zero, except for the non-null genes, where the mean was chosen to be -1 or 1 (with equal probability) for the sick patients. The plot below shows the estimates as we vary the cutoff {c} (for the BH procedure, this means varying the level {q}) versus the true FDR, averaged over 10 simulated data sets. Both estimates look quite accurate, with the BH estimate being a little conservative.

[Figure: BH and permutation FDR estimates versus the true FDR, no screening]

Now what happens if, before computing these estimates, we restricted our attention to a small group of genes that looked promising in the first place? Specifically, suppose that we screened for genes based on high between-group variance (between the healthy and sick groups). The idea is to only consider the genes for which there appears to be a difference between the healthy and sick groups. Turning to our simulated example, we kept only 1000 of the 2000 genes with the highest between-group variance, and then computed the BH and permutation estimates as usual (as if we were given this screened set to begin with). The plot below shows that the FDR estimates are now quite bad, as they’re way too optimistic.

[Figure: BH and permutation FDR estimates versus the true FDR, after screening by between-group variance]

Here is the interesting part: if we screen by total variance (the variance of all gene expression measurements, pooling the healthy and sick groups), then this problem goes away. The logic behind screening by total variance is that, if there’s not much variability overall, then there’s probably no interesting difference between the healthy and sick groups. In our simulated example, we kept only 1000 of the 2000 genes with the highest total variance, and computed the BH and permutation estimates as usual. We can see below that both estimates of the FDR are actually pretty much as accurate as they were in the first place (with no screening performed), if not a little more conservative.

[Figure: BH and permutation FDR estimates versus the true FDR, after screening by total variance]

Why do you think that the estimates after screening by between-group variance and by total variance exhibit such different behaviors? I.e., why is it OK to screen by total variance but not by between-group variance? I’ll share my own thoughts in a future post.

The Density Cluster Tree: A Guest Post

Today, we have a guest post by Sivaraman Balakrishnan. Siva is a graduate student in the School of Computer Science. The topic of his post is an algorithm for clustering. The algorithm finds the connected components of the level sets of an estimate of the density. The algorithm, due to Kamalika Chaudhuri and Sanjoy Dasgupta, is very simple and comes armed with strong theoretical guarantees.

Aside: Siva’s post mentions John Hartigan. John is a living legend in the field of statistics. He’s done fundamental work on clustering, Bayesian inference, large sample theory, subsampling and many other topics. It’s well worth doing a Google search and reading some of his papers.

Before getting to Siva’s post, here is a picture of a density and some of its level set clusters. The algorithm Siva will describe finds the clusters at all levels (which then form a tree).

THE DENSITY CLUSTER TREE
by
SIVARAMAN BALAKRISHNAN

1. Introduction

Clustering is widely considered challenging both practically and theoretically. One of the main reasons for this is that often the true goals of clustering are not clear and this makes clustering seem poorly defined.

One of the most concrete and intuitive ways to define clusters when data are drawn from a density {f} is on the basis of level sets of {f}, i.e. for any {\lambda} the connected components of

\displaystyle  \{x: f(x) \geq \lambda\}

form the clusters at level {\lambda}. This leaves the question of how to select the “correct” {\lambda}, and typically we simply sweep over {\lambda} and present what is called the density cluster tree.

However, we usually do not have access to {f} and would like to estimate the cluster tree of {f} given samples drawn from {f}. Recently, Kamalika Chaudhuri and Sanjoy Dasgupta (henceforth CD) presented a simple estimator for the cluster tree in a really nice paper: Rates of convergence for the cluster tree (NIPS, 2010) and showed it was consistent in a certain sense.

This post is about the notion of consistency, the CD estimator, and its analysis.

2. Evaluating an estimator: Hartigan’s consistency

The first notion of consistency for an estimated cluster tree was introduced by J.A. Hartigan in his paper: Consistency of single linkage for high-density clusters (JASA, 1981).

Given some estimator {\Theta_n} of the cluster tree of {f} (i.e. a collection of hierarchically nested sets), we say it is consistent if:

For any sets {A,A^\prime \subset \mathbb{R}^d}, let {A_n} (respectively {A^\prime_n}) denote the smallest cluster of {\Theta_n} containing the samples in {A} (respectively {A^\prime}). {\Theta_n} is consistent if, whenever {A} and {A^\prime} are different connected components of {\{x : f(x) \geq \lambda \}} (for some {\lambda > 0}), {P(A_n \mbox{ is disjoint from } A^\prime_n) \rightarrow 1} as {n \rightarrow \infty}.

Essentially, we want that if we have two separated clusters at some level {\lambda} then the cluster tree must reflect this, i.e. the smallest clusters containing the samples from each of these clusters must be disconnected from each other.

To give finite sample bounds, CD introduced a notion of saliently separated clusters and showed that these clusters can be identified using a small number of samples (as a by-product their results also imply Hartigan consistency for their estimators). Informally, clusters are saliently separated if they satisfy two conditions.

  1. Separation in {\mathbb{R}^d}: We would expect that clusters that are too close cannot be identified in a finite sample.
  2. Separation in the density {f}: There should be a sufficiently big region of low density separating the clusters. Again, we would expect that if the “bridge” between the clusters doesn’t dip enough then we might (incorrectly) conclude they are the same cluster from a finite sample.

3. An algorithm

The CD estimator is based on the following algorithm:

  1. INPUT: {k}

  2. For each {r \in [0,\infty)}, discard all points with {r_k(x) > r}, where {r_k(x)} is the distance from {x} to its {k^{\rm th}} nearest neighbor. Connect {x,y} if {||x-y|| \leq \alpha r}, to form {G(r)}.

  3. OUTPUT: Return connected components of {G(r)}.

CD show that their estimator is consistent (and give finite sample rates for saliently separated clusters) if we select {k \sim d \log n} and {\alpha \geq \sqrt{2}}.
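To make the algorithm above concrete, here is a minimal sketch of step 2 for a single radius {r}, in plain Python. The function name `cd_components`, the brute-force distance matrix, and the toy points are illustrative assumptions. I also assume that the {k}-th nearest neighbor count includes the point itself (so {k=2} gives the distance to the nearest other point); check the CD paper for the exact convention.

```python
import math


def cd_components(points, k, r, alpha=math.sqrt(2)):
    """Connected components of G(r) for one radius r (brute force, O(n^2))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    # r_k(x): radius of the smallest ball around x holding k sample points,
    # counting x itself (an assumption about the convention).
    r_k = [sorted(row)[k - 1] for row in dist]
    kept = [i for i in range(n) if r_k[i] <= r]  # discard points with r_k(x) > r

    # Union-find over the kept points.
    parent = {i: i for i in kept}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for pos, a in enumerate(kept):
        for b in kept[pos + 1:]:
            if dist[a][b] <= alpha * r:  # connect x, y if ||x - y|| <= alpha * r
                parent[find(a)] = find(b)

    groups = {}
    for i in kept:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Two tight clumps plus one isolated "bridge" point between them (index 8).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (1.0, 0.0), (1.1, 0.0), (1.0, 0.1), (1.1, 0.1),
          (0.55, 0.05)]
print(cd_components(points, k=3, r=0.15))  # bridge point is cleaned out
print(cd_components(points, k=3, r=0.5))   # larger r: everything merges
```

Sweeping {r} from small to large and recording how the components merge gives the tree. The sketch recomputes everything for each {r}, which is wasteful but keeps the logic visible.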

It is actually true that any density estimate that is uniformly close to the true density can be used to construct a Hartigan consistent estimator. However, this involves finding the connected components of the level sets of the estimator, which can be hard. The nice thing about the CD estimator is that it is completely algorithmic.

3.1. Detour 1: Single linkage

Single linkage is a popular linkage clustering algorithm, and essentially corresponds to the case of {\alpha = 1, k = 2}. Given its popularity an important question to ask is whether the single linkage tree is Hartigan consistent. This was answered by Hartigan in his original paper, affirmatively for {d=1} but negatively for {d > 1}.

The main issue is an effect called “chaining” which causes single linkage to merge clusters before fully connecting up the clusters within themselves. The reason for this is that single-linkage is not sufficiently sensitive to the density separation, i.e. even if there is a region of low density between two clusters single linkage might form a “chain” across it because it is mostly oblivious to the density of the sample.

Returning to the CD estimator: one intuitive way to understand the estimator is to observe that for a fixed {r}, the first step discards points on the basis of their distance to their {k}-th NN. This is essentially cleaning the sample to remove points in regions of low-density (as measured by a {k}-NN density estimate). This step makes the algorithm density sensitive and prevents chaining.

4. Analysis

So how does one analyze the CD estimator? We essentially need to show that for any two saliently separated clusters at a level {\lambda}, there is some radius {r} at which:

  1. We do not clean out any points in the clusters.
  2. The clusters are internally connected at the radius {\alpha r}.
  3. The clusters are mutually separated at this radius.

To establish each of these we will first need to understand how the {k}-NN distances of the sample points behave given a finite sample.

4.1. Detour 2: Uniform large deviation inequalities

As before we are given {n} random samples {\{X_1, \ldots, X_n\}} from a density {f} on {\mathbb{R}^d}. Let’s say we are interested in a measurable subset {A} of the space {\mathbb{R}^d}. A fundamental question is: How close is the empirical mass of {A} to the true mass of {A}? i.e. we would like to relate the quantities {f_n(A) = n^{-1}\sum_{i=1}^n \mathbb{I} (X_i \in A)} and {f(A)\equiv \int_A f(x)dx}. Notice that this is essentially the same as the question: if I toss a coin with bias {f(A)}, {n} times, on average how many heads will I see?

A standard way to answer these questions quantitatively is using a large deviation inequality like Hoeffding’s inequality. Often we have multiple sets {\mathcal{A} = \{A_1, \ldots, A_N\}} and we’d like to relate {f_n(A_i)} to {f(A_i)} for each of these sets. One quantity we might be interested in is

\displaystyle \Delta = \sup_i | f_n(A_i) - f(A_i)|

and quantitative estimates of {\Delta} are called uniform convergence results.
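For a finite collection of {N} sets, a crude bound of this type follows from the standard two-sided Hoeffding inequality for averages of indicator variables together with a union bound. This is a textbook sketch, not the argument used by CD:

\displaystyle  P(\Delta > \epsilon) \leq \sum_{i=1}^N P\left(|f_n(A_i) - f(A_i)| > \epsilon\right) \leq 2N e^{-2n\epsilon^2}.

Setting the right hand side equal to {\delta} shows that, with probability at least {1-\delta}, {\Delta \leq \sqrt{\log(2N/\delta)/(2n)}}. The bound depends on the number of sets only through {\log N}.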

One surprising fact is that even for infinite collections of sets {\mathcal{A}} we can sometimes still get good bounds on {\Delta} if we can control the complexity of {\mathcal{A}}. Typical ways to quantify this complexity are things like covering numbers, VC dimension etc.

To conclude this detour here is an example of this in action. Let {\mathcal{A}} be the collection of all balls (with any center and radius) in {\mathbb{R}^d}. If {k \geq d \log n} then, with probability {1-\delta}, for any {B \in \mathcal{A}}

\displaystyle  \begin{array}{rcl}  f(B) \geq \frac{k}{n} + C \log (1/\delta) \frac{\sqrt{kd \log n}}{n} & \implies & f_n(B) \geq \frac{k}{n} \\ f(B) \leq \frac{k}{n} - C \log (1/\delta) \frac{\sqrt{kd \log n}}{n} & \implies & f_n(B) < \frac{k}{n} \end{array}

The inequalities above are uniform convergence inequalities over the sets of all balls in {\mathbb{R}^d}. Although this is an infinite collection of sets it has a small VC dimension of {d+1}, and this lets us uniformly relate the true and empirical mass of each of these sets.

In the context of the CD estimator what this detour assures us is that the {k}-NN distance of every sample point is close to the “true” {k}-NN distance. This lets us show that for an appropriate radius we will not remove any point from a high-density cluster (because it will have a small {k}-NN distance) and that we will remove all points in the low-density region that separates salient clusters (because they will have large {k}-NN distance). Results like this let us establish consistency of the CD estimator.

References.

Chaudhuri, K. and Dasgupta, S. (2010). Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23, 343–351.

Hartigan, J. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76, 388-394.

WHAT IS BAYESIAN/FREQUENTIST INFERENCE?

When I started this blog, I said I wouldn’t write about the Bayes versus Frequentist thing. I thought that was old news.

But many things have changed my mind. Nate Silver’s book, various comments on my blog, comments on other blogs, Sharon McGrayne’s book, etc. have made it clear to me that there is still a lot of confusion about what Bayesian inference is and what Frequentist inference is.

I believe that many of the arguments about Bayes versus Frequentist are really about: what is the definition of Bayesian inference?

1. Some Obvious (and Not So Obvious) Statements

Before I go into detail, I’ll begin by making a series of statements.

Frequentist Inference is Great For Doing Frequentist Inference.
Bayesian Inference is Great For Doing Bayesian Inference.

Frequentist inference and Bayesian Inference are defined by their goals, not their methods.

A Frequentist analysis need not have good Bayesian properties.
A Bayesian analysis need not have good frequentist properties.

Bayesian Inference {\neq} Using Bayes Theorem

Bayes Theorem {\neq} Bayes Rule

Bayes Nets {\neq} Bayesian Inference

Frequentist Inference is not superior to Bayesian Inference.
Bayesian Inference is not superior to Frequentist Inference.
Hammers are not superior to Screwdrivers.

Confidence Intervals Do Not Represent Degrees of Belief.
Posterior Intervals Do Not (In General) Have Frequency Coverage Properties.

Saying That Confidence Intervals Do Not Represent Degrees of Belief Is Not a Criticism of Frequentist Inference.
Saying That Posterior Intervals Do Not Have Frequency Coverage Properties Is Not a Criticism of Bayesian Inference.

Some Scientists Misinterpret Confidence Intervals as Degrees of Belief.
They Also Misinterpret Bayesian Intervals as Confidence Intervals.

Mindless Frequentist Statistical Analysis is Harmful to Science.
Mindless Bayesian Statistical Analysis is Harmful to Science.

2. The Definition of Bayesian and Frequentist Inference

Here are my definitions. You may have different definitions. But I am confident that my definitions correspond to the traditional definitions used in statistics for decades.

But first, I should say that Bayesian and Frequentist inference are defined by their goals not their methods.

The Goal of Frequentist Inference: Construct procedures with frequency guarantees. (For example, confidence intervals.)

The Goal of Bayesian Inference: Quantify and manipulate your degrees of belief. In other words, Bayesian inference is the Analysis of Beliefs.

(I think I got the phrase, “Analysis of Beliefs” from Michael Goldstein.)

My point is that “using Bayes theorem” is neither necessary nor sufficient for defining Bayesian inference. A frequentist analysis could certainly include the use of Bayes’ theorem. And conversely, it is possible to do Bayesian inference without using Bayes’ theorem (as Michael Goldstein, for example, has shown). Let me summarize this point in a table:

Fairly soon I am going to post a review of Nate Silver’s new book. (Short review: great book. Buy it and read it.) As I will discuss in that review, Nate argues forcefully that Bayesian analysis is superior to Frequentist analysis. But then he spends most of the book assessing predictions by how good their frequency properties are. For example, he says that a weather forecaster is good if it rains 95 percent of the times he says there is a 95 percent chance of rain. In other words, he loves to use Bayes’ theorem but his goals are overtly frequentist. I’ll say more about this in my review of his book. I use it here as an example of how one can be a user of Bayes theorem and still have frequentist goals.

3. Coverage

An example of a frequency guarantee is coverage. Let {\theta = T(P)} be a function of a distribution {P}. Let {{\cal P}} be a set of distributions. Let {X_1,\ldots, X_n \sim P} be a sample from some {P\in {\cal P}}. Finally, let {C_n = C(X_1,\ldots,X_n)} be a set valued mapping. Then {C_n} has coverage {1-\alpha} if

\displaystyle  \inf_{P\in {\cal P}}P^n( T(P) \in C_n) \geq 1-\alpha

where {P^n} is the {n}-fold product measure defined by {P}.

We say that {C_n} is a {1-\alpha} confidence set if it has coverage {1-\alpha}. A Bayesian {1-\alpha} posterior set will not (in general) have coverage {1-\alpha}. This is not a criticism of Bayesian inference, although anytime I mention this point, some people seem to take it that way. Bayesian inference is about the Analysis of Beliefs; it makes no claims about coverage.

I think there would be much less disagreement and confusion if we used different symbols for frequency probabilities and degree-of-belief probabilities. For example, suppose we used {{\sf Fr}} for frequentist statements and {{\sf Bel}} for degree-of-belief statements. Then the fact that coverage and posterior probability are different would be written

\displaystyle  {\sf Fr}_\theta(\theta\in C_n) \neq {\sf Bel}(\theta \in C_n|X_1,\ldots,X_n).

Unfortunately, we use the same symbol {P} for both in which case the above statement becomes

\displaystyle  P_\theta(\theta\in C_n) \neq P(\theta \in C_n|X_1,\ldots,X_n)

which, I think, just makes things confusing.

Of course, there are cases where Bayes and Frequentist methods agree, or at least, agree approximately. But that should not lull us into ignoring the conceptual differences.

4. Examples

Here are a couple of simple examples.

Example 1. Let {X_1,\ldots, X_n \sim N(\theta,1)\equiv P_\theta} and suppose our prior is {\theta \sim N(0,1)}. Let {B_n} be the equi-tailed 95 percent Bayesian posterior interval. Here is a plot of the frequentist coverage {{\sf Cov}_\theta =P_\theta(\theta\in B_n)} as a function of {\theta}. Note that {{\sf Cov}_\theta} is the frequentist probability that the random interval {B_n} traps {\theta}. ({B_n} is random because it is a function of {X_1,\ldots, X_n}.) Also plotted is the coverage of the usual confidence interval {C_n=[\overline{X}_n - z_{\alpha/2}/\sqrt{n},\ \overline{X}_n + z_{\alpha/2}/\sqrt{n}]}. This is a constant function, equal to 0.95 for every {\theta}.

Of course, the coverage {{\sf Cov}_\theta} of {B_n} is sometimes higher than {1-\alpha} and sometimes lower. The overall coverage is {\inf_\theta {\sf Cov}_\theta =0} because {{\sf Cov}_\theta} tends to {0} as {|\theta|} increases. At the risk of being very repetitive, this is not meant as a criticism of Bayes. I am just trying to make the difference clear.
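The coverage in Example 1 can be computed in closed form, so here is a short sketch of that calculation in Python. It uses the standard conjugate result that the posterior is {N(n\overline{X}_n/(n+1),\ 1/(n+1))}, so that {B_n = n\overline{X}_n/(n+1) \pm z_{\alpha/2}/\sqrt{n+1}}. Since {\sqrt{n}(\overline{X}_n-\theta)\sim N(0,1)}, the event {\theta\in B_n} is an interval event for a standard Normal. The function name and the choice {n=10} are just for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Normal, mean 0 and variance 1


def posterior_interval_coverage(theta, n, alpha=0.05):
    """Frequentist coverage of the equi-tailed posterior interval B_n.

    theta is in B_n exactly when Z = sqrt(n) * (xbar - theta) lies in
    [(theta - z*sqrt(n+1)) / sqrt(n), (theta + z*sqrt(n+1)) / sqrt(n)].
    """
    z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(n + 1)
    upper = STD_NORMAL.cdf((theta + half_width) / sqrt(n))
    lower = STD_NORMAL.cdf((theta - half_width) / sqrt(n))
    return upper - lower


for theta in [0.0, 1.0, 2.0, 5.0, 10.0]:
    print(theta, round(posterior_interval_coverage(theta, n=10), 4))
```

The coverage is above 0.95 near {\theta=0}, where the prior is centered, and falls toward 0 as {|\theta|} grows.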

Example 2. A {1-\alpha} distribution free confidence interval {C_n} for the median {\theta} of a distribution {P} can be constructed as follows. (This is a standard construction that can be found in any text.) Let {Y_1,\ldots, Y_n \sim P}. Let

\displaystyle  Y_{(1)} \leq Y_{(2)} \leq \cdots \leq Y_{(n)}

denote the order statistics (the ordered values). Choose {k} such that {P(k < B < n-k)\geq 1-\alpha} where {B\sim {\rm Binomial}(n,1/2)}. The confidence interval is {C_n = [Y_{(k+1)},Y_{(n-k)}]}. It is easily shown that

\displaystyle  \inf_P P^n(\theta \in C_n) \geq 1-\alpha

where the infimum is over all distributions {P}. So {C_n} is a {1-\alpha} confidence interval. Here is a plot showing some simulations I did:

The plot shows the first 50 simulations. In the first simulation I picked some distribution {F_1}. Let {\theta_1} be the median of {F_1}. I generated {n=100} observations from {F_1} and then constructed the interval. The confidence interval is the first vertical line. The true value is the dot. For the second simulation, I chose a different distribution {F_2}. Then I generated the data and constructed the interval. I did this many times, each time using a different distribution with a different true median. The blue interval shows the one time that the confidence interval did not trap the median. I did this 10,000 times (only 50 are shown). The interval covered the true value 94.33% of the time. I wanted to show this plot because, when some texts show confidence interval simulations like this they use the same distribution for each trial. This is unnecessary and it gives the false impression that you need to repeat the same experiment in order to discuss coverage.
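Here is a sketch of how a simulation of this kind could be set up in Python. It is not the code behind the plot. The family of distributions (Normal or shifted Exponential with random location and scale), the number of trials, and all function names are assumptions chosen for illustration. The interval itself follows the construction in Example 2.

```python
import math
import random
from functools import lru_cache


def central_binomial_prob(n, k):
    """P(k < B < n - k) for B ~ Binomial(n, 1/2)."""
    return sum(math.comb(n, b) for b in range(k + 1, n - k)) / 2 ** n


@lru_cache(maxsize=None)
def choose_k(n, alpha=0.05):
    """Largest k with P(k < B < n - k) >= 1 - alpha (gives the shortest interval)."""
    if central_binomial_prob(n, 0) < 1 - alpha:
        raise ValueError("n is too small for this confidence level")
    k = 0
    while central_binomial_prob(n, k + 1) >= 1 - alpha:
        k += 1
    return k


def median_interval(sample, alpha=0.05):
    """Distribution free interval [Y_(k+1), Y_(n-k)] for the median."""
    y = sorted(sample)
    n = len(y)
    k = choose_k(n, alpha)
    return y[k], y[n - k - 1]  # 0-based indices for Y_(k+1) and Y_(n-k)


def one_trial(rng, n=100):
    """Pick a distribution at random, sample from it, check if its median is trapped."""
    loc = rng.uniform(-5, 5)
    scale = rng.uniform(0.5, 3)
    if rng.random() < 0.5:
        sample = [rng.gauss(loc, scale) for _ in range(n)]
        true_median = loc
    else:
        # Shifted Exponential with mean `scale`; its median is loc + scale * log(2).
        sample = [loc + rng.expovariate(1 / scale) for _ in range(n)]
        true_median = loc + scale * math.log(2)
    lo, hi = median_interval(sample)
    return lo <= true_median <= hi


rng = random.Random(0)
trials = 2000
coverage = sum(one_trial(rng) for _ in range(trials)) / trials
print(coverage)
```

Each trial uses a different distribution with a different true median, which is the point of the plot: coverage is a statement about the procedure, not about repeating one experiment.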

How would a Bayesian analyze this problem? The Bayesian analysis of this problem would start with a prior {\pi(P)} on the distribution {P}. This defines a posterior {\pi(P|Y_1,\ldots, Y_n)}. (But the posterior is not obtained via Bayes theorem! There is no dominating measure here. Nonetheless, there is still a well-defined posterior. But that’s a technical point we can discuss another day.) The posterior {\pi(P|Y_1,\ldots, Y_n)} induces a posterior {\pi(\theta|Y_1,\ldots, Y_n)} for the median. And from this we can get a 95 percent Bayesian interval {B_n} say, for the median. The interval {B_n}, of course, depends on the prior {\pi}. I’d love to include a numerical experiment to compare {B_n} and {C_n} but time does not permit. It will make a good homework exercise in a course.

5. Grey Area

There is much grey area between the two definitions I gave. I suspect, for example, that Andrew Gelman would deny being bound by either of the definitions I gave. That’s fine. But I still think it is useful to have clear, if somewhat narrow, definitions to begin with.

6. Identity Statistics

One thing that has harmed statistics (and harmed science) is identity statistics. By this I mean that some people identify themselves as “Bayesians” or “Frequentists.” Once you attach a label to yourself, you have painted yourself into a corner.

When I was a student, I took a seminar course from Art Dempster. He was the one who suggested to me that it was silly to describe a person as being Bayesian or Frequentist. Instead, he suggested that we describe a particular data analysis as being Bayesian or Frequentist. But we shouldn’t label a person that way.

I think Art’s advice was very wise.

7. Failures of Assumptions

I have had several people make comments like: “95 percent intervals don’t contain the true value 95 percent of the time.” Here is what I think they mean. When we construct a confidence interval {C_n} we inevitably need to make some assumptions. For example, we might assume that the data are iid. In practice, these assumptions might fail to hold in which case the confidence interval will not have its advertised coverage. This is true but I think this obscures the discussion.

Both Bayesian and Frequentist inference can fail to achieve their stated goals for a variety of reasons. Failures of assumptions are of great practical importance but they are not criticisms of the methods themselves.

Suppose you apply special relativity to predict the position of a satellite and your prediction is wrong because some of the assumptions you made don’t hold. That’s not a valid criticism of special relativity.

8. No True Value

Some people like to say that it is meaningless to discuss the “true value of a parameter.” No problem. We could conduct this entire conversation in terms of predicting observable random variables instead. This would not change my main points.

9. Conclusion

I’ll close by repeating what I wrote at the beginning: Frequentist inference is great for doing frequentist inference. Bayesian inference is great for doing Bayesian inference. They are both useful tools. The danger is confusing them.

10. Coming Soon On This Blog!

Future posts will include:

-A guest post by Ryan Tibshirani

-A guest post by Sivaraman Balakrishnan

-My review of Nate Silver’s book

-When Does the Bootstrap Work?

-Matrix-Fu, that deadly combination of Matrix Calculus and Kung-Fu.

anti xkcd

Some of you may have noticed that the recent installment of the always entertaining web comic, xkcd,
had a statistical theme with a decidedly anti-frequentist flavor: see here.

In the interest of balance, here is my
(admittedly crude) attempt at an xkcd style comic.
Right back at you Randall Munroe!

Betting and Elections

The winner of yesterday’s election was … statistics.

While bloviating pundits went on and on about how close the election was going to be, some people actually used statistics to forecast the outcome. Perhaps the most famous election quant is Nate Silver. (I’ll be writing more about Nate Silver in a few weeks when I write a post about his book, The Signal and the Noise. Despite my admiration for Silver, I think he is a bit confused about the difference between Bayesian and frequentist inference. He is a raving frequentist, parading as a Bayesian, as I’ll explain in a few weeks.)

Silver uses all the available polls together with other background information, and then applies statistical methods to combine the information and make predictions. Over at Simply Statistics, there is a nice plot, which I reproduce here:

This plot is from “Simply Statistics”

The plot shows the voting percentage versus Silver’s prediction. Pretty impressive. Of course, Silver wasn’t the only one using statistical methods to make election predictions. See the Washington Post for some more.

Silver caused some controversy when he responded to criticisms of his predictions by offering to bet. Margaret Sullivan, the New York Times ombudsman (or, ombudswoman, I guess) criticized Silver for offering the bet. But as Alex Tabarrok argued, offering to bet is a good idea. As Tabarrok puts it:

“A Bet is a Tax on Bullshit”

(This is one of my favorite quotes of the year.) Tabarrok goes on to say: “In fact, the NYTimes should require that Silver, and other pundits, bet their beliefs.”

I agree with this. Imagine if every pundit had to bet part of their salary every time they made a prediction.

I would go a step further and say that every politician should have to put their money where their mouth is. After all, most public policy consists of bets made with other people’s money. If the president thinks that investing in Solyndra is a good bet, then he should have to put up some of his own money.

Betting is a great test of one’s beliefs. I applaud Nate Silver for standing behind his predictions with the offer of a bet. We need more of that.

Edit: My colleague Andrew Thomas has a nice post about this: see
here
