The recent announcement of the discovery of the Higgs boson brought the inevitable and predictable complaints from statisticians about p-values.

Now, before I proceed, let me say that I agree that p-values are often misunderstood and misused. Nevertheless, I feel compelled to defend our physics friends, and even the journalists, from the p-value police.

The complaints come from frequentists and Bayesians alike. And many of the criticisms are right. Nevertheless, I think we should cease and desist from our p-value complaints.

Some useful commentary on the reporting can be found here, here and here. For a look at the results, look here.

Here is a list of some of the complaints I have seen together with my comments.

- The most common complaint is that physicists and journalists explain the meaning of a p-value incorrectly. For example, if the p-value is 0.000001 then we will see statements like “there is a 99.9999% confidence that the signal is real.” We then feel compelled to correct the statement: if there is no effect, then the chance of something as or more extreme is 0.000001.
Fair enough. But does it really matter? The big picture is: the evidence for the effect is overwhelming. Does it really matter if the wording is a bit misleading? I think we reinforce our image as pedants if we complain about this.

- The second complaint comes from the Bayesian community: that we should be reporting a posterior probability rather than a p-value. Like it or not, frequentist statistics is, for the most part, the accepted way of doing statistics for particle physics. If we go the Bayesian route, what priors will they use? They could report lots of answers corresponding to many priors. But Forbes magazine reports that it cost about 13 billion dollars to find the Higgs. For that price, we deserve a definite answer.
- A related complaint is that people naturally interpret p-values as posterior probabilities so we should use posterior probabilities. But that argument falls apart because we can easily make the reverse argument. Teach someone Bayesian methods and then ask them the following question: how often does your 95 percent Bayesian interval contain the true value? Inevitably they say: 95 percent. The problem is not that people interpret frequentist statements in a Bayesian way. The problem is that they don’t distinguish them. In other words: people naturally interpret frequentist statements in a Bayesian way but they also naturally interpret Bayesian statements in a frequentist way.
- Another complaint I hear about p-values is that their use leads to too many false positives. In principle, if we only reject the null when the p-value is small we should not see many false positives. Yet there is evidence that most findings are false. The reasons are clear: non-significant studies don’t get published and many studies have hidden biases.
But the problem here is not with p-values. It is with their misuse. Lots of people drive poorly (sometimes with deadly consequences) but we don’t respond by getting rid of cars.
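The false-positive point in the last complaint can be made concrete with a little arithmetic. The sketch below assumes that only significant results get published and asks what share of them are false. The function name and the numbers (the share of tested effects that are real, the power, and alpha) are illustrative assumptions, not figures from any study.

```python
def false_discovery_share(prior_true: float, power: float, alpha: float) -> float:
    """Share of significant results that are false positives.

    prior_true: assumed fraction of tested hypotheses where the effect is real
    power:      assumed probability of a significant result when the effect is real
    alpha:      significance threshold (probability of a significant result under the null)
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


if __name__ == "__main__":
    # Hypothetical scenarios: the rarer real effects are, the larger the false share.
    for prior_true in (0.5, 0.1, 0.01):
        share = false_discovery_share(prior_true, power=0.5, alpha=0.05)
        print(f"real effects among tested hypotheses: {prior_true}, "
              f"false share of significant results: {share:.3f}")
```

The threshold alpha is the same in every scenario. What changes is how many true effects there were to find and which results survive to publication, which is the sense in which the problem lies in how p-values are used rather than in the p-value itself.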

My main message is: let’s refrain from nit-picking when physicists (or journalists) report a major discovery and then don’t describe the meaning of a p-value correctly.

Now, having said all that, let me add a big disclaimer. I don’t use p-values very often. No doubt they are overused. Indeed, it’s not just p-values that are overused. The whole enterprise of hypothesis testing is overused. But, there are times when it is just the right tool and the search for the Higgs is a perfect example.

—Larry Wasserman

## 39 Comments

I agree, I think p-values get a bad rap. http://simplystatistics.org/post/15402808730/p-values-and-hypothesis-testing-get-a-bad-rap-but-we

I like it a lot that you mention no. 3. As an addition, even the majority of the statisticians who spot that “how often” is not the right question for a Bayesian interval will think of the “true parameter” as a parameter that is true in some kind of frequentist/propensity sense, i.e., defining some kind of objective sampling distribution.

I like this also, and it follows nicely in your interesting and entertaining response to the Berger, Goldstein debate in Bayesian analysis (in play form).

Shown here: http://ba.stat.cmu.edu/journal/2006/vol01/issue03/wasserman.pdf

… but there is an obvious thing to take issue with. True parameters never become known. Experiments are carried out and results are recorded. If the question is formulated more precisely in terms of experimental outcomes, then only the Bayesian question remains meaningful.

For example, if Nostrand’s job was to define a series of intervals over N parameters, where c is the count observed of future point estimates found within these intervals, such that his expectation of c/N = 0.95, then this is of course a Bayesian question. Of course the experiment and the point estimation method would need to be defined in more detail…

I think that it is useful for Bayesians to try to construct credible intervals that have good frequentist coverage (i.e., that can be used as confidence intervals with prescribed characteristics).

It is even possible for an interval constructed on Bayesian principles to have better coverage than the usual frequentist constructions. See

http://bayesrules.net/courses/stat330.2010/berger.pdf

for an example.

Of course, this is rather removed from the p-value discussion.

Thanks for the reference Bill.

The paper that Bill refers to is by Louis Lyons.

A related paper by Louis that is open access is:

http://arxiv.org/abs/0811.1663

—LW

Thanks for the arxiv.org link, Larry.

BTW for the confused, the paper I referred to is the one noted in Comment #9 (below), not my comment to Comment #2 (above)

The p-value police is exactly what we’re seeing: Note the letter from the ISBA: http://errorstatistics.com/2012/07/11/is-particle-physics-bad-science/

The phrase that triggers the most fallacies comes from a slippery slide from “there’s a 5% probability of so extreme a result occurring due to chance” to there’s a probability of .95 it is non-chance—or the like. I’ve seen it even in sophisticated articles on the Higgs, and it’s easy to see what’s happening here.

I only want to comment on (1).

If we are going to twist the truth to make it easier for the public to grasp, why do we have numbers at all? Let’s just say experiment says Higgs boson is there and not talk about percentages.

If we just say the data we have is unlikely, if the Higgs boson wasn’t there, then the conclusion sounds more sketchy. But we could clarify this with the fact that that’s the reason why there were multiple experiments at CERN, and the reason we are looking for it is because the very successful Standard Model says so. This also tells the public how science is done: checking the consistency of results, verifying successful theories every way that we can.

I am quite supportive of this idea. A p-value quantifies the comment, “the data we have is [very] unlikely, if the Higgs boson weren’t there.” Furthermore, it is important that there were multiple, independent experiments.

I understand that the usual “twisting the truth” statements are often attempts to explain the quantitative meaning of the p-value to the lay public; but they are very misleading, and usually just confusing. Accurate explanations are very hard for the lay public to understand, as they would have to include careful discussions of the assumptions on which the p-values are built.

It isn’t even the case that the Higgs experiments were computing the Type I error rate as it is usually defined. This assumes a predetermined and fixed number of samples, which isn’t the case here. As far as I know, there was no attempt to define rules for data-peeking that would allow correction of the alpha-level. See:

http://www.science20.com/quantum_diaries_survivor/blog/keeplooking_bias

It is written by a CERN physicist.

Sorry for getting your name wrong in comment #5. I should have checked before posting.

No problem, Corey. It’s a natural mistake. At meetings I have occasionally been asked if I am responsible for the Jeffreys prior 🙂

Jefferys: But the article to which you link does talk about their deliberately taking account of the “Keep-Looking Bias”, which he associates with a related selection effect, “sampling to a foregone conclusion”. These adjustments are part of valid statistical significance tests, thus making them Bayesian incoherent. It is part of the error statistical reasoning to determine, post data, the n samples that have actually been observed, where this may demand adjustment for selection effects. Neyman certainly did that in practice, and observed p-values reported.

On names, Richard Jeffrey used to tell me the story of a certain statistician who was anxious to meet him, then discovering he wasn’t Jeffreys, saying something like, he’d mistakenly thought he was someone important.

I have a concern about the reported p-value besides the ones discussed above.

The philosophy Mayo propounds in her book “Error and the Growth of Experimental Knowledge” is the only one I’ve ever seen that stands a chance of justifying frequentist practice. She points out that one key difference between Bayesian and frequentist interpretation of data is that the calculation of correct frequentist error probabilities depends critically on getting the sample space right. Her example is optional stopping, i.e., sampling until a certain nominal p-value threshold has been met. This sampling plan doesn’t affect the Bayesian posterior distribution because it leaves the likelihood unchanged up to proportionality. It vastly changes the sample space, though: the type I error rate is much larger than the nominal p-value threshold.

The usual p-value calculation is valid for a sample size fixed in advance. I don’t know for a fact that the reported p-value was calculated under this assumption, but it seems reasonable to suppose so. I do know that the physicists announce results once they reach the 5-sigma level, and this is exactly the optional stopping sampling plan! It also seems plausible that the experimenters were monitoring the data during collection and might have done things differently had the data not appeared “right”. So the reported p-value could very well be non-uniform under the null. From a frequentist point of view, the key missing info needed to help interpret the p-value is a report of how the observed sample size compares to the distribution of the sample size under the null and alternative hypotheses. (Now that I’m done writing this, I see that Bill Jeffreys beat me to the punch.)
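The optional-stopping effect described here is easy to see in a toy simulation. The sketch below is only an illustration under stated assumptions: normal data with known unit variance, a two-sided z-test, and a hypothetical analyst who tests after every observation from the tenth onward. It is not a model of the actual Higgs analysis, and all function names and defaults are made up for the example.

```python
import math
import random


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def rejects_fixed_n(rng: random.Random, n: int, alpha: float) -> bool:
    """Draw n null observations and test once at the end."""
    total = sum(rng.gauss(0.0, 1.0) for _ in range(n))
    return two_sided_p(total / math.sqrt(n)) < alpha


def rejects_with_peeking(rng: random.Random, max_n: int, alpha: float,
                         first_look: int = 10) -> bool:
    """Test after every observation and stop at the first nominal rejection."""
    total = 0.0
    for i in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        if i >= first_look and two_sided_p(total / math.sqrt(i)) < alpha:
            return True
    return False


def type_one_error_rates(trials: int = 2000, max_n: int = 200,
                         alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Estimated rejection rates under the null: (fixed sample size, optional stopping)."""
    rng = random.Random(seed)
    fixed = sum(rejects_fixed_n(rng, max_n, alpha) for _ in range(trials)) / trials
    peeking = sum(rejects_with_peeking(rng, max_n, alpha) for _ in range(trials)) / trials
    return fixed, peeking


if __name__ == "__main__":
    fixed_rate, peeking_rate = type_one_error_rates()
    print(f"fixed n: {fixed_rate:.3f}   optional stopping: {peeking_rate:.3f}")
```

With the sample size fixed in advance, the rejection rate should sit near alpha. With stopping at the first nominally significant look, it comes out several times larger, even though every simulated data set was generated under the null.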

As a Bayesian, I’m happy to take the given p-value as a statement about the posterior tail area. On its own it’s not very informative; the graphs, on the other hand, are informative enough that I’m willing to affirm that for any reasonable prior, the posterior probability that a new Higgs-like particle has been detected is high.

I feel another thing that the “nitpicking” ignores is that in any given community of scientists, you need some conventions and standards that are mutually intelligible and acceptable to the people who are working in that discipline. There is a standard of “5 sigma” (which corresponds to some p-value) that is accepted by that group as a good standard of evidence (apparently, I’m not a particle physicist). This is a big deal surely. There’s nothing wrong with having other methods that you prefer, and arguing for them, but you can’t ignore the value that having an agreement has for communication and understanding.
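For readers wondering which p-value a given number of sigma corresponds to, here is a minimal sketch. It assumes the one-sided convention (the upper tail of a standard normal), which is how I understand the particle physics usage. The function name is made up for the example.

```python
import math


def one_sided_p_from_sigma(n_sigma: float) -> float:
    """Upper-tail probability of a standard normal beyond n_sigma."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2.0))


if __name__ == "__main__":
    for n_sigma in (2.5, 3.0, 5.0):
        print(f"{n_sigma} sigma -> p = {one_sided_p_from_sigma(n_sigma):.3g}")
```

Under that convention, 5 sigma is a p-value of roughly 3 in 10 million, and the 2.5 sigma level mentioned elsewhere in this thread is a bit over 6 in 1,000.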

Reblogged this on OrdinarySquared and commented:

Interesting discussion on the statistical community’s response to the Higgs discovery and, in particular, the attack of the ‘p-value police’. I agree with the poster that although some of the criticisms are valid, fixation on them should not be allowed to overshadow what seems to be a hugely significant discovery.

“But, there are times when it is just the right tool and the search for the Higgs is a perfect example.”

Perhaps a naive question (but without any hidden agenda, honest!): what is your criterion for deciding when hypothesis testing is the right tool and when it isn’t? Or more concretely, why is this a “perfect” example?

I would like to suggest a criterion. If rejecting the null hypothesis is surprising and scientifically interesting NO MATTER HOW SMALL THE EFFECT, then null-hypothesis-testing via p-values is a valid approach (assuming it is applied correctly). Null-hypothesis-testing (and their associated p-values) are not appropriate when the magnitude of the effect matters. For almost all scientific questions, it is the size of the effect and not its mere existence that is important. For example, in my field (biology) the true population parameters will always be slightly different between populations. It is almost inconceivable that two natural populations would be identical (to 5 or ten decimal places) along a given dimension, and if our sample sizes were large enough, we could detect that difference at whatever p-value we liked. So we know in advance that the null hypothesis will be rejected; the only question is, will we have the resources to take a large enough sample to reject it at some pre-determined level? This is an utterly uninteresting question for science.
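The claim that a large enough sample will reject the null for any nonzero difference can be illustrated with the textbook sample-size approximation for a two-sample z-test. This is a sketch under simplifying assumptions: equal group sizes, a known common standard deviation, and a two-sided test. The function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group(delta: float, sigma: float = 1.0,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate sample size per group needed to detect a mean difference `delta`."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    z_power = std_normal.inv_cdf(power)              # quantile for the target power
    return math.ceil(2.0 * ((z_alpha + z_power) * sigma / delta) ** 2)


if __name__ == "__main__":
    # Shrinking the true difference tenfold multiplies the required sample by about 100.
    for delta in (0.5, 0.05, 0.005):
        print(f"difference of {delta} sd: about {n_per_group(delta)} per group")
```

The required sample grows like 1/delta squared but is always finite, so whether the null gets rejected is a question about resources rather than about the populations.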

Physics Today has an article this month that discusses how this experiment might have been handled in a Bayesian context.

http://www.physicstoday.org/resource/1/phtoad/v65/i7/p45_s1

(Probably has to be downloaded from an academic site, unless you are a member of an APS organization).

By coincidence, there is discussion of a 2.5 sigma result about the stop squark at Lubos Motl’s blog: http://motls.blogspot.com/2012/07/atlas-25-sigma-stop-squark-excess.html

—LW

Also, a good comment by Carlisle Rainey here:

http://blog.carlislerainey.com/2012/07/13/higgs-boson-and-p-values-a-response-to-wasserman/

Thanks for the link.

For the novice reader; what is the proper answer to “how often does your 95 percent Bayesian interval contain the true value?”

It can be anything from 0 to 100 percent

—LW

Between 0 and 100, depending on how “correct” is the prior ?

Well it’s not easy to give a short answer. Generally, it will be much less than 95 percent for “most” priors. In high-dimensional problems, it is near 0.

> Generally, it will be much less than 95% for “most” priors

With a uniform prior, there are some cases where the Bayesian interval is exactly the same as the confidence interval. One such example is the mean of normally distributed data, if I remember correctly.

You have to be very careful with this concept of ” ‘most’ priors “. I assume you mean the idea of selecting ‘randomly’ from all priors? I suspect that among the space of ‘priors that are actually used in published research’ (if one allows me to define a set), that the performance is much better than with ‘most priors’?

What I meant was: the coverage is less than 95 percent except for specially chosen priors. (Remember that coverage involves an infimum over the whole parameter space.)
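A small worked case shows what the infimum does. The sketch below is a toy under stated assumptions: a single observation X ~ N(theta, 1) and a N(0, prior_sd^2) prior on theta, so the posterior is normal and the coverage of the central credible interval can be computed exactly at each theta. The function name is made up for the example.

```python
import math
from statistics import NormalDist


def credible_interval_coverage(theta: float, prior_sd: float, level: float = 0.95) -> float:
    """Frequentist coverage, at a given theta, of the central posterior interval."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2.0)
    # Posterior is N(w * x, w) with shrinkage weight w = prior_sd^2 / (1 + prior_sd^2).
    w = prior_sd ** 2 / (1.0 + prior_sd ** 2)
    half_width = z * math.sqrt(w)
    # Writing X = theta + Z, the interval covers theta iff |w*Z - (1 - w)*theta| <= half_width.
    lower = ((1.0 - w) * theta - half_width) / w
    upper = ((1.0 - w) * theta + half_width) / w
    return std_normal.cdf(upper) - std_normal.cdf(lower)


if __name__ == "__main__":
    for theta in (0.0, 2.0, 5.0):
        print(f"theta = {theta}: coverage = {credible_interval_coverage(theta, prior_sd=1.0):.3f}")
```

With prior_sd = 1 the coverage is above 95 percent when theta sits at the prior mean and falls toward 0 as theta moves away from it, so the infimum over the parameter space is 0. Letting prior_sd grow very large approximates the flat prior mentioned above, and the coverage returns to 95 percent at every theta.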

Another well-known issue with p-values is that many (most?) scientists make the mistake of thinking that the t-test measures “how different the means are” when, at best, the p-value is just a measure of how certain one is that the means are different. (Although, it’s worth considering that the null hypothesis is made up of many assumptions, such as that the data is i.i.d. Normal. If the null is rejected, maybe those other parts of the null hypothesis should be questioned?)

I have heard senior academics say that the t-test is “the ‘standard way’ to estimate how different the means are”. Surely, the difference in the sample means is a better estimator; it’s unbiased and consistent, isn’t it? Why divide by the variance, if you are interested in estimating the difference in the means? This is the classic conflation of the magnitude of the effect with the certainty of the effect. And this mistake is quite prevalent within academia – it’s not just a case of careless journalists.

Anyway, regarding (1) and (4), it can be the case that data that looks weird under the null hypothesis also looks pretty weird under any alternative hypothesis. Examples of this are in the classic paper by Berger and Sellke http://dx.doi.org/10.2307/2289131. Sometimes data with a low p-value actually constitutes evidence in *favour* of the null. We have two statements:

A: the data has probability less than p under the null hypothesis.

B: the null is false.

and we often say “A or B” is true. While this statement is correct, we must ask ourselves why there is a tendency to discount A and to believe in B instead. Sometimes, unusual data is just unusual data. Follow up experiments often tell us that A was true, not B.

This sort of issue turns up a lot with the interpretation of many statistical claims. The statements are clearly true, but there is a tendency to overinterpret them. What justifies the discounting of A? Ideally, we want data that looks much weirder under the null than it does under some other hypothesis – this might be more convincing.

Dear Larry, I think you might have misrepresented or oversimplified the nature of Bayesian model comparison. In the case of a Higgs study, all one would need is the proportion between P(data|Theory_Without_Higgs_Particle) on the one hand, and P(data|Theory_With_A_Higgs_Particle) on the other. That would allow for the calculation of a Bayes Factor, and will show the relevant likelihood, which is much more informative than the frequentist P-value. No Prior of the hypotheses would be needed (only for the parameters of these hypotheses, but that is reasonable). Cheers, JP
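To make the ratio in this comment concrete, here is a deliberately oversimplified counting example. It assumes a single Poisson count with a known expected background and a known expected signal, so both hypotheses are simple and the Bayes factor is just a likelihood ratio. A real analysis would have unknown parameters under each hypothesis, and those are exactly where the parameter priors mentioned above come in. All names and numbers are hypothetical.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Poisson probability of observing k events when `mean` are expected."""
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def bayes_factor(observed: int, background: float, signal: float) -> float:
    """P(data | background + signal) / P(data | background only)."""
    return poisson_pmf(observed, background + signal) / poisson_pmf(observed, background)


if __name__ == "__main__":
    # Hypothetical numbers: 10 background events expected, 5 signal events expected.
    for observed in (10, 15, 20):
        print(f"observed {observed}: Bayes factor = {bayes_factor(observed, 10.0, 5.0):.3g}")
```

A value above 1 favours the signal hypothesis and a value below 1 favours background only.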

That is indeed how you compute the Bayes factor. It does not control the probability of a false positive (in the frequentist sense), of course.

But the frequentist p-value doesn’t do that either, right?

It does, in the sense that if you reject whenever p < alpha then the probability of a type I error is alpha.

No, it does not, but this is indeed a tricky one, and a common source of confusion. See Hubbard & Bayarri 2003 for a good explanation. You can download a PDF here: http://ftp.isds.duke.edu/WorkingPapers/03-26.pdf

Dear Larry,

This is an important topic, and I’d like to argue against some of what you’ve said:

1. “If we go the Bayesian route, what priors will they use?”

Reasonable priors give reasonable results, just as reasonable statistical models give reasonable results. If you find that two reasonable priors give you very different conclusions then isn’t it better to acknowledge this uncertainty, rather than hide it under the rug by reporting a p-value? There are default Bayesian hypothesis tests, but a sensitivity analysis is also useful.

2. “They could report lots of answers corresponding to many priors. But Forbes magazine reports that it cost about 13 billion dollars to find the Higgs. For that price, we deserve a definite answer.”

There might be a definite answer. Most rational people (and even some frequentists) would agree that the definite answer is a Bayes factor with a particular prior — when we could all agree on the prior, the rest is just probability theory. Now we can’t all agree on a specific prior perhaps, but that does not mean we have to abandon the correct framework and report something that only bears a fleeting semblance to the notion of Evidence. As has been conclusively demonstrated many times before (Berger & Delampady, 1987; Edwards, Lindman, & Savage, 1963; Sellke, Bayarri, & Berger, 2001; etc etc), p-values overestimate the evidence against the null. For 13 billion dollars, I prefer an uncertain but well-reasoned answer over an answer that is definite but misleading.

3. “A related complaint is the people naturally interpret p-values as posterior probabilities so we should use posterior probabilities. But that argument falls apart because we can easily make the reverse argument. Teach someone Bayesian methods and then ask them the following question: how often does your 95 percent Bayesian interval contain the true value? ”

The difference is this: what researchers *want* to know is the Bayesian result, not the frequentist result. If I’d ask a researcher, after she had completed an experiment, “how often does your 95 percent Bayesian interval contain the true value?”, the answer would be: “I don’t care, this is not what I want to know. I want to know how much support this particular data provide for my alternative hypothesis versus the null hypothesis.” The root of the problem is not that researchers are confused about statistics (they are, of course); the root of the problem is that researchers seek Bayesian answers, not frequentist ones. So there is a good reason why people misinterpret p-values as posterior probabilities: it is posterior probabilities they want, not p-values.

Cheers,

E.J. Wagenmakers

I disagree with this. I’ll write a longer response when I have more time. But I think researchers do want 95 percent intervals to contain the true value 95 percent of the time. They just don’t realize that Bayes procedures won’t do this.

Thought experiment: Tell them they can have an answer from envelope A, generated by a procedure that traps the true value 95 percent of the time, or envelope B, which uses Bayes but rarely traps the true value. Put that way, I think most scientists will choose envelope A.

In a few weeks I am going to write a review of Nate Silver’s book. He takes your point of view and argues strongly for the Bayesian approach. But then he spends most of the book saying how he wants predictions to be right; for example, 10 percent of the days you predict rain, it actually rains. In other words, he is enamored with the idea of Bayes but in the end, he really wants a frequency guarantee. I mention this because I think it is typical.

But, your comment deserves a more thorough response which I’ll try to do in the not so distant future.

Larry

I think it is actually the opposite: most researchers would actually prefer a Bayesian confidence interval (often called “credible interval”), giving them a 95% probability that the true value is in that interval, and many researchers therefore misinterpret the frequentist confidence interval as if it gives them that!

Yes they say they want that, but they don’t understand that it means giving up coverage. Ask them to choose between Envelope A or Envelope B so that they can actually see there is a difference and that they have to make a choice.

I work with some astronomers who often use Bayes but when they do a simulation and see that they don’t cover 95 percent of the time, they think there is something wrong. They say they want Bayes, until they really see they don’t get coverage. Give them a clear stark choice between Bayes with low coverage or a confidence interval with correct coverage and see what happens.

Hi Larry,

I think researchers want to learn something from the particular data set that they observed. What may happen for other, hypothetical sets of data that have not been observed is less relevant. In other words, it seems to me that inference needs to be conditional on the specific data you observed. Unconditional frequentist procedures may do well *on average*, but we do not observe average data — we observe a particular data set, and performance of unconditional procedures may be good on average but terrible for a particular data set. In recognition of this we might condition on recognizable subsets, ancillary statistics, etc., but problems remain (there are situations for which these do not exist). The Berger & Wolpert 1988 book on the Likelihood Principle lists many examples where frequentist inference is just plain silly (e.g., you can be 100% confident that a parameter lies in a 50% frequentist confidence interval). Smart frequentists can often patch things up, but only in an ad-hoc fashion.

So if I’m a shopkeeper, and I care about how I do on average, across a series of product tests, say, then _perhaps_ a frequentist procedure might be usefully applied. But researchers don’t want to use procedures because they do well on average, in repeated testing: they want to learn something about a very specific situation, about what their data set tells them about their hypotheses.

Also, I should stress that the Sellke et al. (2001) argument is not a Bayesian argument. They just show that the p-value is much less diagnostic than people think, because it neglects the fact that a particular p-value can also be rare under H1. Berger has an applet that makes the point, and it is a purely frequentist argument.
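As I recall it, the calibration proposed by Sellke et al. (2001) is the lower bound -e · p · ln(p) on the Bayes factor in favour of the null, valid for p < 1/e. The sketch below just evaluates that bound. The function name is made up for the example, and the paper itself should be consulted for the exact conditions under which the bound holds.

```python
import math


def min_bayes_factor_for_null(p: float) -> float:
    """Lower bound -e * p * ln(p) on the Bayes factor in favour of the null."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        bound = min_bayes_factor_for_null(p)
        print(f"p = {p}: Bayes factor for the null is at least {bound:.3f}, "
              f"so odds against the null are at most {1.0 / bound:.1f} to 1")
```

At p = 0.05 the bound is about 0.41, which means the data shift the odds against the null by a factor of at most about 2.5. That is far weaker evidence than "1 in 20" suggests to most readers.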

Cheers,

E.J.

Yes these are good points. I’ll post more on this. Thanks for the comments.

## 6 Trackbacks

[…] However, Prof. Larry Wasserman argues against my and others' "p-value policing," writing […]

[…] only hugely important thing to do in science (you’d never know from Silver’s book that the Higgs boson was discovered and confirmed using classical frequentist statistical procedures). So Silver here is doing the equivalent of criticizing a car because it can’t fly. Second, […]

[…] and the lack of Bayes in particle physics. Some reactions, with useful links, can be found here, here, here, and here. In this post, I’ll point out what particle physics has to lose by ignoring […]

[…] critics (or the p-value police, as Wasserman called them) maintain that Higgs researchers are misinterpreting their significance levels, correct […]

[…] as details of the statistical analysis trickled down to the media, the P-value police (Wasserman, see (2)) came out in full force to examine if reports by journalists and scientists could in any […]
