## anti xkcd

Some of you may have noticed that the recent installment of the always entertaining web comic, xkcd, had a statistical theme with a decidedly anti-frequentist flavor: see here.

In the interest of balance, here is my (admittedly crude) attempt at an xkcd style comic.

Right back at you Randall Munroe!

## 43 Comments

That’s pretty good actually! Very much in the XKCD style ;-)

But if the “experiments” are independent and you conduct N of them, so that N Bayesian 95% “conf. intervals” are created, the probability that none of them contains the “true value” is (1-0.95)^N = 0.05^N, which tends to zero as N->inf.

Isn’t it so? :-)

No! That’s the point. Bayesian intervals have no coverage guarantees.

A Bayesian 95 percent interval does not cover with frequency probability .95.

Oh, you mean that in the Bayesian settings the posteriors are always conditioned on the data so that there are no frequency-based guarantees (irrespective of the sample)? thanks, Stelios

I think in a sense they have coverage guarantees: a 95% credibility interval means, by definition, that the true value belongs to this set with 95% probability. Or rather, when the true value somehow realizes, it will come from this set with 95% probability. This probability is based on your best knowledge. The fact that you expect the true value to come out from somewhere else reveals that you do not believe in your analysis yourself: either you believe that the data are biased (in which case you have a misspecified sampling model), or you believe that the true value should, somehow, be something else (in which case you have a wrong prior). I.e. if you don’t trust your interval, that shows that you have done something wrong; of course I acknowledge that it may be very difficult or indeed impossible to do everything right.

I will do a whole post on this since, from the comments, it seems there is some confusion about this issue. Will do it in a few days.

LW

Larry’s right. A 95% credible interval does not necessarily have 95% coverage. It could have a lot more, or it could have a lot less. In the case of the Berger-Mossman article, since Mossman was planning to publish in a journal and field where coverage is considered important, they designed their objective Bayes procedure so that it would have approximately 95% coverage. But that was by design; it doesn’t have to be that way.
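The point that a credible interval can cover much more or much less than 95% can be checked exactly in a toy case. The following is a minimal sketch, not anything taken from the thread: it assumes a single normal measurement with known standard error `se` and a normal prior on the mean, and the function name and default values are made up for illustration. In that model the frequentist coverage of the central 95% credible interval has a closed form that depends on how far the fixed true value sits from the prior mean.

```python
from math import sqrt
from statistics import NormalDist


def credible_interval_coverage(theta, prior_mean=0.0, prior_sd=1.0, se=1.0, level=0.95):
    """Frequentist coverage of a normal-normal credible interval at a fixed true theta.

    Assumed model (illustrative only): estimate ~ N(theta, se^2), prior N(prior_mean, prior_sd^2).
    """
    std = NormalDist()
    z = std.inv_cdf(0.5 + level / 2)
    # Posterior mean is w * estimate + (1 - w) * prior_mean; posterior sd is sqrt(w) * se.
    w = prior_sd**2 / (prior_sd**2 + se**2)
    half_width = z * sqrt(w) * se
    # Shrinkage pulls the interval centre away from theta by this amount on average.
    shift = (1 - w) * (theta - prior_mean)
    # The interval covers theta when the standardised error Z lies in [lo, hi].
    lo = (shift - half_width) / (w * se)
    hi = (shift + half_width) / (w * se)
    return std.cdf(hi) - std.cdf(lo)


for theta in (0.0, 2.0, 5.0):
    print(theta, round(credible_interval_coverage(theta), 3))
```

When the true value equals the prior mean the coverage exceeds 95%; when it is several prior standard deviations away the coverage collapses. With a very diffuse prior (`prior_sd` large) the interval approaches the usual confidence interval and the coverage approaches 95%.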

Eh, I cannot really vocalize it, but I find myself preferring the xkcd strip…

Frequentist particle mass 95% CIs don’t contain the true mass 95% of the time either: http://hanson.gmu.edu/temp/Henrion-Fischhoff-AmJPhysics-86.pdf

then it’s not a confidence interval

that’s the definition

Thank you for demonstrating my point.

Excellent cartoon! And I notice that it actually relates to Bayesian methods supported by at least some/many subjectivists. Still, while a necessary condition, I would deny that good long-run coverage is all that is required for a satisfactory frequentist inference (I know you agree).

Thanks. Yes I do indeed agree.

Well, according to de Finetti there is no such thing as a “true parameter”. So de Finetti would ask how the first character could ever find out that the true parameter is elsewhere… (and why he’d pay money to a Bayesian if he has a method to find the truth.)

That’s true. But DeFinetti is a bit too “post-modern” for me.

Call me old fashioned, but I think truth exists.

De Finetti would agree that truth exists when it comes to observables.

ok

for DeFinetti’s sake, replace “confidence interval” with “prediction interval” in the comic.

I’ve been trying to figure out what the fallacy is in the xkcd comic, aside maybe from only giving the frequentist one data point. The only thing I can think is that it’s a poorly designed experiment. I mean, frequentists use conditional/hierarchical models and have no problems with inference in light of false positives/false negatives, or empirical population proportions, etc. But I guess part of the point is that you have to inject some form of repeatability before limiting relative frequency makes sense.

The fallacy is that the null hypothesis tested was not “the sun has not gone nova” but rather “the sun has not gone nova and two dice have not come up sixes.” So the correct conclusion after rejecting this null hypothesis is “the sun has gone nova or two dice have come up sixes”, not “the sun has gone nova.”

Paul, that’s not right. One can test the hypothesis that the sun has gone nova this way if one likes, but the test has no power beyond the Type I error rate, which makes it a crappy basis for inference. By any common-sense measure, this approach is of course very very bad, but by the usual (frequentist) standards that introductory texts actually explicitly set for tests (i.e. do they control the Type I error rate) it passes. So, the only fallacious thinking is that control of the Type I error rate, alone, is sufficient to make for an adequate statistical procedure.

NB the same ‘test’ could be used in situations where repeatability is possible, e.g. in a clinical trial of a new drug.

JohnDoe — I think we are mostly in agreement. (?)

JohnDoe, I’m not sure what you say about the power is right either! Suppose that getting a “Yes” from the machine is the criterion for rejecting “H0: the sun has not gone nova”. If H0 is true then H0 is incorrectly rejected with probability 1/36 (the size of the test). On the other hand, if the alternative hypothesis “H1: the sun has gone nova” is true then H0 is rejected with probability 35/36. i.e. the frequentist test is powerful and on these grounds seems quite reasonable.

@Paul; your hypothesis includes statements about dice. This is not what the cartoon is illustrating.

@Kaplan; the test only ever rejects H0 with probability 1/36. Its power is therefore 1/36, not 35/36.

@JohnDoe: that’s not so. The cartoon says “Then, it rolls two dice. If they both come up six then it lies to us. Otherwise it tells us truth”. So, if the sun has indeed gone nova then it correctly rejects H0 w.p. 35/36.

Had the cartoon said “Then, it rolls two dice. If they both come up six then the machine says ‘YES’, otherwise it says ‘NO’ ” then I would agree with you. But, as worded in the cartoon, since the machine tells the truth w.p. 35/36, a test that rejects H0 when the machine says ‘YES’ has size 1/36 and power 35/36 – quite reasonable?!
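The two readings of the machine discussed above give different size and power, and the arithmetic is short enough to write out. This is a small sketch with hypothetical function names; the only inputs are the two-dice rule quoted from the cartoon and the alternative wording proposed in the comment.

```python
from fractions import Fraction

# Probability that two fair dice both come up six.
P_DOUBLE_SIX = Fraction(1, 6) * Fraction(1, 6)


def prob_yes_lying_machine(sun_exploded: bool) -> Fraction:
    """Cartoon wording: the machine lies on a double six, otherwise tells the truth."""
    return 1 - P_DOUBLE_SIX if sun_exploded else P_DOUBLE_SIX


def prob_yes_dice_only_machine(sun_exploded: bool) -> Fraction:
    """Alternative wording: the machine says YES on a double six, whatever the sun did."""
    return P_DOUBLE_SIX


# The test rejects H0 ("the sun has not gone nova") when the machine says YES.
for name, prob_yes in [("lying machine", prob_yes_lying_machine),
                       ("dice-only machine", prob_yes_dice_only_machine)]:
    size = prob_yes(False)   # P(reject H0 | H0 true)
    power = prob_yes(True)   # P(reject H0 | H1 true)
    print(f"{name}: size = {size}, power = {power}")
```

Under the cartoon's wording the test has size 1/36 and power 35/36; under the alternative wording the power equals the size, which is the situation JohnDoe had in mind.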

@Kaplan. Sorry, I stand corrected.

It’s helpful to move this away from philosophy and make a technical claim. Suppose you use the usual model y=m+error for the mass of neutrino and you collect some data. When you compute the 95% CI with this data one of two things will happen: the interval will contain the true value or it won’t. Consider both cases separately,

95% CI CONTAINS THE TRUE VALUE: using the same data the Bayesian will get exactly the same result with a uniform prior on the whole real line. On the other hand, if the Bayesian uses a uniform prior over an interval [a,b] and the true mass lies in this interval, then the new 95% Bayesian interval will be even smaller than the 95% CI and still contain the true value!

Moreover, it’s not difficult to find such an [a,b] in real life. For example, you could have used [0, "mass of a hydrogen atom"] since it is probably known before any direct measurement that the mass of the neutrino lies in this interval.

95% CI DOESN’T CONTAIN THE TRUE VALUE: In this case the measurement errors are such that they’re misleading and pull the interval away from the true value. When the Bayesian repeats the calculation using a uniform prior on [a,b] they will find that some of the ‘truth’ in this prior will correct some of the ‘falsity’ in the error model and drag the interval back closer to the true value. In some cases this effect will be big enough to drag the 95% Bayesian back onto the true value.

In either case including true prior information gets us an interval that demonstrably brings us closer to the truth about the mass of the neutrino. So if your goal in life is to be wrong 5% of the time, then 95% CIs are the way to go. If you simply want to know the mass of the neutrino then do the Bayesian calculation.
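The effect of a uniform prior on [a,b] can be seen in a small numerical sketch. It assumes a single normal measurement with known standard error, so that the posterior under a flat prior on [a,b] is the same normal curve truncated to [a,b]; the function names and the numbers in the example are hypothetical, and the interval computed is the central (equal-tailed) one, which is only one of several possible choices.

```python
from statistics import NormalDist


def flat_prior_interval(estimate, se, level=0.95):
    """Interval under a flat prior on the whole real line (same as the usual CI here)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se


def bounded_prior_interval(estimate, se, a, b, level=0.95):
    """Central credible interval under a uniform prior on [a, b].

    Assumes the likelihood puts non-negligible mass on [a, b]; otherwise the
    quantile inversion below can fail numerically.
    """
    untruncated = NormalDist(estimate, se)
    f_a, f_b = untruncated.cdf(a), untruncated.cdf(b)
    tail = (1 - level) / 2
    # Quantiles of the truncated normal, obtained by rescaling the untruncated CDF.
    lo = untruncated.inv_cdf(f_a + tail * (f_b - f_a))
    hi = untruncated.inv_cdf(f_a + (1 - tail) * (f_b - f_a))
    return lo, hi


# Hypothetical numbers: estimate 0.5, standard error 1.0, prior bounds [0, 3].
print(flat_prior_interval(0.5, 1.0))
print(bounded_prior_interval(0.5, 1.0, a=0.0, b=3.0))
```

In this example the bounded-prior interval stays inside [a,b] and is shorter than the flat-prior one. Whether it also contains the true value depends on where the truth lies, which is the case-by-case argument made in the comment.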

Incidentally, I might add that for a 95% CI to have actual 95% coverage you’d need to know something like “the errors over a very long sequence of trials have a histogram that looks like the assumed probability distribution for the errors”.

This assumption fails in real physical experiments quite a bit which is why the 95% CI’s don’t actually have 95% coverage very often.

On the other hand, every point I make in the above comment still holds true regardless of what the long range histogram of errors looks like.

The experiments to measure neutrino mass actually measure mass-squared. And one interesting thing so far is that the 95% confidence intervals extend to negative values.

Then it is a poorly constructed interval likely based on asymptotics that have boundary issues? Are the intervals driving the theory at that point or is it the other way around?

I don’t understand the problem here. Sure, if you just choose a prior out of the blue (pull it out of your proverbial ass, or even construct it based on prior evidence) there’s no reason to expect anything about the frequentist coverage.

But some objective Bayesian procedures produce credible intervals that have good coverage properties, see for example the paper by Berger and Mossman:

http://www.stat.duke.edu/~berger/papers/mossman.pdf

So it is misleading to imply that Bayesian procedures generally have bad coverage (as the cartoon implies). It all depends on how they are conducted, ISTM.

If the method doesn’t take into account prior information, why then use a Bayesian method? (Bearing in mind that Bayesian methods are usually computationally more expensive than frequentist ones.)

Read the paper. The objective Bayesian method actually outperformed some of the (commonly used) frequentist methods that were investigated.

Also, in this example there is nothing particularly “computationally expensive”. I assign this to my students. A program in the computationally inefficient language R will get Berger and Mossman’s results in a few seconds on my laptop.

Bill

It was really just meant as a joke

But yes there are cases where Bayes intervals also have frequentist coverage

But not always

And it certainly is not the goal of Bayes

Larry

Bayes has only one goal: to reason consistently under uncertainty. When it comes time to make a prediction, if the loss function concerns frequency properties, then Bayes will make predictions with good frequency properties.

This defines coverage probability very well. But many people do not understand the fact that the 95% confidence does not apply to a single result after the data has been gathered. And even if they are aware of this ‘in theory’, they still have dangerous intuitions. Could you do a blog to explain that fully? I like to consider the following:

Many experiments will be performed on each of the fundamental particles. Estimators are derived which have 95% coverage. Before each experiment is performed, we know that the probability is at least 95% that the interval will contain the true value. (Note the future tense here). These probabilities are independent of each other; in other words, the incorrect results will be spread uniformly throughout the experiments. If you ‘condition’ on anything before the experiment (such as the day of the week on which it is performed), then this independent-and-identically-distributed 95% probability still holds.

*But*, after the experiments are done and you condition on the results, these probabilities break down entirely. We are now using the past tense. I think many people don’t understand how badly they break down – we cannot select an arbitrary experiment’s interval and say the probability is 95% that it was correct. (Also, many people don’t appreciate that this breakdown of probabilities is *not* fatal to the usefulness of confidence intervals.)

Imagine drawing a scatterplot as follows: For each experiment on the fundamental particles we draw a point. The x-axis is the lower bound of the interval and the y-axis is the upper bound. We are able to spend large sums of money to get almost-perfect estimates of the true masses and we can then go back to the original scatterplot and colour the dots red or green depending on whether the interval did contain the true value.

Imagine looking at the scatterplot of red and green points. Will the red and green dots be uniformly mixed up with each other? *No*. There may well be patterns and clumps in the colours of the dots. Therefore, nobody is entitled to pick one dot from the scatterplot (after the initial experiments, but before the huge expenditure that establishes the true value) and say “The probability is 95% that further experiments will confirm this interval”. To make such a claim is equivalent to saying that, conditioning on the data, the incorrect results are spread uniformly throughout the observed intervals.

In such circumstances, a subject matter expert could look at each of the intervals and make comments on them – “I don’t believe the muon mass is that large and I expect the expensive follow up experiments to confirm that”. The frequentist couldn’t really agree with that comment – but the frequentist could not disagree with that comment either. The frequentist can make no claim on how the incorrect results are distributed – *in particular*, they cannot throw their hands up in the air and say they are uniformly distributed.

As a particular case, let’s take one particle, the muon, and repeat the experiment many times. For each experiment, plot the CI in the form of a scatterplot as described above. Then, we spend lots of money and get the true mass of the muon and colour the dots accordingly. Clearly there will be a pattern in the colours: the green dots will be clumped together and the red dots will surround them.
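The repeated-muon case can be simulated in a few lines. This is only a sketch under assumed conditions: normal measurement errors with a known standard error, a made-up true mass of 100.0 in arbitrary units, and hypothetical function names. Each simulated experiment yields one dot (lower bound, upper bound) and a colour flag saying whether the interval contains the true value.

```python
import random
from statistics import NormalDist


def simulate_intervals(true_mass=100.0, se=1.0, n_experiments=20000, level=0.95, seed=0):
    """Return (lower, upper, covers_truth) for each simulated experiment."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    dots = []
    for _ in range(n_experiments):
        estimate = rng.gauss(true_mass, se)  # assumed normal measurement error
        lower, upper = estimate - z * se, estimate + z * se
        dots.append((lower, upper, lower <= true_mass <= upper))
    return dots


def coverage(sample):
    """Fraction of green dots (intervals containing the true value) in a sample."""
    return sum(hit for _, _, hit in sample) / len(sample)


dots = simulate_intervals()
by_lower = sorted(dots)                # tuples sort by lower bound first
k = len(by_lower) // 50                # the 2% of dots with the highest lower bounds
quarter = len(by_lower) // 4
top_slice = by_lower[-k:]
middle_half = by_lower[quarter:-quarter]

print("all dots:     ", coverage(dots))
print("middle half:  ", coverage(middle_half))
print("highest 2%:   ", coverage(top_slice))
```

Averaged over all dots the coverage is close to 95%, but it is not spread evenly: the dots in the middle of the plot are all green and the dots at the edge are all red. Before the true mass is known, of course, nobody can tell which region of the plot is the middle.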

Is this fair? Is it useful? (And no, I’m not comparing it to the Bayesian approach. I’m just explaining frequentist CIs.)

You raise good points.

I have decided I will indeed do a long post on coverage, what frequentist results mean, what Bayesian results mean etc.

Hopefully this weekend

Thanks for the comment.

Larry

A lack of understanding of coverage seems to not be that unusual.

What if he used a matching prior? O_o

Interesting to see them in the class today :)

Larry,

You said that you believe that truths exist. What sort of truths?

When we observe nature (we are included in nature), there are at least two BIG jumps:

1. electrical signals are transformed into images;

2. images are transformed into sounds (such sounds are thereafter transformed into words).

In this process, we create words to represent “things” that are just images created by electrical signals. I am very skeptical that a specific word is THE thing itself. A word represents some features of nature that are interpreted by us, human beings. Such words do not necessarily have to be related to “real things”; they just have to increase the chance to perpetuate our species, or at least they must not decrease it.

However, when it comes to statistical modelling…

Well, I think that hydrogen atoms exist and they have a definite mass.

I think the Hubble constant has a definite value.

I think that the random variable “it will rain tomorrow” will have a definite known value tomorrow.

## One Trackback

[...] then showed a cartoon that Larry Wasserman put on his blog. Larry’s point here is that (under most circumstances) [...]