## Statistics and Dr. Strangelove

One of the biggest embarrassments in statistics is that we don’t really have confidence bands for nonparametric functions estimation. This is a fact that we tend to sweep under the rug.

Consider, for example, estimating a density ${p(x)}$ from a sample ${X_1,\ldots, X_n}$. The kernel estimator with kernel ${K}$ and bandwidth ${h}$ is

$\displaystyle \hat p(x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h}K\left( \frac{x-X_i}{h}\right).$

Let’s start with getting a confidence interval at a single point ${x}$.

Let ${\overline{p}(x) = \mathbb{E}(\hat p(x))}$ be the mean of the estimator and let ${s_n(x)}$ be the standard deviation. (A consistent estimator of ${s_n(x)}$ is ${\sqrt{a^2/n}}$ where ${a}$ is the sample standard deviation of ${U_1,\ldots, U_n}$ where ${U_i = h^{-1} K(x-X_i/h)}$.)

Now,

$\displaystyle \frac{\hat{p}(x) - p(x)}{s_n(x)} = \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} + \frac{\overline{p}(x)-p(x)}{s_n(x)} = \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} + \frac{b(x)}{s_n(x)}$

where ${b(x) = \overline{p}(x)-p(x)}$ is the bias. By the central limit theorem, the first term is approximately standard Normal,

$\displaystyle \frac{\hat{p}(x) - \overline{p}(x)}{s_n(x)} \approx N(0,1).$

This suggests the confidence interval, ${\hat p(x) \pm z_{\alpha/2} s_n(x)}$.

The problem is the second term ${\frac{b(x)}{s_n(x)}}$. Recall that the optimal bandwidth ${h}$ balances the bias and the variance. This means that if we use the best bandwidth, then ${\frac{b(x)}{s_n(x)}}$ behaves like a constant, that is, ${\frac{b(x)}{s_n(x)}\rightarrow c}$ for some (unknown) constant ${c}$. The result is that, even as ${n\rightarrow\infty}$, the confidence interval is centered at ${p(x) + c}$ instead of ${p(x)}$. As a result, the coverage will be less than the intended coverage ${1-\alpha}$.

We could in principle solve this problem by undersmoothing. If we choose a bandwidth smaller than the optimal bandwidth, then ${\frac{b(x)}{s_n(x)}\rightarrow 0}$ and the problem disappears. But this creates two new problems. First, the confidence interval is now centered around a non-optimal estimator. Second, and more importantly, we don’t have any practical, data-driven methods for choosing an undersmoothed bandwidth. There are some proposals in the theoretical literature but they are, frankly, not very practical

Of course, all of these problems are only worse when we consider simultaneous confidence band for the whole function.

The same problem arises if we use Bayesian inference. In short, the 95 percent Bayesian interval will not have 95 percent frequentist coverage. The reasons are the same as for the frequentist interval. (One can of course “declare” that coverage is not important and then there is no problem. But I am assuming that we do want correct frequentist coverage.)

The only good solution I know is to simply admit that we can’t create a confidence interval for ${p(x)}$. Instead, we simply interpret ${\hat p(x) \pm z_{\alpha/2} s_n(x)}$ as a confidence interval for ${\overline{p}(x)}$, which is a smoothed out (biased) version of ${p(x)}$:

$\displaystyle \overline{p}(x) = \int K(u) p(x+hu) \,du \approx p(x) + C h^2$

where ${C}$ is an unknown constant. In other words, we just live with the bias. (Of course, the bias gets smaller as the sample size gets larger and ${h}$ gets smaller.)

Once we take this point of view, we can easily get a confidence band using the bootstrap. We draw ${X_1^*,\ldots, X_n^*}$ from the empirical distribution ${P_n}$ and compute the density estimator ${\hat p^*}$. By repeating this many times, we can find the quantile ${z_\alpha}$ defined by

$\displaystyle P\Biggl( \sqrt{nh}||\hat p(x)^*-\hat p(x)||_\infty > z_\alpha \Biggm| X_1,\ldots, X_n\Biggr) = \alpha.$

The confidence band is then ${(\ell(x),u(x))}$ where

$\displaystyle \ell(x) = \hat p(x) - \frac{z_\alpha}{\sqrt{nh}},\ \ \ u(x) = \hat p(x) + \frac{z_\alpha}{\sqrt{nh}}.$

We then have that

$\displaystyle P\Biggl( \ell(x) \leq \overline{p}(x) \leq u(x)\ \ \ {\rm for\ all\ }x\Biggr)\rightarrow 1-\alpha$

as ${n\rightarrow \infty}$. Again, this is a confidence band for ${\overline{p}(x)}$ rather than for ${p(x)}$.

Summary: There really is no good, practical method for getting true confidence bands in nonparametric function estimation. The reason is that the bias does not disappear fast enough as the sample size increases. The solution, I believe, is to just acknowledge that the confidence bands have a slight bias. Then we can use the bootstrap (or other methods) to compute the band. Like Dr. Strangelove, we just learn to stop worrying and love the bias.

## The JSM, Minimaxity and the Language Police

The JSM, Minimaxity and the Language Police

I am back from the JSM. For those who don’t know, the JSM is the largest statistical meeting in the world. This year there were nearly 6,000 people.

Some people hate the JSM because it is too large. I love JSM. There is so much going on: lots of talks, receptions, tutorials and, of course, socializing. It’s impossible to walk 10 feet in any direction without bumping into another old friend.

1. Highlights

On Sunday I went to a session dedicated to the 70th birthday of my colleague Steve Fienberg. This was a great celebration of a good friend and remarkable statistician. Steve has more energy at 70 than I did when I was 20.

On a sadder note, I went to a memorial session for my late friend George Casella. Ed George, Bill Strawderman, Roger Berger, Jim Berger and Marty Wells talked about George’s many contributions and his huge impact on statistics. To borrow Ed’s catchphrase: what a good guy.

On Tuesday, I went to Judea Pearl’s medallion lecture, with discussions by Jamie Robins and Eric Tchetgen Tchetgen. Judea gave an unusual talk, mixing philosophy, metaphors (“eagles and snakes can’t build microscopes”) and math. Judea likes to argue that graphical models/structural equation models are the best way to view causation. Jamie and Eric argued that graphs can hide certain assumptions and that counterfactuals need to be used in addition to graphs.

Digression. There is another aspect of causation that deserves mention. It’s one thing to write down a general model to describe causation (using graphs or counterfactuals). It’s another matter altogether to construct estimators for the causal effects. Jamie Robins is well-known for his theory of “G-estimation”. At the risk of gross oversimplification, this is an approach to sequential causal modeling that eventually leads to semiparametric efficient estimators. The construction of these estimators is highly non-trivial. You can’t get them by just writing down a density for the graph and estimating the distribution. Nor can you just throw down some parametric models for the densities in the factorization implied by the graph. If you do, you will end up with a model in which there is no parameter that equals 0 when there is no causal effect. I call this the “null paradox.” And Jamie’s estimators are useful. Indeed, on Monday, there was a session called “Causal Inference in Observational Studies with Time-Varying Treatments.” All the talks were concerned with applying Jamie’s methods to develop estimators in real problems such as organ transplantation studies. The point is this: when it comes to causation, just writing down a model using graphs or counterfactuals is only a small part of the story. Finding estimators is often much more challenging.
End Digression.

On Wednesday, I had the honor of giving the Rietz Lecture (named after Henry Rietz, the first president of the Institute of Mathematical Statistics). I talked about Topological Inference. I will post the slides on my web page soon. I was fortunate to be introduced by my good friend Ed George.

2. Minimaxity

One thing that came up at my talk, but also in several discussions I had with people at the conference is the following problem. Recall that the minimax risk is

$\displaystyle R=\inf_{\hat \theta}\sup_{P\in {\cal P}}E_P[ L(\theta(P),\hat\theta)]$

where ${L}$ is a loss function, ${{\cal P}}$ is a set of distributions and the infimum is over all estimators. Often we can define an estimator that achieves the minimax rate for a problem. But the estimator that achieves the risk ${R}$ may not be computable in any practical sense. This happens in some of the topological problems I discussed. It also happens in sparse PCA problems. I suspect we are going to come across this issue more and more.

I suggested to a few people that we should really define the minimax risk by restricting ${\hat\theta}$ to be an estimator that can be computed in polynomial time. But how would we then compute the minimax risk? Some people, like Mike Jordan and Martin Wainwright, have made headway in studying the tradeoff between optimality and computation. But a general theory will be hard. I did mention this to Aad van der Vart at dinner one night. If anyone can make headway with this problem, it is Aad.

3. Montreal

The conference was in Montreal. If you have never been there, Montreal is a very pleasant city; vibrant and full of good restaurants. We had a very nice time including good dinners and cigars at a classy cigar bar.

Rant: The only bad thing about Montreal is the tyrannical language police. (As a dual citizen, I think I am allowed to criticize Canada!) Put a sign on your business only in English and go to jail. Very tolerant. This raises lots interesting problems: Is “pasta” English? What if you are using it as a proper name, as in naming your business: Pasta! And unless you live in North Korea, why should anyone be able to tell you what language to put on your private business? We heard one horror story that is almost funny. A software company opened in Montreal. The language police checked an approved of their signage. (Yes. They really pay people to do this.) Then they were told they had to write their code in French. I’m not even sure what this means. Is C code English or French? Anyway, the company left Montreal.

To be fair, it’s not just Montreal that lives under the brutal rule of bureaucrats and politicians. Every country does. But Canada does lean heavily toward arbitrary (i.e. favoring special groups) rules. Shane Jensen pointed me to Can’tada which lists stuff you can’t use in Canada.
End Rant.

Despite my rant, I think Montreal is a great city. I highly recommend it for travel.

4. Summary

The JSM is lots of fun. Yes, it is a bit overwhelming but where else can you find 6000 statisticians in one place? So next year, when I say I don’t want to go, someone remind me that I actually had a good time.

## The Steep Price of Sparsity

The Steep Price of Sparsity

We all love sparse estimators these days. I am referring to things like the lasso and related variable selection methods.

But there is a dark side to sparsity. It’s what Hannes Leeb and Benedikt Potscher call the “return of the Hodges’ estimator.” Basically, any estimator that is capable of producing sparse estimators has infinitely bad minimax risk.

1. Hodges Estimator

Let’s start by recalling Hodges famous example.

Suppose that ${X_1,\ldots,X_n \sim N(\theta,1)}$. Define ${\hat\theta}$ as follows:

${\hat\theta =\overline{X}}$ if ${|\overline{X}| \geq n^{-1/4}}$

${\hat\theta=0}$ if ${|\overline{X}| < n^{-1/4}}$.

If ${\theta\neq 0}$, then ${\hat\theta}$ will equal ${\overline{X}}$ for all large ${n}$. But if ${\theta=0}$, then eventually ${\hat\theta =0}$. The estimator discovers that ${\theta}$ is 0.

This seems like a good thing. This is what we want whenever we do model selection. We want to discover that some coefficients are 0. That’s the nature of using sparse methods.

But there is a price to pay for sparsity. The Hodges estimator has the unfortunate property that the maximum risk is terrible. Indeed,

$\displaystyle \sup_\theta \mathbb{E}_\theta [n (\hat\theta-\theta)^2] \rightarrow \infty.$

Contrast this will the sample mean:

$\displaystyle \sup_\theta \mathbb{E}_\theta [n (\overline{X}-\theta)^2] =1.$

The reason for the poor behavior of the Hodges estimator — or any sparse estimator — is that there is a zone near 0 where the estimator will be very unstable. The estimator has to decide when to switch from ${\overline{X}}$ to 0 creating a zone of instability.

I plotted the risk function of the Hodges estimator here. The risk of the mle is flat. The large peaks in the risk function of the Hodges estimator are very clear (and very disturbing).

2. The Steep Price of Sparsity

Leeb and Potscher (2008) proved that this poor behavior holds for all sparse estimators and all loss functions. Suppose that

$\displaystyle Y_i = x_i^T \beta + \epsilon_i,\ \ \ i=1,\ldots, n.$

here, ${\beta\in \mathbb{R}^k}$. Let ${s(\beta)}$ be the support of ${\beta}$: ${s_j(\beta)=1}$ of ${\beta_j \neq 0}$ and ${s_j(\beta)=0}$ of ${\beta_j = 0}$. Say that ${\hat\beta}$ is sparsistent (a term invented by Pradeep Ravikumar) if, for each ${\beta}$,

$\displaystyle P^n_\beta(s(\hat\beta)=s(\beta)) \rightarrow 1$

as ${n\rightarrow \infty}$.

Leeb and Potscher (2008) showed that if ${\hat\beta}$ is sparsistent, then

$\displaystyle \sup_\beta E[ n\, ||\hat\beta-\beta||^2]\rightarrow \infty$

as ${n\rightarrow \infty}$. More generally, for any non-negative loss function ${\ell(\hat\beta-\beta)}$, we have

$\displaystyle \sup_\beta E[ \ell(n^{1/2}(\hat\beta -\beta))]\rightarrow \sup_\beta \ell(\beta).$

3. How Should We Interpret This Result?

One might object that the maximum risk is too conservative and includes extreme cases. But in this case, that is not true. The high values of the risk occur in a small neighborhood of 0. (Recall the picture of the risk of the Hodges estimator.) This is far from pathological.

Another objection is that the proof assumes ${n}$ grows while ${k}$ stays fixed. But in many applications that we are interested in, ${k}$ grows and is even possibly larger than ${n}$. This is a valid objection. On the other hand, if there is unfavorable behavior in the ideal case of fixed ${k}$, we should not be sanguine about the high-dimensional case.

I am not suggesting we should give up variable selection. I use variable selection all the time. But we should keep in mind that there is a steep price to pay for sparsistency.

References

Leeb, Hannes and Potscher, Benedikt M. (2008). Sparse estimators and the oracle property, or the return of Hodges’ estimator. Journal of Econometrics, 142, 201-211.

Leeb, Hannes and Potscher, Benedikt M. (2005). Model selection and inference: Facts and fiction. Econometric Theory, 21, 21-59.

## THE FIVE: Jeff Leek’s Challenge

Jeff Leek, over at Simply Statistics asks an interesting question: What are the 5 most influential statistics papers of 2000-2010?

I found this to be incredibly difficult to answer. Eventually, I came up with this list:

Donoho, David (2006). Compressed sensing. IEEE Transactions on Information Theory. 52, 1289-1306.

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Meinshausen, Nicolai and Buhlmann, Peter. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

Efron, Bradley and Hastie, Trevor and Johnstone, Iain and Tibshirani, Robert. (2004). Least angle regression. The Annals of statistics, 32, 407-499.

Hofmann, Thomas and Scholkopf, Bernhard and Smola, Alexander J. (2008). Kernel methods in machine learning. The Annals of Statistics. 1171–1220.

These are all very good papers. These papers had a big impact on me. More precisely, they are representative of ideas that had an impact on me. It’s more like there are clusters of papers and these are prototypes from those clusters. I am not really happy with my list. I feel like I must be forgetting some really important papers. Perhaps I am just getting old and forgetful. Or maybe our field is not driven by specific papers.

What five would you select? (Please post them at Jeff’s blog too.)

## LOST CAUSES IN STATISTICS II: Noninformative Priors

LOST CAUSES IN STATISTICS II: Noninformative Priors

I thought I would post at a higher frequency in the summer. But I have been working hard to finish some papers which has kept me quite busy. So, apologies for the paucity of posts.

Today I’ll discuss another lost cause: noninformative priors.

I like to say that noninformative priors are the perpetual motion machines of statistics. Everyone wants one but they don’t exist.

By definition, a prior represents information. So it should come as no surprise that a prior cannot represent lack of information.

The first “noninformative prior” was of course the flat prior. The major flaw with this prior is lack of invariance: if it is flat in one parameterization it will not be flat in most other parameterizations. Flat prior have lots of other problems too. See my earlier post here.

The most famous noninformative prior (I’ll stop putting quotes around the phrase from now on) is Jeffreys prior which is proportional to the square root of the determinant of the Fisher information matrix. While this prior is invariant, it can still have undesirable properties. In particular, while it may seem noninformative for a parameter ${\theta}$ it can end up being highly informative for functions of ${\theta}$. For example, suppose that ${Y}$ is multivariate Normal with mean vector ${\theta}$ and identity covariance. The Jeffreys prior is the flat prior ${\pi(\theta) \propto 1}$. Now suppose that we want to infer ${\psi = \sum_j \theta_j^2}$. The resulting posterior for ${\psi}$ is a disaster. The coverage of the Bayesian ${1-\alpha}$ posterior interval can be close to 0.

This is a general problem with noninformative priors. If ${\pi(\theta)}$ is somehow noninformative for ${\theta}$, it may still be highly informative for sub-parameters, that is for functions ${\psi = g(\theta)}$ where ${\theta\in \mathbb{R}^d}$ and ${\psi: \mathbb{R}^d \rightarrow \mathbb{R}}$.

Jim Berger and Jose Bernardo wrote a series of interesting papers about priors that were targeted to be noninformative for particular functions of ${\theta}$. These are often called reference priors. But what if you are interested in many functions of ${\theta}$. Should you use a different prior for each function of interest?

A more fundamental question is: what does it mean for a prior to be noninformative? Of course, people have argued about this for many, many years. One definition, which has the virtue of being somewhat precise, is that a prior is noninformative if the ${1-\alpha}$ posterior regions have frequentist coverage equal (approximately) to ${1-\alpha}$. These are sometimes called “matching priors.”

In general, it is hard to construct matching priors especially in high-dimensional complex models. But matching priors raise a fundamental question: if your goal is to match frequentist coverage, why bother with Bayes at all? Just use a frequentist confidence interval.

These days I think that most people agree that the virtue of Bayesian methods is that it gives you a way to include prior information in a systematic way. There is no reason to formulate a “noninformative prior.”

On the other hand, in practice, we often deal with very complex, high-dimensional models. Can we really formulate a meaningful informative prior in such problems? And if we do, will anyone care about our inferences?

In 1996, I wrote a review paper with Rob Kass on noninformative priors (Kass and Wasserman 1996). We emphasized that a better term might be “default prior” since that seems more honest and promises less. One of our conclusions was:

“We conclude that the problems raised by the research on priors chosen by formal rules are serious and may not be dismissed lightly: When sample sizes are small (relative the number of parameters being estimated), it is dangerous to put faith in any default solution; but when asymptotics take over, Jeffreys’s rules and their variants remain reasonable choices.”

Looking at this almost twenty years later, the one thing that has changed is the “the number of parameters being estimated” which these days is often very, very large.

My conclusion: noninformative priors are a lost cause.

Reference

Kass, Robert E and Wasserman, Larry. (1996). The selection of prior distributions by formal rules. Journal of the American Statistical Association, 91, 1343-1370.

## LOST CAUSES IN STATISTICS I: Finite Additivity

LOST CAUSES IN STATISTICS I: Finite Additivity

I decided that I’ll write an occasional post about lost causes in statistics. (The title is motivated by Streater (2007).) Today’s post is about finitely additive probability (FAP).

Recall how we usually define probability. We start with a sample space ${S}$ and a ${\sigma}$-algebra of events ${{\cal A}}$. A real-valued function ${P}$ on ${{\cal A}}$ is a probability measure if it satisfies three axioms:

(A1) ${P(A) \geq 0}$ for each ${A\in {\cal A}}$.

(A2) ${P(S)=1}$.

(A3) If ${A_1,A_2,\ldots}$ is a sequence of disjoint events then

$\displaystyle P\Bigl(A_1 \bigcup A_2 \bigcup \cdots \Bigr) = \sum_{i=1}^\infty P(A_i).$

The third axiom, countable additivity, is rejected by some extremists. In particular, Bruno de Finetti was a vocal opponent of (A3). He insisted that probability should only be required to satisfy the additivity rule for finite unions. If ${P}$ is only required to satisfy the additivity rule for finite unions, we say it is a finitely additive probability measure.

Axioms cannot be right or wrong; they are working assumptions. But some assumptions are more useful than others. Countable additivity is undoubtedly useful. The entire edifice of modern probability theory is built on countable additivity. Denying (A3) is like saying we should only use rational numbers rather than real numbers.

Proponents of FAP argue that it can express concepts that cannot be expressed with countably additive probability. Consider the natural numbers

$\displaystyle S = \{1,2,3,\ldots, \}.$

There is no countably additive probability measure that puts equal probability on every element of ${S}$. That is, there is no uniform probability on ${S}$. But there are finitely additive probabilities that are uniform on ${S}$. For example, you can construct a finitely additive probability ${P}$ for which ${P(\{i\})=0}$ for each ${i}$. This does not contradict the fact that ${P(S)=1}$ unless you invoke (A3).

You can also decide to assign probability 1/2 to the even numbers and 1/2 to the odd numbers. Again this does not conflict with each integer having probability 0 as long as you do not insist on countable additivity.

These features are considered to be good things by fans of FAP. To me, these properties make it clear why finitely additive probability is a lost cause. You have to give up the ability to compute probabilities. With countably additive probability, we can assign mass ${p(s)}$ to each element ${s\in S}$ and then we can derive the probability of any event ${A}$ by addition:

$\displaystyle P(A) = \sum_{s\in A}p(s).$

This simple fact is what makes probability so useful. But for FAP, you cannot do this. You have to assign probabilities rather than calculate them.

In FAP, you also give up basic calculation devices such as the law of total probability: if ${B_1,B_2,\ldots,}$ is a disjoint partition then

$\displaystyle P(A) = \sum_j P(A|B_j) P(B_j).$

This formula is, in general, not true for FAP. Indeed, as my colleagues Mark Schervish, Teddy Seidenfeld and Jay Kadane showed, (see Schervish et al (1984)), every probability measure that is finitely but not countably additive, exhibits non-conglomerability. This means that there is an event ${E}$ and a countable partition ${B_1,B_2,\ldots,}$ such that ${P(E)}$ is not contained in the interval

$\displaystyle \Bigl[\inf_j P(E|B_j),\ \sup_j P(E|B_j)\Bigr].$

To me, all of this suggests that giving up countable additivity is a mistake. We lose some of the most useful and intuitive properties of probability.

For these reasons, I declare finitely additive probability theory to be a lost cause.

Other lost causes I will discuss in the future include fiducial inference and pure likelihood inference. I am tempted to put neural nets into the lost cause category although the recent work on deep learning suggests that may be hasty. Any suggestions?

References

Schervish, Mark, Seidenfeld, Teddy and Kadane, Joseph. (1984). The extent of non-conglomerability of finitely additive probabilities. Zeitschrift f\”{u}r Wahrscheinlichkeitstheorie und Verwandte Gebiete, Volume 66, pp 205-226.

Streater, R. (2007). Lost Causes in and Beyond Physics. Springer.

Imagine a treatment with the following properties:

 The treatment is good for men (E1) The treatment is good for women (E2) The treatment bad overall (E3)

That’s the essence of Simpson’s paradox. But there is no such treatment. Statements (E1), (E2) and (E3) cannot all be true simultaneously.

Simpson’s paradox occurs when people equate three probabilistic statements (P1), (P2), (P3) described below, with the statements (E1), (E2), (E3) above. It turns out that (P1), (P2), (P3) can all be true. But, to repeat: (E1), (E2), (E3) cannot all be true.

The paradox is NOT that (P1), (P2), (P3) are all true. The paradox only occurs if you mistakenly equate (P1-P3) with (E1-E3).

1. Details

Throughout this post I’ll assume we have essentially an infinite sample size. The confusion about Simpson’s paradox is about population quantities so we needn’t focus on sampling error.

Assume that ${Y}$ is binary. The key probability statements are:

 ${P(Y=1|X=1,Z=1) - P(Y=1|X=0,Z=1) > 0}$ (P1) ${P(Y=1|X=1,Z=0) - P(Y=1|X=0,Z=0) > 0}$ (P2) ${P(Y=1|X=1) - P(Y=1|X=0) < 0}$ (P3)

Here, ${Y}$ is the outcome (${Y=1}$ means success, ${Y=0}$ means failure), ${X}$ is treatment (${X=1}$ means treated, ${X=0}$ means not-treated) and ${Z}$ is sex (${Z=1}$ means male, ${Z=0}$ means female).

It is easy to construct numerical examples where (P1), (P2) and (P3) are all true. The confusion arises if we equate the three probability statements (P1-P3) with the English sentences (E1-E3).

To summarize: it is possible for (P1), (P2), (P3) to all be true. It is NOT possible for (E1), (E2), (E3) to all be true. The error is in equating (P1-P3) with (E1-E3).

To capture the English statements above, we need causal language, either counterfactuals or causal directed graphs. Either will do. I’ll use counterfactuals. (For an equivalent explanation using causal graphs, see Pearl 2000). Thus, we introduce ${(Y_1,Y_0)}$ where ${Y_1}$ is your outcome if treated and ${Y_0}$ is your outcome if not treated. We observe

$\displaystyle Y = X Y_1 + (1-X) Y_0.$

In other words, if ${X=1}$ we observe ${Y_1}$ and if ${X=0}$ we observe ${Y_0}$. We never observe both ${Y_1}$ and ${Y_0}$ on any person. The correct translation of (E1), (E2) and (E3) is:

 ${P(Y_1=1|Z=1) - P(Y_0=1|Z=1) >0}$ (C1) ${P(Y_1=1|Z=0) - P(Y_0=1|Z=0) >0}$ (C2) ${P(Y_1=1) - P(Y_0=1) <0}$ (C3)

These three statements cannot simultaneously be true. Indeed, if the first two statements hold then

$\displaystyle \begin{array}{rcl} P(Y_1=1) - P(Y_0=1) &=& \sum_{z=0}^1 [P(Y_1=1|Z=z) - P(Y_0=1|Z=z)] P(z)\\ & > & 0. \end{array}$

Thus, (C1)+(C2) implies (not C3). If the treatment is good for mean and good for women then of course it is good overall.

To summarize, in general we have

$\displaystyle (E1) = (C1) \neq (P1)$

$\displaystyle (E2) = (C2) \neq (P2)$

$\displaystyle (E3) = (C3) \neq (P3)$

and, moreover (E3) cannot hold if both (E1) and (E2) hold.

The key is that, in general,

$\displaystyle \begin{array}{rcl} P(Y=1|X=1,Z=1) &-& P(Y=1|X=0,Z=1)\\ & \neq & P(Y_1=1|Z=1) - P(Y_0=1|Z=1) \end{array}$

$\displaystyle \begin{array}{rcl} P(Y=1|X=1,Z=0) &-& P(Y=1|X=0,Z=0)\\ & \neq & P(Y_1=1|Z=0) - P(Y_0=1|Z=0) \end{array}$

and

$\displaystyle P(Y=1|X=1) - P(Y=1|X=0) \neq P(Y_1=1) - P(Y_0=1).$

In other words, correlation (left hand side) is not equal to causation (right hand side).

Now, if treatment is randomly assigned, then ${X}$ is independent of ${(Y_0,Y_1)}$ and

$\displaystyle P(Y=1|X=1,Z=z) = P(Y_1=1|X=1,Z=z) = P(Y_1=1|Z=z)$

and so we will not observe the reversal, even for the correlation statements. That is, when ${X}$ is randomly assigned, (P1-P3) cannot all hold.

In the non-randomized case, we can only recover the causal effect by conditioning on all possible confounding variables ${W}$. (Recall a confounding variable is a variable that affects both ${X}$ and ${Y}$.) This is because ${X}$ is independent of ${(Y_0,Y_1)}$ conditional on ${W}$ (that’s what it means to control for confounders) and we have

$\displaystyle \begin{array}{rcl} P(Y_1=1) &=& \sum_w P(Y_1=1|W=w) P(W=w)\\ &=& \sum_w P(Y_1=1|X=1,W=w) P(W=w) \\ &=& \sum_w P(Y=1|X=1,W=w) P(W=w) \\ \end{array}$

and similarly, ${P(Y_0=1) = \sum_w P(Y=1|X=0,W=w) P(W=w)}$ and so

$\displaystyle \begin{array}{rcl} P(Y_1=1) &-& P(Y_0=1)\\ & = & \sum_w [P(Y=1|X=1,W=w) - P(Y=1|X=0,W=w) ] P(W=w) \end{array}$

which reduces the causal effect into a formula involving only observables. This is usually called the adjusted treatment effect. Now, if it should happen that there is only one confounding variable and it happens to be our variable ${Z}$ then

$\displaystyle \begin{array}{rcl} P(Y_1=1) &-& P(Y_0=1)\\ &=& \sum_w [P(Y=1|X=1,Z=z) - P(Y=1|X=0,Z=z) ] P(Z=z). \end{array}$

In this case we get the correct causal conclusion by conditioning on ${Z}$. That’s why people usually call the conditional answer correct and the unconditional statement misleading. But this is only true if ${Z}$ is a confounding variable and, in fact, is the only confounding variable.

Some texts make it seem as if the conditional answers (P2) and (P3) are correct and (P1) is wrong. This is not necessarily true. There are several possibilities:

1. ${Z}$ is a confounder and is the only confounder. Then (P3) is misleading and (P1) and (P2) are correct causal statements.

2. There is no confounder. Moreover, conditioning on ${Z}$ causes confounding. Yes, contrary to popular belief, conditioning on a non-confounder can sometimes cause confounding. (I discuss this more below.) In this case, (P3) is correct and (P1) and (P2) are misleading.

3. ${Z}$ is a confounder but there are other unobserved confounders. In this case, none of (P1), (P2) or (P3) are causally meaningful.

Without causal language— counterfactuals or causal graphs— it is impossible to describe Simpson’s paradox correctly. For example, Lindley and Novick (1981) tried to explain Simpson’s paradox using exchangeability. It doesn’t work. This is not meant to impugn Lindley or Novick— known for their important and influential work— but just to point out that you need the right language to correctly resolve a paradox. In this case, you need the language of causation.

3. Conditioning on Nonconfounders

I mentioned that conditioning on a non-confounder can actually create confounding. Pearl calls this ${M}$-bias. (For those familar with causal graphs, this is bascially the fact that conditioning on a collider creates dependence.)

To elaborate, suppose I want to estimate the causal effect

$\displaystyle \theta = P(Y_1=1) - P(Y_1=0).$

If ${Z}$ is a confounder (and is the only confounder) then we have the identity

$\displaystyle \theta = \sum_z [P(Y=1|X=1,Z=z)-P(Y=1|X=0,Z=z)] P(Z=z)$

that is, the causal effect is equal to the adjusted treatment effect. Let us write

$\displaystyle \theta = g[ p(y|x,z),p(z)]$

to indicate that the formula for ${\theta}$ is a function of the distributions ${p(y|x,z)}$ and ${p(z)}$.

But, if ${Z}$ is not a confounder, does the equality still hold? To simplify the discussion assume that there are no other confounders. Either ${Z}$ is a confounder or there are no confounders. What is the correct identity for ${\theta}$? Is it

$\displaystyle \theta = P(Y=1|X=1) - P(Y=1|X=0)$

or

$\displaystyle \theta = \sum_z [P(Y=1|X=1,Z=z)-P(Y=1|X=0,Z=z)] P(Z=z)?$

The answer (under these assumptions) is this: if ${Z}$ is not a confounder then the first identity is correct and if ${Z}$ is a confounder then the second identity is correct. In the first case ${\theta = g[p(y|x)]}$.

Now, when there are no confounders, the first identity is correct. But is the second actually incorrect or will it give the same answer as the first? The answer is: sometimes they gave the same answer but it is possible to construct situations where

$\displaystyle \theta \neq \sum_z [P(Y=1|X=1,Z=z)-P(Y=1|X=0,Z=z)] P(Z=z).$

(This is something that Judea Pearl has often pointed out.) In these cases, the correct formula for the causal effect is the first one and it does not involve conditioning on ${Z}$. Put simply, conditioning on a non-confounder can (in certain situations) actually cause confounding.

4. Continuous Version

A continuous version of Simpson’s paradox, sometimes called the ecological fallacy, looks like this:

Here we see that increasing doses of drug ${X}$ lead to poorer outcomes (left plot). But when we separate the data by sex (${Z}$) the drug shows better outcomes for higher doses for both males and females.

5. A Blog Argument Resolved?

We saw that in some cases ${\theta}$ is a function of ${p(y|x,z)}$ but in other cases it is only a function of ${p(y|x)}$. This fact led to an interesting exchange between Andrew Gelman and Judea Pearl and, later, Pearl’s student Elias Bareinboim. See, for example, here and here and here.

As I recall (Warning! my memory could be wrong), Pearl and Bareinboim were arguing that in some cases, the correct formula for the causal effect was the first one above which does not involve conditioning on ${Z}$. Andrew was arguing that conditioning was a good thing to do. This led to a heated exchange.

But I think they were talking past each other. When they said that one should not condition, they meant that the formula for the causal effect ${\theta}$ does not involve the conditional distribution ${p(y|x,z)}$. Andrew was talking about conditioning as a tool in data analysis. They were each using the word conditioning but they were referring to two different things. At least, that’s how it appeared to me.

6. References

For numerical examples of Simpson’s paradox, see the Wikipedia article.

Lindley, Dennis V and Novick, Melvin R. (1981). The role of exchangeability in inference. The Annals of Statistics, 9, 45-58.

Pearl, J. (2000). Causality: models, reasoning and inference, {Cambridge Univ Press}.

## Stephen Ziliak Rejects Significance Testing

In an opinion piece in the Financial Post, Stephen Ziliak goes into the land of hyperbole, declaring that all significance testing is junk science. It starts like this:

I want to believe as much as the next person that particle physicists have discovered a Higgs boson, the so-called “God particle,” one with a mass of 125 gigaelectronic volts (GeV). But so far I do not buy the statistical claims being made about the discovery. Since the claims about the evidence are based on “statistical significance” – that is, on the number of standard deviations by which the observed signal departs from a null hypothesis of “no difference” – the physicists’ claims are not believable. Statistical significance is junk science, and its big piles of nonsense are spoiling the research of more than particle physicists.

He goes on to say:

Statistical significance stinks. In statistical sciences from economics to medicine, including some parts of physics and chemistry, the ubiquitous “test” for “statistical significance” cannot, and will not, prove that a Higgs boson exists, any more than it can prove the reality of God, the existence of a good pain pill, or the validity of loose monetary policy.

While I have said many times in this blog that I, too, think significance testing is mis-used, it is ridiculous to jump to the conclusion that “Statistical significance is junk science.” Ironically, Mr. Ziliak is engaging in exactly the same all-or-nothing thinking that he is criticizing.

You name any statistical method: confidence intervals, Bayesian inference, etc. and it is easy to find people mis-using it. The fact that people mis-use or misunderstand a statistical method does not render it dangerous. The blind and misinformed use of any statistical method is dangerous. Statistical ignorance is the enemy. Mr. Ziliak’s singular focus on the evils of testing seems more cultish than scientific.

## Happy Birthday Normal Deviate

Today is the one year anniversary of this blog. First of all, thanks to all the readers. And special thanks to commenters and guest posters. This seems like a good time to assess whether I have achieved my goals for the blog and to get suggestions on how I might proceed in year two.

GOALS. My goals in starting the blog were:

(1) To discuss random things that I happen to find interesting.

(2) To discuss ideas at the interface of statistics and machine learning.

(3) To post every other day.

Goal 1: Achieved.

Goal 2: Partially achieved.

Goal 3: Failed miserably. I was clearly too ambitious. I am lucky if I post once per week.

THE BEST AND WORST. Favorite post: flatland. I still think this is one of the coolest and deepest paradoxes in statistics.

Least Favorite Post: This post where I was dismissive of PAC learning. I think I was just in a bad mood.

LESSON LEARNED. Put “Bayes”, “Frequentist” or “p-value” in the title of a blog post and you get zillions of hits. Put some combination of them and get even more. If I really wanted to get a big readership I would just post exclusively about this stuff. But it would get boring pretty fast.

GOING FORWARD. I hope to keep posting about once per week. But I don’t have any plans to make any specific changes to the blog. I am, however, open to suggestions.

Any suggestions for making the blog more interesting or more fun?

Any suggestions for inducing more people to write comments?

Any topics you would like me to cover? (I already promised to do one on Simpson’s paradox).

## The Value of Adding Randomness

In computer science it is common to use randomized algorithms. The same is true in statistics: there are many ways that adding randomness can make things easier. But the way that randomness enters, varies quite a bit in different methods. I thought it might be interesting to collect some specific examples of statistical procedures where added randomness plays some role. (I am not referring to the randomness inherent in the original data but, rather, I refer to randomness in the statistical method itself.)

(1) Randomization in causal inference. The mean difference ${\alpha}$ between a treated group and untreated group is not, in general, equal to the causal effect ${\theta}$. (Correlation is not causation.) Moreover, ${\theta}$ is not identifiable. But if we randomly assign people to the two groups then, magically, ${\theta =\alpha}$. This is easily proved using either the directed graph approach to causation or the counterfactual approach: see here for example. This fact is so elementary that we tend to forget how amazing it is. Of course, this is the reason we spend millions of dollars doing randomized trials.

(As an aside, some people say that there is no role for randomization in Bayesian inference. In other words, the randomization mechanism plays no role in Bayes’ theorem. But this is not really true. Without randomization, we can indeed derive a posterior for ${\theta}$ but it is highly sensitive to the prior. This is just a restatement of the non-identifiability of ${\theta}$. With randomization, the posterior is much less sensitive to the prior. And I think most practical Bayesians would consider it valuable to increase the robustness of the posterior.)

(2) Permutation Tests. If ${X_1,\ldots, X_n \sim P}$ and ${Y_1,\ldots, Y_m \sim Q}$ and you want to test ${H_0: P=Q}$ versus ${H_1: P\neq Q}$, you can get an exact, distribution-free test by using the permutation method. See here. We rarely search over all permutations. Instead, we randomly select a large number of permutations. The result is still exact (i.e. the p-value is sub-uniform under ${H_0}$.)

(3) The Bootstrap. I discussed the bootstrap here. Basically, to compute a confidence interval, we approximate the distribution

$\displaystyle L_n=\mathbb{P}( \sqrt{n}(\hat\theta - \theta) \leq t)$

with the conditional distribution

$\displaystyle \hat L_n=\mathbb{P}( \sqrt{n}(\hat\theta^* - \hat\theta) \leq t\ | X_1,\ldots, X_n)$

where ${\hat\theta = g(X_1,\ldots, X_n)}$, ${\hat\theta^* = g(X_1^*,\ldots, X_n^*)}$ and ${X_1^*,\ldots, X_n^*}$ is a sample from the empirical distribution. But the distribution ${\hat L_n}$ is intractable. Instead, we approximate it by repeatedly sampling from the empirical distribution function. This makes otherwise intractable confidence intervals trivial to compute.

(4) ${k}$-means++. Minimizing the objective function in ${k}$-means clustering is NP-hard. Remarkably, as I discussed here, the ${k}$-means++ algorithm uses a careful randomization method for choosing starting values and gets close to the minimum with high probability.

(5) Cross-Validation. Some forms of cross-validation involve splitting the data randomly into two or more groups. We use one part(s) for fitting and the other(s) for testing. Some people seem bothered by the randomness this introduces. But it makes risk estimation easy and accurate.

(6) MCMC. An obvious and common use of randomness is random sampling from a posterior distribution, usually by way of Markov Chain Monte Carlo. This can dramatically simplify Bayesian inference.

These are the first few things that came to my mind. Are there others I should add to the list?