Monthly Archives: October 2012

The Future of Machine Learning (and the End of the World?)

On Thursday (Oct 25) we had an event called the ML Futuristic Panel Discussion. The panelists were Ziv Bar-Joseph, Steve Fienberg, Tom Mitchell, Aarti Singh and Alex Smola.

Ziv is an expert on machine learning and systems biology. Steve is a colleague of mine in Statistics with a joint appointment in ML, Tom is the head and founder of the ML department, Aarti is an Assistant Professor in ML and Alex, who is well known as a pioneer in kernel methods, is joining us as a professor in ML in January. An august panel to say the least.

The challenge was to predict what the next important breakthroughs in ML would be. It was also a discussion of where the panelists thought ML should be going in the future. Based on my notoriously unreliable memory, here is my summary of the key points.

1. What The Panelists Said

Aarti: ML is good at important but mundane tasks (classification etc) but not at higher level tasks like thinking of new hypotheses. We need ML techniques that play a bigger role in the whole process of making scientific discoveries. The more machines can do, the more high level tasks humans can concentrate their efforts on.

Ziv: There is a gap between the advances in systems biology and its use on practical problems, especially medicine. Each person is a repository of an unimaginable amount of data. An unsolved problem in ML is how to use all the knowledge we have developed in systems biology and use it for personalized medicine. In a sense, this is the problem of bridging information at the cell level and information at the level of an individual (consisting of trillions of interacting cells).

Steve: We should not forget the crucial role of intervention. Experiments involve manipulating variables. Passive ML methods are only part of the whole story. Statistics and ML methods help us learn, but then we have to decide what experiments to do, what interventions to make. Also, we have to decide what data to collect; not all data are useful. In other words, the future of ML has to still include human judgement.

Tom: He joked that his mother was not impressed with ML. After all, she saw Tom grow from an infant who knew nothing to an adult who can do an amazing number of things. Tom says we need to learn how to “raise computers” in analogy to raising children. We need machines that can learn how to learn. An example is the NELL project (Never Ending Language Learning) which Tom leads. This is a system which has been running since January 2010 and is learning how to read information from the web. See also here. Amazing stuff.

Alex: More and more, computing is done on huge numbers of highly connected inexpensive processors. This raises many questions about how to design algorithms. There are interesting challenges for systems designers, ML people and statisticians. For example: can you design an estimator that can easily be distributed with little loss of statistical efficiency and that is highly tolerant to failures of small numbers of processors?

2. The Future?

I found the panel discussion very inspiring. All of the panelists had interesting things to say. There was much discussion after the presentations. Martin Azizyan asked (and I am paraphrasing), “Have we really solved all the current ML problems?” The panel agreed that, no, we have not. We need to keep working on current problems (even if they seem mundane compared to the futuristic things discussed by the panel). But we can also work on the next generation of problems at the same time.

Discussing future trends is important. But we have to remember that we are probably wrong about our predictions. Niels Bohr said “Prediction is very difficult, especially about the future.” And as Yogi Berra said, “The future ain’t what it used to be.”

When I was a kid, it was routinely predicted that, by the year 2000, people would fly to work with jetpacks, we’d have flying cars and we’d harvest our food from the sea. No one really predicted the world wide web, laptops, cellphones, gene microarrays etc.

3. The Return of AI

But, I’ll take my chances and make a prediction anyway. I think Tom is right: computers that learn in ways closer to the ways humans learn are the future.

When I was in London in June, I had the pleasure to meet Shane Legg, from Deepmind Technologies. This is a startup that is trying to build a system that thinks. This was the original dream of AI.

As Shane explained to me, there has been huge progress in both neuroscience and ML and their goal is to bring these things together. I thought it sounded crazy until he told me the list of famous billionaires who have invested in the company.

Which raises an interesting question. Suppose someone (Tom Mitchell, the people at Deepmind, or someone else) creates a truly intelligent system. Now they have a system as smart as a human. But all they have to do is put the system on a huge machine with more horsepower than a human brain. Suddenly, we are in the world of super-intelligent computers surpassing humans.

Perhaps they’ll be nice to us. Or, it could turn into Robopocalypse. If so, this could mean the end of the world as we know it.

By the way, Daniel Wilson, the author of Robopocalypse, was a student at CMU. I heard rumours that he kept a picture of me on his desk to intimidate himself to work hard. I don’t think of myself as intimidating so maybe this isn’t true. However, the book begins with a character named Professor Wasserman, a statistics professor, who unwittingly unleashes an intelligent program that leads to the Robopocalypse.

Steven Spielberg is making a movie based on the book, to be released April 25 2014. So far, I have not had any calls from Spielberg.

So my prediction is this: someone other than me will be playing Professor Wasserman in the film adaptation of Robopocalypse.

What are your predictions for the future of ML and Statistics?


The Inside Story of the L’Aquila Affair

On April 6 2009 a major earthquake in L’Aquila, Italy killed hundreds of people. On October 22 2012, seven people were convicted of manslaughter for downplaying the likelihood of a major earthquake six days before it took place.

I don’t think that the scientists and engineers in Italy should go to prison for failing to clearly communicate the risk to the public.

But … most of the reporting on this side of the Atlantic has been inaccurate. No one was convicted for “failing to predict an earthquake” as has been widely reported. Here is the rest of the story.

1. What Happened

L’Aquila is a small town about 60 miles north-east of Rome. In 2009, the town experienced a swarm of tremors. A local crackpot named Giampaolo Giuliani started making his own earthquake predictions which made nervous residents even more nervous.

According to Wikipedia:

The 2009 L’Aquila earthquake occurred in the region of Abruzzo, in central Italy. The main shock occurred at 3:32 local time on 6 April 2009, and was rated 5.8 on the Richter scale and 6.3 on the moment magnitude scale; its epicentre was near L’Aquila, the capital of Abruzzo, which together with surrounding villages suffered most damage. There have been several thousand foreshocks and aftershocks since December 2008, more than thirty of which had a Richter magnitude greater than 3.5.

In a subsequent inquiry of the handling of the disaster, seven members of the Italian National Commission for the Forecast and Prevention of Major Risks were accused of giving “inexact, incomplete and contradictory” information about the danger of the tremors prior to the main quake. On 22 October 2012, six scientists and one ex-government official were convicted of multiple manslaughter for downplaying the likelihood of a major earthquake six days before it took place. They were each sentenced to six years’ imprisonment.

On March 31 2009, there was a meeting in L’Aquila of the National Commission for the Forecast and Prevention of Major Risks.

According to Nature:

The now-famous commission meeting convened on the evening of 31 March in a local government office in L’Aquila. Boschi, who had travelled by car to the city with two other scientists, later called the circumstances “completely out of the ordinary”. Commission sessions are usually closed, so Boschi was surprised to see nearly a dozen local government officials and other non-scientists attending the brief, one-hour meeting, in which the six scientists assessed the swarms of tremors that had rattled the local population. When asked during the meeting if the current seismic swarm could be a precursor to a major quake like the one that levelled L’Aquila in 1703, Boschi said, according to the meeting minutes: “It is unlikely that an earthquake like the one in 1703 could occur in the short term, but the possibility cannot be totally excluded.” The scientific message conveyed at the meeting was anything but reassuring, according to Selvaggi. “If you live in L’Aquila, even if there’s no swarm,” he says, “you can never say, ‘No problem.’ You can never say that in a high-risk region.” But there was minimal discussion of the vulnerability of local buildings, say prosecutors, or of what specific advice should be given to residents about what to do in the event of a major quake. Boschi himself, in a 2009 letter to civil-protection officials published in the Italian weekly news magazine L’Espresso, said: “actions to be undertaken were not even minimally discussed”.

Enzo Boschi was president of Italy’s National Institute of Geophysics and Volcanology. Giulio Selvaggi was director of the National Earthquake Center. The Nature article goes on to say:

Many people in L’Aquila now view the meeting as essentially a public-relations event held to discredit the idea of reliable earthquake prediction (and, by implication, Giuliani) and thereby reassure local residents. Christian Del Pinto, a seismologist with the civil-protection department for the neighbouring region of Molise, sat in on part of the meeting and later told prosecutors in L’Aquila that the commission proceedings struck him as a “grotesque pantomime”. Even Boschi now says that “the point of the meeting was to calm the population. We [scientists] didn’t understand that until later on.”

What happened outside the meeting room may haunt the scientists, and perhaps the world of risk assessment, for many years. Two members of the commission, Barberi and De Bernardinis, along with mayor Cialente and an official from Abruzzo’s civil-protection department, held a press conference to discuss the findings of the meeting. In press interviews before and after the meeting that were broadcast on Italian television, immortalized on YouTube and form detailed parts of the prosecution case, De Bernardinis said that the seismic situation in L’Aquila was “certainly normal” and posed “no danger”, adding that “the scientific community continues to assure me that, to the contrary, it’s a favourable situation because of the continuous discharge of energy”. When prompted by a journalist who said, “So we should have a nice glass of wine,” De Bernardinis replied “Absolutely”, and urged locals to have a glass of Montepulciano.

On April 6, the earthquake struck, killing 309 people.

2. The Phone Call

One thing that is missing in much of the coverage in the U.S. press is a phone call between Guido Bertolaso (Director of the Italy Civil Defense committee) and Daniela Stati (L’Aquila town councilor for civic protection). Bertolaso was already under investigation for other crimes so his phone was being tapped. The Italian newspaper La Repubblica has the phone conversation on its website. The phone tap was ordered by the Italian Judiciary.

Luckily for me, my wife is from Italy and she has transcribed and translated the phone call. (Thanks Isa.) Here is her translation of the phone call and a few paragraphs from La Repubblica.


The true story of a mock meeting of the Commission for Major Risks set on March 30, 2009 during a phone conversation between Guido Bertolaso (GB) (Director of the Italy Civil Defense committee) and Daniela Stati (DS) (L’Aquila town councilor for civic protection) tapped under order of the Italian Judiciary Council.

DS Hello

GB This is Guido Bertolaso speaking.

DS Good evening. How are you doing?

GB Good — You’ll receive a phone call from De Bernardinis, my deputy. I asked him to call a meeting in L’Aquila about this issue of seismic clusters that is going on, so as to shut up, right away, any imbecile, to calm down conjectures, worries and so on.

DS Thank you Guido thank you so much.

GB But you have to tell everybody not to send out announcements claiming that no more tremors will occur. This is bullshit. Never say this type of things when speaking of quakes.

DS Absolutely.

GB Somebody told me there has been an announcement claiming there will be no more tremors. But this is something that can never be said, Daniela, not even under torture.

DS Oh I’m sorry Guido, I did not know, I’m just out from a meeting.

GB Never mind, do not worry. But you have to make sure that any announcement goes first by my press office. They [my press office] have expertise on communicating emergency information. They know how to act to avoid any boomerang effect. You know, if there is another tremor in two hours, what are we going to say? Quakes are a mine field.

DS I’ll call them right away.

GB We have got to be very very prudent. Anyway we’ll fix this issue. Tomorrow is very important. De Bernardinis will call you to decide where to set this meeting. I will not be there, but I’ll send Zamberletti, Barberi, Boschi, you know the leading lights of Italian quakes. I’ll send them to the prefect’s office or to your office. You guys decide where, I do not give a shit. This needs to be a public relations event. Do you get it?

DS Yes, yes.

GB So they, the best seismology experts, will say: “This is normal, these phenomena happen. It is better to have 100 level 4 Richter scale tremors rather than nothing. Because 100 tremors are useful for dispersing energy, so there will never be the dangerous quake.” Do you understand?

DS All right. I will try to stop that announcement.

GB No, no. It has been done already. My people are covering this. Just talk with De Bernardinis and plan this meeting, and also announce it. We are doing this not because we are worried, but because we want to reassure people. So instead of you and me having a conversation, the best seismology scientists will talk tomorrow.

DS Everything will be all right.

3. My Assessment

Telling people there is no danger is not the same as failing to predict the earthquake. There was clearly a failure to communicate the risks to the public. And saying that the swarm of tremors reduced the risk seems blatantly misleading. Government officials pressured the scientists into playing down the risks. The scientists were used by impatient and dishonest bureaucrats.

I don’t think it makes sense to prosecute these guys. The bottom line is that earthquake prediction is difficult, everyone knows this, and L’Aquila is known to be in a seismically active area. It’s not like the seismologists actually knew there would be an earthquake and decided to keep it secret. Based on the available information, they presumably did believe that the probability of a big earthquake was low.

They may have mishandled the communication of risk, some of them more than others, but this hardly deserves criminal prosecution and six years of imprisonment.

On the other hand, the government officials who pressured people to play down the risks and who seemed to have no interest in honestly investigating the situation are perhaps more culpable.

As they said in the Corriere della Sera on Oct 24 2012:

The conviction of multiple manslaughter of the seven members of the Italian committee Great Risks is, whether we like it or not, a political sentence. In any other country, where expertise is judged with scientific criteria, politicians would have borne the accountability of their shortcomings.

So the real story of L’Aquila is the government using scientists as scapegoats.

4. Postscript

The victims of the earthquake were also victims of poor treatment from the Berlusconi government. From Wikipedia:

Around 40,000 people who were made homeless by the earthquake found accommodation in tented camps and a further 10,000 were housed in hotels on the coast. Others sought shelter with friends and relatives throughout Italy. Prime Minister Silvio Berlusconi caused a controversy when he said, in an interview to the German station n-tv, that the homeless victims should consider themselves to be on a “camping weekend” – “They have everything they need, they have medical care, hot food… Of course, their current lodgings are a bit temporary. But they should see it like a weekend of camping.” To clarify his thought, he also told the people in a homeless camp: “Head to the beach. It’s Easter. Take a break. We’re paying for it, you’ll be well looked after.”

A Rant on Refereeing

Before I started this blog, I posted an essay on my webpage about refereeing called A World Without Referees. There was a bit of discussion about it on the blogosphere. I argued that our peer review system is outdated and unfair.

David Banks has raised this issue here in the Amstat News. Karl Rohe also has an excellent commentary here.

Since I have never posted my original essay on my blog I decided that I should do so now. Here it is. Comments welcome as always.

(For a dissenting view, see Nicolas Chopin’s post on Christian’s blog here.)

Note: For those who have already read this essay, please note that at the end I have added a short postscript which wasn’t in the original.

A World Without Referees

Our current peer review is an authoritarian system resembling a priesthood or a guild. It made sense in the 1600’s when it was invented. Over 300 years later we are still using the same system. It is time to modernize and democratize our approach to scientific publishing.

1. Introduction

The peer review system that we use was invented by Henry Oldenburg, the first editor of the Philosophical Transactions of the Royal Society in 1665. We are using a refereeing system that is almost 350 years old. If we used the same printing methods as we did in 1665 it would be considered laughable. And yet few question our ancient refereeing process.

In this essay I argue that our current peer review process is bad and should be eliminated.

2. The Problem With Peer Review

The refereeing process is very noisy, time consuming and arbitrary. We should be disseminating our research as widely as possible. Instead, we let two or three referees stand in between our work and the rest of our field. I think that most people are so used to our system, that they reflexively defend it when it is criticized. The purpose of doing research is to create new knowledge. This knowledge is useless unless it is disseminated. Refereeing is an impediment to dissemination.

Every experienced researcher that I know has many stories about having papers rejected because of unfair referee reports. Some of this can be written off as sour grapes, but not all of it. In the last 24 years I have been an author, referee, associate editor and editor. I have seen many cases where one referee rejected a paper and another equally qualified referee accepted it. I am quite sure that if I had sent the paper to two other referees, anything could have happened. Referee reports are strongly affected by the personality, mood and disposition of the referee. Is it fair that you work hard on something for two years only to have it casually dismissed by a couple of people who might happen to be in a bad mood or who feel they have to be critical for the sake of being critical?

Some will argue that refereeing provides quality control. This is an illusion. Plenty of bad papers get published and plenty of good papers get rejected. Many think that the stamp of approval by having a paper accepted by the refereeing process is crucial for maintaining the integrity of the field. This attitude treats a field as if it is a priesthood with a set of infallible, wise elders deciding what is good and what is bad. It is also like a guild, which protects itself by making it harder for outsiders to compete with insiders.

We should think about our field like a marketplace of ideas. Everyone should be free to put their ideas out there. There is no need for referees. Good ideas will get recognized, used and cited. Bad ideas will be ignored. This process will be imperfect. But is it really better to have two or three people decide the fate of your work?

Imagine a world without refereeing. Imagine the time and money saved by not having journals, by not having editors and associate editors, and imagine never having to referee a paper again. It’s easy if you try.

3. A World Without Referees

Young statisticians (and some of us not so young ones) put our papers on the preprint server arXiv. This is the best and easiest way to disseminate research. If you don’t check arXiv for new papers every day, then you are really missing out.

So a simple idea is just to post your papers on arXiv. If the paper is good, people will read it. If they find mistakes, you can thank them and post a revision. Pretty simple.

Walter Noll is a Professor of Mathematics at Carnegie Mellon. He suggests that we all just post our papers on our own websites. Here is a quote from his paper The Future of Scientific Publication.

1) Every author should put an invitation like the following on his or her website: Any comments, reviews, critiques, or objections are invited and should be sent to the author by e-mail. (I have this on my website.) The author should reply to any response and initiate a discussion.

2) Every author should notify his or her worldwide colleagues as soon as a new paper has been published on the website.

3) The traditional review journals (e.g. Mathematical reviews and Zentralblatt), or perhaps a new online journal, should invite the appropriate public to submit reviews, counter-reviews, and discussions of papers on websites and publish them with only minor editing.

4) Promotion committees in universities should give credit to faculty members for writing reviews.

The “publish on your own website” model can be used in concert with the arXiv model.

4. Questions and Answers

Question: Won’t we be deluged by papers? I rely on referees to filter out the bad papers.

Answer: I hope we are deluged with papers. That would be great. But I doubt it will be a problem. Math and Physics, who rely heavily on the arXiv model, have done just fine.

If you rely on referees to filter papers for you then I think you are making a huge error. Do you really want referees deciding what papers you get to read? Would you like two referees to decide what wines can be sold at the wine store? Isn’t the overwhelming selection of wine a positive rather than a negative? Wouldn’t you prefer having a wide selection so you can decide yourself? Do you really want your choices limited by others? Anyway, if there does end up being a flood of papers then smart, enterprising people will respond by creating websites and blogs that tell you what’s out there, review papers, etc. That’s a much more open, democratic approach.

Question: What is the role of journals in a world without referees?

Answer: The same as the role of punch cards.

Question: How about grants?

Answer: I think we still do need referees here. (Although flying 20 people to Washington for a panel review is ludicrous and unnecessary, but that’s another story.)

Question: How about bad papers?

Answer: Ignore them or critique them. But don’t suppress them.

Question: How about promotion cases?

Answer: Every promotion case includes a few letter writers who know the area and will be able to write substantial letters. They don’t need the approval of a journal to tell them whether the papers are good. But there will also be some letter writers who are less familiar with the candidate or the field. Sometimes these people just count papers in big journals. But you can always just look at their CV and quickly peruse a few of the candidate’s papers. That doesn’t take much time and is certainly no worse than paper counting.

Question: How about medical research?

Answer: There is arguably danger in bad medical papers. But again, I think the answer is to critique rather than suppress. However, I am mainly focusing on areas I am more familiar with, like statistics, computer science etc.

5. Conclusion

When I criticize the peer review process I find that people are quick to agree with me. But when I suggest getting rid of it, I usually find that people rush to defend it. Is it because the system is good or is it because we are so used to it that we just assume it has to be this way?

In three years we will reach the 350th birthday of the peer review system. Let’s hope we can come up with better ideas before then. At the very least we can have a discussion about it.

6. Postscript: An Analogy

In her book The Future and Its Enemies, Virginia Postrel discusses in detail the fact that the birth of new ideas is a messy, unpredictable process. She describes people who accept the unsupervised, unpredictable nature of progress as dynamists. She describes those who fear the disorderly, trial-and-error process of knowledge discovery, as stasists. She divides the stasists into two groups: the reactionaries who oppose progress and the technocrats who try to control progress with bureaucracy and centralized decision making. I classify our current system as technocratic and I am arguing for a more dynamist approach.

Proof That Theory Matters

The title of this post is a slight exaggeration but I do want to discuss an interesting paper by Steve Stigler that provides empirical support for the fact that:

… there is a tendency for influence to flow from theory to applications to a much greater extent than in the reverse direction.

Steve is a professor in the department of statistics at the University of Chicago and is known, among many other things, for his scholarly work on the history of statistics. His father was the Nobel prize winning economist George Stigler.

1. The Paper

The paper is “Citation patterns in the journals of statistics and probability,” which appeared in Statistical Science in 1994 (p 94-108). I think you can get it from the following link.

The article examines citation data between various journals. Obviously the data are now out of date. And there are many practical problems with citation data which Stigler discusses at length in the article.

The part of the paper I want to focus on is where Stigler adopts an economic point of view and treats citations as a form of trade. Citations are viewed as imports and exports. Restricting to eight major journals, Stigler assigns an export score {S} to each journal. The difference of these scores measures the exporting power of one journal to another. Specifically, he defines

\displaystyle  {\rm logodds} (A\ {\rm exports\ to\ B}| {\rm A\ and\ B\ trade})= S_A - S_B.

There are eight journals but only seven free parameters (since the relevant quantities are differences). So, without loss of generality, he takes The Annals of Statistics to be the baseline journal with score {S=0}. This results in the following export scores:

Journal Score
Annals 0.00
Biometrics -1.19
Biometrika -0.35
Communications -3.27
JASA -0.81
JRSS B -0.06
JRSS C -1.30
Technometrics -0.98

To quote from the paper: “The larger the export score, the greater the propensity to export intellectual influence.” In particular, we see that The Annals is the largest exporter.
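To make the scores concrete, here is a minimal sketch (mine, not from Stigler's paper) that inverts the log odds formula above: if the log odds are {S_A - S_B}, then the probability that A exports to B, given that they trade, is the logistic function of that difference. The scores are copied from the table; the function name `export_probability` is just an illustrative choice.

```python
import math

# Export scores copied from the table above (Annals is the baseline, S = 0).
EXPORT_SCORES = {
    "Annals": 0.00,
    "Biometrics": -1.19,
    "Biometrika": -0.35,
    "Communications": -3.27,
    "JASA": -0.81,
    "JRSS B": -0.06,
    "JRSS C": -1.30,
    "Technometrics": -0.98,
}


def export_probability(a, b, scores=EXPORT_SCORES):
    """P(a exports to b | a and b trade), obtained from log odds S_a - S_b."""
    log_odds = scores[a] - scores[b]
    # Inverse of the log odds transform (the logistic function).
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Annals versus JASA: log odds 0.00 - (-0.81) = 0.81.
    print(round(export_probability("Annals", "JASA"), 3))
```

For example, the Annals versus JASA comparison has log odds 0.81, which corresponds to a probability of about 0.69 that a given citation between the two flows from the Annals to JASA. Only differences of scores matter, which is why one journal can be fixed at zero.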

Later, he puts the journals into three groups: theory (Annals, Biometrika, JRSS B), applied (Biometrics, Technometrics, JRSS C) and mixed (JASA). The result is:

Theory 0.00
Mixed -0.66
Applied -0.99

Again we see a flow from theory to applied. The importance of this finding should not be underestimated. Quoting again from the paper:

Thus, we have striking evidence supporting a fundamental role for basic theory that runs strongly counter to the sometimes voiced claim that basic theory is not relevant to applicable methodology.

Another interesting finding in the paper is that there is very little intellectual trade between statistics journals and probability journals.

2. Conclusion

I am fascinated by this paper. The role of theory versus applied work is sometimes controversial and I believe this is one of the few quantitative studies about this issue. The data on which the study was based are now out of date. It would be great if someone would do an updated analysis. And of course we have many new sources of information such as Google Scholar.

3. Reference

Stigler, S.M. (1994). Citation patterns in the journals of statistics and probability. Statistical Science, 94-108.

The Robins-Ritov Example: A Post-Mortem

This post is a follow-up to the two earlier posts on the Robins-Ritov example. We don’t want to prolong the debate but, rather, just summarize our main points.

1. Summary

  1. The Horvitz-Thompson estimator {\hat \psi} satisfies the following condition: for every {\epsilon>0},

    \displaystyle  \sup_{\theta\in\Theta}\mathbb{P}(|\hat \psi - \psi| > \epsilon) \leq 2 \exp\left(- 2 n \epsilon^2 \delta^2\right) \ \ \ \ \ (1)

    where {\Theta}, the parameter space, is the set of all functions {\theta: [0,1]^d \rightarrow [0,1]}. (There are practical improvements to the Horvitz-Thompson estimator that we discussed in our earlier posts but we won’t revisit those here.)

  2. A Bayes estimator requires a prior {W(\theta)} for {\theta}. In general, if {W(\theta)} is not a function of {\pi} then (1) will not hold. (And in our earlier post we argued that in realistic settings, the prior would in fact not depend on {\pi}.)
  3. If you let {W} be a function of {\pi}, (1) still, in general, does not hold.
  4. If you make {W} a function of {\pi} in just the right way, then (1) will hold. Stefan Harmeling and Marc Toussaint have a nice paper which shows one way to do this. And we showed an improved Bayesian estimator that depends on {\pi} in our earlier post. There is nothing wrong with doing this, but in our opinion this is not in the spirit of Bayesian inference. Constructing a Bayesian estimator to have good frequentist properties is really just frequentist inference.
  5. Chris Sims pointed out in his notes that the Bayes estimator does well in the parametric case. We agree: we never said otherwise. To quote from Chris’ notes: “I think probably the arguments Robins and Wasserman want to make do depend fundamentally on infinite-dimensionality – that is, on considering a situation where {\theta(\cdot)} lies in an infinite-dimensional space and we want to avoid restricting ourselves to a topologically small subset of that space in advance.” That’s exactly correct. The problem we are discussing is the nonparametric case.
  6. The supremum in (1) is important. When we say that the estimator concentrates around the truth uniformly, we are referring to the presence of the supremum. A Bayes estimator can converge in the non-uniform sense. That is, it can satisfy

    \displaystyle  \mathbb{P}(|\hat \psi - \psi| > \epsilon) \leq 2 \exp\left(- 2 n \epsilon^2 \delta^2\right) \ \ \ \ \ (2)

    for some {\theta}’s in {\Theta}. In particular, if the prior {W(\theta)} is highly concentrated around some function {\theta_0} and if {\theta_0} happens to be the true function, then of course something like (2) will hold. But if the prior is not concentrated around the truth, (1) won’t hold.

  7. This example is only meant to show that Bayesian estimators do not necessarily have good frequentist properties. This should not be surprising. There is no reason why we should in general expect a Bayesian method to have a frequentist property like (1).
  8. This example was presented in a simplified form to make it clear. In an observational study, the function {\pi} is also unknown. In that case, when {X} is high dimensional, the best that can be hoped for is a “doubly robust” (DR) estimator that performs well if either (but not necessarily both) {\pi} or {\theta} is accurately modelled. The locally semiparametric efficient regression of our original post with {\pi} estimated is an example. DR estimators are now routinely used in biostatistics. They have also caught the attention of researchers at Google (Lambert and Pregibon 2007, Chan, Ge, Gershony, Hesterberg and Lambert 2010) and Yahoo! (Dudik, Langford and Li 2011). Bayesian approaches to modelling {\pi} and {\theta} have been used in the construction of the DR estimator (Cefalu, Dominici, and Parmigiani 2012).

2. A Sociological Comment

We are surprised by how defensive Bayesians are when we present this example. Consider the following (true) story.

One day, Professor X showed LW an example where maximum likelihood does not do well. LW’s response was to shrug his shoulders and say: “that’s interesting. I won’t use maximum likelihood for that example.”

Professor X was surprised. He felt that by showing one example where maximum likelihood fails, he had discredited maximum likelihood. This is absurd. We use maximum likelihood when it works well and we don’t use maximum likelihood when it doesn’t work well.

When Bayesians see the Robins-Ritov example (or other similar examples) why don’t they just shrug their shoulders and say: “that’s interesting. I won’t use Bayesian inference for that example.” Some do. But some feel that if Bayes fails in one example then their whole world comes crashing down. This seems to us to be an over-reaction.

3. References

Cefalu, M., Dominici, F. and Parmigiani, G. (2012). Model Averaged Double Robust Estimation. Harvard University Biostatistics Working Paper Series.

Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating online ad campaigns in a pipeline: causal models at scale. Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, 7-16.

Dudik, M., Langford, J. and Li, L. (2011). Doubly Robust Policy Evaluation and Learning. Arxiv preprint arXiv:1103.4601.

Harmeling, S. and Toussaint, M. (2007). Bayesian Estimators for Robins-Ritov’s Problem. Technical Report. University of Edinburgh, School of Informatics.

Lambert, D. and Pregibon, D. (2007). More bang for their bucks: assessing new features for online advertisers. Proceedings of the 1st international workshop on Data mining and audience intelligence for advertising, 7-15.

The Normalizing Constant Paradox

Recently, there was a discussion on stack exchange about an example in my book. The example is a paradox about estimating normalizing constants. The analysis of the problem in my book is wrong; more precisely, the analysis in my book is meant to show that just blindly applying Bayes’ rule does not always yield a correct posterior distribution.

This point was correctly noted by a commenter at stack exchange who uses the name “Zen.” I don’t know who Zen is, but he or she correctly identified the problem with the analysis in my book.

However, it is still an open question how to do the analysis properly. Another commenter, David Rohde, identified the usual proposal which I’ll review below. But as I’ll explain, I don’t think the usual answer is satisfactory.

The purpose of this post is to explain the paradox and then to ask the question: does anyone know how to solve the problem correctly?

The example, by the way, is due to my friend Ed George.

1. Problem Description

The posterior for a parameter {\theta} given data {Y_1,\ldots, Y_n} is

\displaystyle  p(\theta|Y_1,\ldots, Y_n) =\frac{L(\theta)\pi(\theta)}{c}

where {L(\theta)} is the likelihood function, {\pi(\theta)} is the prior and {c= \int L(\theta)\pi(\theta)\, d\theta} is the normalizing constant. Notice that the function {L(\theta)\pi(\theta)} is known.

In complicated models, especially where {\theta} is a high-dimensional vector, it is not possible to do the integral {c= \int L(\theta)\pi(\theta)\, d\theta}. Fortunately, we may not need to know the normalizing constant. However, there are occasions where we do need to know it. So how can we compute {c} when we can’t do the integral?

In many cases we can use simulation methods (such as MCMC) to draw a sample {\theta_1,\ldots, \theta_n} from the posterior. The question is: how can we use the sample {\theta_1,\ldots, \theta_n} from the posterior to estimate {c}?

More generally, suppose that

\displaystyle  f(\theta) = \frac{g(\theta)}{c}

where {g(\theta)} is known but we cannot compute the integral {c = \int g(\theta) d \theta}. Given a sample {\theta_1,\ldots, \theta_n \sim f}, how do we estimate {c}?

2. Frequentist Estimator

We can use the sample to compute a density estimator {\hat f(\theta)} of {f(\theta)}. Note that {c = g(\theta)/f(\theta)} for all {\theta}. This suggests the estimator

\displaystyle  \hat c = \frac{g(\theta_0)}{\hat f(\theta_0)}

where {\theta_0} is an arbitrary value of {\theta}.
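To make the estimator concrete, here is a minimal Python sketch. It rests on assumptions that are not part of the discussion above: a toy target {g(\theta)=e^{-\theta^2/2}}, for which the true constant is {c=\sqrt{2\pi}}, exact normal draws standing in for MCMC output, and a Gaussian kernel density estimator with a rule-of-thumb bandwidth. The names `kde_at` and `estimate_c` are made up for the illustration.

```python
import math
import random
import statistics


def g(theta):
    """Known unnormalized density (toy assumption): exp(-theta^2 / 2)."""
    return math.exp(-0.5 * theta * theta)


def kde_at(sample, x):
    """Gaussian kernel density estimate of f at the point x."""
    n = len(sample)
    # Rule-of-thumb bandwidth (an illustrative choice, not tuned).
    h = 1.06 * statistics.stdev(sample) * n ** (-0.2)
    total = sum(math.exp(-0.5 * ((x - t) / h) ** 2) for t in sample)
    return total / (n * h * math.sqrt(2.0 * math.pi))


def estimate_c(sample, theta0=0.0):
    """Plug-in estimator c_hat = g(theta0) / f_hat(theta0)."""
    return g(theta0) / kde_at(sample, theta0)


random.seed(0)
# Stand-in for MCMC output: exact draws from f = g / c, i.e. N(0, 1).
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]
c_hat = estimate_c(sample)
print(c_hat, math.sqrt(2.0 * math.pi))
```

The answer depends on both {\theta_0} and the bandwidth, so in practice one would pick {\theta_0} in a region where the sample is dense and {\hat f} is well estimated.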

This is only one possible estimator. In fact, there is much research on the problem of finding good estimators of {c} from the sample. As far as I know, all of them are frequentist.

As David Rohde notes on stack exchange, there is a certain irony to the fact that Bayesians use frequentist methods to estimate the normalizing constant of their posterior distributions.

3. A Bogus Bayesian Analysis

Let’s restate the problem. We have a sample {\theta_1,\ldots, \theta_n} from {f(\theta)=g(\theta)/c}. The function {g(\theta)} is known but we don’t know the constant {c = \int g(\theta) d\theta} and it is not feasible to do the integral.

In my book, I consider the following Bayesian analysis. The analysis is wrong, as I’ll explain in a minute.

We have an unknown quantity {c} and some data {\theta_1,\ldots, \theta_n}. We should be able to do Bayesian inference for {c}. We start by placing a prior {h(c)} on {c}. The posterior is obtained by multiplying the prior and the likelihood:

\displaystyle  h(c|\theta_1,\ldots, \theta_n) = h(c) \prod_{i=1}^n \frac{g(\theta_i)}{c} \propto h(c) c^{-n}

where we dropped the terms {g(\theta_i)} since they are known.

The “posterior” {h(c|\theta_1,\ldots, \theta_n) \propto h(c) c^{-n}} is useless. It does not depend on the data. And it may not even be integrable.

The point of the example was to show that blindly applying Bayes’ rule is not always wise. As I mentioned earlier, Zen correctly notes that my application of Bayes’ rule is not valid. The reason is that I acted as if we had a family of densities {f(\theta|c)} indexed by {c}. But we don’t: {f(\theta)=g(\theta)/c} is a valid density only for one value of {c}, namely, {c = \int g(\theta)d\theta}. (To get a valid posterior from Bayes’ rule, we need a family {f(x|\psi)} which is a valid distribution for {x}, for each value of {\psi}.)
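A quick numerical check makes the point about the missing family of densities visible. The sketch below assumes the same kind of toy target, {g(\theta)=e^{-\theta^2/2}}, and integrates {g(\theta)/c} with a simple trapezoid rule for a few candidate values of {c}; the function name `total_mass` and the integration range are choices made only for this illustration.

```python
import math


def g(theta):
    """Known unnormalized density (toy assumption): exp(-theta^2 / 2)."""
    return math.exp(-0.5 * theta * theta)


def total_mass(c, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoid-rule approximation of the integral of g(theta) / c."""
    width = (hi - lo) / steps
    area = 0.5 * (g(lo) + g(hi))
    area += sum(g(lo + i * width) for i in range(1, steps))
    return area * width / c


true_c = math.sqrt(2.0 * math.pi)
for c in (1.0, true_c, 5.0):
    # Only c = true_c gives total mass 1, i.e. a genuine density.
    print(c, total_mass(c))
```

For every other value of {c} the function {g(\theta)/c} does not integrate to one, so treating it as a likelihood for {c} is not legitimate.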

4. A Correct Bayesian Analysis?

The usual Bayesian approach that I have seen is to pretend that the function {g} is unknown. Then we place a prior on {g} (such as a Gaussian process prior) and proceed with a Bayesian analysis. However, this seems unsatisfactory. It seems to me that we should be able to get a valid Bayesian estimator for {c} without pretending not to know {g}.

Christian Robert discussed the problem on his blog. If I understand what Christian has written, he claims that this cannot be considered a statistical problem and that we can’t even put a prior on {c} because it is a constant. I don’t find this point of view convincing. Isn’t the whole point of Bayesian inference that we can put distributions on fixed but unknown constants? Christian says that this is a numerical problem not a statistical problem. But we have data sampled from a distribution. To me, that makes it a statistical problem.

5. The Answer Is …

So what is a valid Bayes estimator of {c}? Pretending I don’t know {g} or simply declaring it to be a non-statistical problem seems like giving up.

I want to emphasize that this is not meant in any way as a critique of Bayes. I really think there should be a good Bayesian estimator here but I don’t know what it is.

Anyone have any good ideas?

Testing Millions of Hypotheses: FDR

In the olden days, multiple testing meant testing 8 or 9 hypotheses. Today, multiple testing can involve testing thousands or even millions of hypotheses.

A revolution occurred with the publication of Benjamini and Hochberg (1995). The method introduced in that paper has made it feasible to test huge numbers of hypotheses with high power. The Benjamini and Hochberg method is now standard in areas like genomics.

1. Multiple Testing

We want to test a large number of null hypotheses {H_1,\ldots, H_N}. Let {H_j =0} if the {j^{\rm th}} null hypothesis is true and let {H_j =1} if the {j^{\rm th}} null hypothesis is false. For example, {H_j} might be the hypothesis that there is no difference in mean gene expression level between healthy and diseased tissue, for the {j^{\rm th}} gene.

For each hypothesis {H_j} we have a test statistic {Z_j} and a p-value {P_j} computed from the test statistic. If {H_j} is true (no difference) then {P_j} has a uniform distribution {U} on {(0,1)}. If {H_j} is false (there is a difference) then {P_j} has some other distribution, typically more concentrated towards 0.

If we were testing one hypothesis, we would reject the null hypothesis if the p-value is less than {\alpha}. The type I error (the probability of a false rejection) is then {\alpha}. But in multiple testing we can’t simply reject all hypotheses for which {P_j \leq \alpha}. When {N} is large, we will make many type I errors.

A common and very simple way to fix the problem is the Bonferroni method: reject when {P_j\leq \alpha/N}. The set of rejected hypotheses is

\displaystyle  R = \Bigl\{j:\ P_j \leq \frac{\alpha}{N}\Bigr\}.

It follows from the union bound that

\displaystyle  {\rm Probability\ of\ any\ false\ rejections\ }= P(R \cap {\cal N} \neq \emptyset) \leq \alpha

where {{\cal N} = \{j:\ H_j =0\}} is the set of true null hypotheses.

The problem with the Bonferroni method is that the power — the probability of rejecting {H_j} when {H_j=1} — goes to 0 as {N} increases.
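Here is a small Python sketch of the Bonferroni rule, together with a calculation of how its power decays. The power formula assumes a specific toy setting that is not in the text above: one-sided z-tests with {Z_j \sim N(\mu,1)} under the alternative and {P_j = 1-\Phi(Z_j)}. The function names and the example p-values are invented for the illustration.

```python
from statistics import NormalDist


def bonferroni(pvalues, alpha=0.05):
    """Indices j with P_j <= alpha / N."""
    cutoff = alpha / len(pvalues)
    return [j for j, p in enumerate(pvalues) if p <= cutoff]


def bonferroni_power(n_tests, mu, alpha=0.05):
    """Power of one Bonferroni-corrected one-sided z-test.

    Assumes Z_j ~ N(mu, 1) under the alternative and P_j = 1 - Phi(Z_j).
    """
    z = NormalDist()
    critical = z.inv_cdf(1.0 - alpha / n_tests)
    return 1.0 - z.cdf(critical - mu)


pvalues = [0.039, 0.001, 0.5, 0.028, 0.025]
print(bonferroni(pvalues))  # cutoff is 0.05 / 5 = 0.01
for n_tests in (1, 100, 10**4, 10**6):
    # For a fixed effect size mu, the power shrinks as N grows.
    print(n_tests, bonferroni_power(n_tests, mu=3.0))
```

With the effect size held fixed, the critical value grows with {N}, so the chance of detecting any given false null keeps falling.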

2. FDR

Instead of controlling the probability of any false rejections, the Benjamini-Hochberg (BH) method controls the false discovery rate (FDR) defined to be

\displaystyle  {\rm FDR} = \mathbb{E}({\rm FDP})

where

\displaystyle  {\rm FDP} = \frac{F}{R},

{F} is the number of false rejections and {R} is the number of rejections (with FDP taken to be 0 when {R=0}). Here, FDP is the false discovery proportion.

The BH method works as follows. Let

\displaystyle  P_{(1)}\leq P_{(2)}\leq \cdots \leq P_{(N)}

be the ordered p-values. The rejection set is

\displaystyle  R = \Bigl\{j:\ P_j \leq T\Bigr\}

where {T=P_{(j)}} and

\displaystyle  j = \max \left\{ s:\ P_{(s)} \leq \frac{\alpha s}{N} \right\}.

(If the p-values are not independent, an adjustment may be required.) Benjamini and Hochberg proved that, if this method is used then {{\rm FDR} \leq \alpha}.
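The procedure is short enough to write out directly. The following Python sketch follows the definition above; the function name `benjamini_hochberg`, the example p-values, and the convention of rejecting nothing when no ordered p-value falls below its cutoff are assumptions made for the illustration.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return (rejected indices, threshold T) for the BH step-up rule."""
    n = len(pvalues)
    ordered = sorted(pvalues)
    # Largest s (1-based) with P_(s) <= alpha * s / N; 0 if there is none.
    s_max = 0
    for s, p in enumerate(ordered, start=1):
        if p <= alpha * s / n:
            s_max = s
    if s_max == 0:
        return [], None  # assumed convention: nothing is rejected
    threshold = ordered[s_max - 1]
    rejected = [j for j, p in enumerate(pvalues) if p <= threshold]
    return rejected, threshold


pvalues = [0.039, 0.001, 0.5, 0.028, 0.025]
rejected, threshold = benjamini_hochberg(pvalues, alpha=0.05)
print(rejected, threshold)
```

In this example the ordered p-values are 0.001, 0.025, 0.028, 0.039, 0.5 and the cutoffs {\alpha s/N} are 0.01, 0.02, 0.03, 0.04, 0.05. The second ordered p-value misses its own cutoff, but the maximum is attained at {s=4}, so {T=0.039} and four hypotheses are rejected. Bonferroni, with cutoff 0.01, would reject only one.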

3. Why Does It Work?

The original proof that {{\rm FDR} \leq \alpha} is a bit complicated. A slick martingale proof can be found in Storey, Taylor and Siegmund (2003). Here, I’ll give a less than rigorous but very simple proof.

Suppose the fraction of true nulls is {\pi}. The distribution of the p-values can be written as

\displaystyle  G = \pi U + (1-\pi)A

where {U(t)=t} is the uniform (0,1) distribution (the nulls) and {A} is some other distribution on (0,1) (the alternatives). Let

\displaystyle  \hat G(t) = \frac{1}{N}\sum_{j=1}^N I(P_j \leq t)

be the empirical distribution of the p-values. Suppose we reject all hypotheses whose p-values are less than or equal to {t}. Now

\displaystyle  F = \sum_{j=1}^N I(P_j\leq t, H_j=0)

and

\displaystyle  R = \sum_{j=1}^N I(P_j\leq t) = N \hat{G}(t).

Therefore,

\displaystyle  \begin{array}{rcl}  \mathbb{E}\left( \frac{F}{R}\right) &=& \mathbb{E}\left(\frac{\sum_{j=1}^N I(P_j\leq t, H_j=0)}{\sum_{j=1}^N I(P_j\leq t)}\right)\\ &=& \frac{\mathbb{E}\left(\sum_{j=1}^N I(P_j\leq t, H_j=0)\right)} {\mathbb{E}\left(\sum_{j=1}^N I(P_j\leq t)\right)} + O\left(\sqrt{\frac{1}{N}}\right)\\ &=& \frac{N P(P_j \leq t|H_j=0) P(H_j=0)}{N G(t)} + O\left(\sqrt{\frac{1}{N}}\right)\\ &=& \frac{t\pi}{G(t)} + O\left(\sqrt{\frac{1}{N}}\right)= \frac{t\pi}{\hat G(t)} + O_P\left(\sqrt{\frac{1}{N}}\right) \end{array}

and so

\displaystyle  {\rm FDR} \approx \frac{t\pi}{\hat G(t)}.

Now let {t} be equal to one of the ordered p-values, say {P_{(j)}}. Thus {t = P_{(j)}}, {\hat G(t) = j/N} and

\displaystyle  {\rm FDR} \approx \frac{\pi P_{(j)}N}{j} \leq \frac{P_{(j)}N}{j}.

Setting the right hand side to be less than or equal to {\alpha} yields

\displaystyle  \frac{N P_{(j)}}{j} \leq \alpha

or in other words, choose {t=P_{(j)}} to satisfy

\displaystyle  P_{(j)} \leq \frac{\alpha j}{N}

which is exactly the BH method.

To summarize: we reject all hypotheses whose p-values are less than or equal to {T=P_{(j)}} where

\displaystyle  j = \max \left\{ s:\ P_{(s)} \leq \frac{\alpha s}{N} \right\}.

We then have the guarantee that {{\rm FDR} \leq \alpha}.

The method is simple and, unlike Bonferroni, the power does not go to 0 as {N\rightarrow\infty}.
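A small simulation gives a feel for both claims. The sketch below is only an illustration under assumed settings: independent p-values, a fraction {\pi=0.8} of true nulls, {\alpha = 0.1}, and a toy alternative distribution in which {P=U^8} for uniform {U}. None of these choices come from the discussion above, and the function names are made up.

```python
import random


def bh_threshold(pvalues, alpha):
    """BH threshold T = P_(j), or None if no p-value qualifies."""
    n = len(pvalues)
    threshold = None
    for s, p in enumerate(sorted(pvalues), start=1):
        if p <= alpha * s / n:
            threshold = p
    return threshold


def simulate(n_tests=1000, pi=0.8, alpha=0.1, reps=200, seed=1):
    """Monte Carlo estimates of the FDR and the power of BH."""
    rng = random.Random(seed)
    n_null = round(pi * n_tests)
    n_alt = n_tests - n_null
    fdp_sum = power_sum = 0.0
    for _ in range(reps):
        nulls = [rng.random() for _ in range(n_null)]
        # Toy alternative (an assumption): P = U^8, concentrated near 0.
        alts = [rng.random() ** 8 for _ in range(n_alt)]
        t = bh_threshold(nulls + alts, alpha)
        if t is None:
            continue  # no rejections: FDP = 0 and no power this round
        false_rej = sum(p <= t for p in nulls)
        true_rej = sum(p <= t for p in alts)
        fdp_sum += false_rej / (false_rej + true_rej)
        power_sum += true_rej / n_alt
    return fdp_sum / reps, power_sum / reps


fdr, power = simulate()
print(fdr, power)
```

The heuristic argument above suggests that the average FDP should sit near {\pi\alpha = 0.08}, below the nominal level {\alpha}, while a substantial fraction of the false nulls is still rejected even with {N=1000} tests.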

There are now many modifications to the BH method. For example, instead of controlling the mean of the FDP you can choose {T} so that

\displaystyle  P({\rm FDP} > \alpha) \leq \beta

which is called FDP control (Genovese and Wasserman 2006). One can also weight the p-values (Genovese, Roeder, and Wasserman 2006).

4. Limitations

FDR methods control the error rate while maintaining high power. But it is important to realize that these methods give weaker control than Bonferroni. FDR controls the (expected) fraction of false rejections. Bonferroni controls the probability of making any false rejections at all. Which you should use is very problem dependent.

5. References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57, 289–300.

Genovese, C.R., Roeder, K. and Wasserman, L. (2006). False discovery control with p-value weighting. Biometrika, 93, 509-524.

Genovese, C.R. and Wasserman, L. (2006). Exceedance control of the false discovery proportion. Journal of the American Statistical Association, 101, 1408-1417.

Storey, J.D., Taylor, J.E. and Siegmund, D. (2003). Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66, 187–205.