Monthly Archives: November 2012

The Density Cluster Tree: A Guest Post

Today, we have a guest post by Sivaraman Balakrishnan. Siva is a graduate student in the School of Computer Science. The topic of his post is an algorithm for clustering. The algorithm finds the connected components of the level sets of an estimate of the density. The algorithm — due to Kamalika Chaudhuri and Sanjoy DasGupta— is very simple and comes armed with strong theoretical guarantees.

Aside: Siva’s post mentions John Hartigan. John is a living legend in the field of statistics. He’s done fundamental work on clustering, Bayesian inference, large sample theory, subsampling and many other topics. It’s well worth doing a Google search and reading some of his papers.

Before getting to Siva’s post, here is a picture of a density and some of its level set clusters. The algorithm Siva will describe finds the clusters at all levels (which then form a tree).

THE DENSITY CLUSTER TREE
by
SIVARAMAN BALAKRISHNAN

1. Introduction

Clustering is widely considered challenging both practically and theoretically. One of the main reasons for this is that often the true goals of clustering are not clear and this makes clustering seem poorly defined.

One of the most concrete and intuitive ways to define clusters when data are drawn from a density {f} is on the basis of level sets of {f}, i.e. for any {\lambda} the connected components of

\displaystyle  \{x: f(x) \geq \lambda\}

form the clusters at level {\lambda}. This leaves the question of how to select the “correct” {\lambda}, and typically we simply sweep over {\lambda} and present what is called the density cluster tree.

However, we usually do not have access to {f} and would like to estimate the cluster tree of {f} given samples drawn from {f}. Recently, Kamalika Chaudhuri and Sanjoy Dasgupta (henceforth CD) presented a simple estimator for the cluster tree in a really nice paper: Rates of convergence for the cluster tree (NIPS, 2010) and showed it was consistent in a certain sense.

This post is about the notion of consistency, the CD estimator, and its analysis.

2. Evaluating an estimator: Hartigan’s consistency

The first notion of consistency for an estimated cluster tree was introduced by J.A. Hartigan in his paper: Consistency of single linkage for high-density clusters (JASA, 1981).

Given some estimator {\Theta_n} of the cluster tree of {f} (i.e. a collection of hierarchically nested sets), we say it is consistent if:

For any sets {A,A^\prime \subset \mathbb{R}^d}, let {A_n} (respectively {A^\prime_n}) denote the smallest cluster of {\Theta_n} containing the samples in {A} (respectively {A}). {\Theta_n} is consistent if, whenever {A} and {A^\prime} are different connected components of {\{x : f(x) \geq \lambda \}} (for some {\lambda > 0}), {P(A_n} is disjoint from {A^\prime_n) \rightarrow 1} as {n \rightarrow \infty}.

Essentially, we want that if we have two separated clusters at some level {\lambda} then the cluster tree must reflect this, i.e. the smallest clusters containing the samples from each of these clusters must be disconnected from each other.

To give finite sample bounds, CD introduced a notion of saliently separated clusters and showed that these clusters can be identified using a small number of samples (as a by-product their results also imply Hartigan consistency for their estimators). Informally, clusters are saliently separated if they satisfy two conditions.

  1. Separation in {\mathbb{R}^d}: We would expect that clusters that are two close cannot be identified in a finite sample.
  2. Separation in the density {f}: There should a sufficiently big region of low density separating the clusters. Again, we would expect that if the “bridge” between the clusters doesn’t dip enough then we might (incorrectly) conclude they are the same cluster from a finite sample.

3. An algorithm

The CD estimator is based on the following algorithm:

  1. INPUT: {k}

  2. For {r = [0,\infty)}, discard all points with {r_k(x) > r} where {r_k(x)} is the distance to the {k^{\rm th}} nearest neighbor of {x}. Connect {x,y} if {||x-y|| \leq \alpha r}, to form {G(r)}.

  3. OUTPUT: Return connected components of {G(r)}.

CD show that their estimator is consistent (and give finite sample rates for saliently separated clusters) if we select {k \sim d \log n} and {\alpha \geq \sqrt{2}}.

It is actually true that any density estimate that is uniformly close to the true density can be used to construct a Hartigan consistent estimator. However, this involves finding the connected components of the level sets of the estimator, which can be hard. The nice thing about the CD estimator is that it is completely algorithmic.

3.1. Detour 1: Single linkage

Single linkage is a popular linkage clustering algorithm, and essentially corresponds to the case of {\alpha = 1, k = 2}. Given its popularity an important question to ask is whether the single linkage tree is Hartigan consistent. This was answered by Hartigan in his original paper, affirmatively for {d=1} but negatively for {d > 1}.

The main issue is an effect called “chaining” which causes single linkage to merge clusters before fully connecting up the clusters within themselves. The reason for this is that single-linkage is not sufficiently sensitive to the density separation, i.e. even if there is a region of low density between two clusters single linkage might form a “chain” across it because it is mostly oblivious to the density of the sample.

Returning to the CD estimator: one intuitive way to understand the estimator is to observe that for a fixed {r}, the first step discards points on the basis of their distance to their {k}-th NN. This is essentially cleaning the sample to remove points in regions of low-density (as measured by a {k}-NN density estimate). This step makes the algorithm density sensitive and prevents chaining.

4. Analysis

So how does one analyze the CD estimator? We essentially need to show that for any two saliently separated clusters at a level {\lambda}, there is some radius {r} at which:

  1. We do not clean out any points in the clusters.
  2. The clusters are internally connected at the radius {\alpha r}.
  3. The clusters are mutually separated at this radius.

To establish each of these we will first need to understand how the {k}-NN distance of the sample points behave given a finite sample.

4.1. Detour 2: Uniform large deviation inequalities

As before we are given {n} random samples {\{X_1, \ldots, X_n\}} from a density {f} on {\mathbb{R}^d}. Let’s say we are interested in a measurable subset {A} of the space {\mathbb{R}^d}. A fundamental question is: How close is the empirical mass of {A} to the true mass of {A}? i.e. we would like to relate the quantities {f_n(A) = n^{-1}\sum_{i=1}^n \mathbb{I} (X_i \in A)} and {f(A)\equiv \int_A f(x)dx}. Notice that this is essentially the same as the question: if I toss a coin with bias {f(A)}, {m} times, on average how many heads will I see?

A standard way to answer these questions quantitatively is using a large deviation inequality like Hoeffding’s inequality. Often we have multiple sets {\mathcal{A} = \{A_1, \ldots, A_N\}} and we’d like to relate {f_n(A_i)} to {f(A_i)} for each of these sets. One quantity we might be interested in is

\displaystyle \Delta = \sup_i | f_n(A_i) - f(A_i)|

and quantitative estimates of {\Delta} are called uniform convergence results.

One surprising fact is that even for infinite collections of sets {\mathcal{A}} we can sometimes still get good bounds on {\Delta} if we can control the complexity of {\mathcal{A}}. Typical ways to quantify this complexity are things like covering numbers, VC dimension etc.

To conclude this detour here is an example of this in action. Let {\mathcal{A}} be the collection of all balls (with any center and radius) in {\mathbb{R}^d}, if {k \geq d \log n} then with probability {1-\delta}, for any {B \in \mathcal{A}}

\displaystyle  \begin{array}{rcl}  f(B) \geq \frac{k}{n} + C \log (1/\delta) \frac{\sqrt{kd \log n}}{n} \implies f_n(B) \geq \frac{k}{n} \\ f(B) \leq \frac{k}{n} - C \log (1/\delta) \frac{\sqrt{kd \log n}}{n} \implies f_n(B) < \frac{k}{n} \end{array}

The inequalities above are uniform convergence inequalities over the sets of all balls in {\mathbb{R}^d}. Although this is an infinite collection of sets it has a small VC dimension of {d+1}, and this lets us uniformly relate the true and empirical mass of each of these sets.

In the context of the CD estimator what this detour assures us is that {k}-NN distance of the every sample point is close to the “true” {k}-NN distance. This lets us show that for an appropriate radius we will not remove any point from a high-density cluster (because it will have a small {k}-NN distance) or that we will remove all points in the low-density region that separates salient clusters (because they will have large {k}-NN distance). Results like this let us establish consistency of the CD estimator.

References.

Chaudhuri, K. and DasGupta, S. (2010). Rates of convergence for the cluster tree. Advances in Neural Information Processing Systems, 23, 343–351.

Hartigan, J. (1981). Consistency of single linkage for high-density clusters. Journal of the American Statistical Association, 76, 388-394.

WHAT IS BAYESIAN/FREQUENTIST INFERENCE?

WHAT IS BAYESIAN/FREQUENTIST INFERENCE?

When I started this blog, I said I wouldn’t write about the Bayes versus Frequentist thing. I thought that was old news.

But many things have changed my mind. Nate Silver’s book, various comments on my blog, comments on other blogs, Sharon McGrayne’s book, etc have made it clear to me that there is still a lot of confusion about what Bayesian inference is and what Frequentist inference is.

I believe that many of the arguments about Bayes versus Frequentist are really about: what is the definition of Bayesian inference?

1. Some Obvious (and Not So Obvious) Statements

Before I go into detail, I’ll begin by making a series of statements.

Frequentist Inference is Great For Doing Frequentist Inference.
Bayesian Inference is Great For Doing Bayesian Inference.

Frequentist inference and Bayesian Inference are defined by their goals, not their methods.

A Frequentist analysis need not have good Bayesian properties.
A Bayesian analysis need not have good frequentist properties.

Bayesian Inference {\neq} Using Bayes Theorem

Bayes Theorem {\neq} Bayes Rule

Bayes Nets {\neq} Bayesian Inference

Frequentist Inference is not superior to Bayesian Inference.
Bayesian Inference is not superior to Frequentist Inference.
Hammers are not superior to Screwdrivers.

Confidence Intervals Do Not Represent Degrees of Belief.
Posterior Intervals Do Not (In General) Have Frequency Coverage Properties.

Saying That Confidence Intervals Do Not Represent Degrees of Belief Is Not a Criticism of Frequentist Inference.
Saying That Posterior Intervals Do Not Have Frequency Coverage Properties Is Not a Criticism of Bayesian Inference.

Some Scientists Misinterpret Confidence Intervals as Degrees of Belief.
They Also Misinterpret Bayesian Intervals as Confidence Intervals.

Mindless Frequentist Statistical Analysis is Harmful to Science.
Mindless Bayesian Statistical Analysis is Harmful to Science.

2. The Definition of Bayesian and Frequentist Inference

Here are my definitions. You may have different definitions. But I am confident that my definitions correspond to the traditional definitions used in statistics for decades.

But first, I should say that Bayesian and Frequentist inference are defined by their goals not their methods.

The Goal of Frequentist Inference: Construct procedure with frequency guarantees. (For example, confidence intervals.)

The Goal of Bayesian Inference: Quantify and manipulate your degrees of beliefs. In other words, Bayesian inference is the Analysis of Beliefs.

(I think I got the phrase, “Analysis of Beliefs” from Michael Goldstein.)

My point is that “using Bayes theorem” is neither necessary or sufficient for defining Bayesian inference. A frequentist analysis could certainly include the use of Bayes’ theorem. And conversely, it is possible to do Bayesian inference without using Bayes’ theorem (as Michael Goldstein, for example, has shown). Let me summarize this point in a table:

Fairly soon I am going to post a review of Nate Silver’s new book. (Short review: great book. Buy it and read it.) As I will discuss in that review, Nate argues forcefully that Bayesian analysis is superior to Frequentist analysis. But then he spends most of the book assessing predictions by how good their frequency properties are. For example, he says that a weather forecaster is good if it rains 95 percent of the times he says there is a 95 percent chance of rain. In others, he loves to use Bayes’ theorem but his goals are overtly frequentist. I’ll say more about this in my review of his book. I use it here as an example of how one can be a user of Bayes theorem and still have frequentist goals.

3. Coverage

An example of a frequency guarantee is coverage. Let {\theta = T(P)} be a function of a distribution {P}. Let {{\cal P}} be a set of distributions. Let {X_1,\ldots, X_n \sim P} be a sample from some {P\in {\cal P}}. Finally, let {C_n = C(X_1,\ldots,X_n)} be a set valued mapping. Then {C_n} has coverage {1-\alpha} if

\displaystyle  \inf_{P\in {\cal P}}P^n( T(P) \in C_n) \geq 1-\alpha

where {P^n} is the {n}-fold product measure defined by {P}.

We say that {C_n} is a {1-\alpha} confidence set if it has coverage {1-\alpha}. A Bayesian {1-\alpha} posterior set will not (in general) have coverage {1-\alpha}. This is not a criticism of Bayesian inference, although anytime I mention this point, some people seem to take it that way. Bayesian inference is about the Analysis of Beliefs; it makes no claims about coverage.

I think there would be much less disagreement and confusion if we used different symbols for frequency probabilities and degree-of-belief probabilities. For example, suppose we used {{\sf Fr}} for frequentist statements and {{\sf Bel}} for degree-of-belief statements. Then the fact that coverage and posterior probability are different would be written

\displaystyle  {\sf Fr}_\theta(\theta\in C_n) \neq {\sf Bel}(\theta \in C_n|X_1,\ldots,X_n).

Unfortunately, we use the same symbol {P} for both in which case the above statement becomes

\displaystyle  P_\theta(\theta\in C_n) \neq P(\theta \in C_n|X_1,\ldots,X_n)

which, I think, just makes things confusing.

Of course, there are cases where Bayes and Frequentist methods agree, or at least, agree approximately. But that should not lull us into ignoring the conceptual differences.

4. Examples

Here are a couple of simple examples.

Example 1. Let {X_1,\ldots, X_n \sim N(\theta,1)\equiv P_\theta} and suppose our prior is {\theta \sim N(0,1)}. Let {B_n} be the equi-tailed 95 percent Bayesian posterior interval. Here is a plot of the frequentist coverage {{\sf Cov}_\theta =P_\theta(\theta\in B_n)} as a function of {\theta}. Note that {{\sf Cov}_\theta} is the frequentist probability that the random interval {B_n} traps {\theta}. ({B_n} is random because it is a function of {X_1,\ldots, X_n}.) Also, plotted is the coverage of the usual confidence interval {C_n=[\overline{X}_n - z_{\alpha/2}/\sqrt{n},\ \overline{X}_n + z_{\alpha/2}/\sqrt{n}]}. This is a constant function, equal to 0.95 for every {\theta}.

Of course, the coverage of {B_n} {{\sf Cov}_\theta} is sometimes higher than {1-\alpha} and sometimes lower. The overall coverage is {\inf_\theta {\sf Cov}_\theta =0} because {{\sf Cov}_\theta} tends to {0} as {|\theta|} increases. At the risk of being very repetitive, this is not meant as a criticism of Bayes. I am just trying to make the difference clear.

Example 2. A {1-\alpha} distribution free confidence interval {C_n} for the median {\theta} of a distribution {P} can be constructed as follows. (This is a standard construction that can be found in any text.) Let {Y_1,\ldots, Y_n \sim P}. Let

\displaystyle  Y_{(1)} \leq Y_{(2)} \leq \cdots Y_{(n)}

denote the order statistics (the ordered values). Choose {k} such that {P(k < B < n-k)\geq 1-\alpha} where {B\sim {\rm Binomial}(n,1/2)}. The confidence interval is {C_n = [Y_{(k+1)},Y_{(n-k)}]}. It is easily shown that

\displaystyle  \inf_P P^n(\theta \in C_n) \geq 1-\alpha

where the infimum is over all distributions {P}. So {C_n} is a {1-\alpha} confidence interval. Here is a plot showing some simulations I did:

The plot shows the first 50 simulations. In the first simulation I picked some distribution {F_1}. Let {\theta_1} be the median of {F_1}. I generated {n=100} observations from {F_1} and then constructed the interval. The confidence interval is the first vertical line. The true value is the dot. For the second simulation, I chose a different distribution {F_2}. Then I generated the data and constructed the interval. I did this many times, each time using a different distribution with a different true median. The blue interval shows the one time that the confidence interval did not trap the median. I did this 10,000 times (only 50 are shown). The interval covered the true value 94.33 % of the time. I wanted to show this plot because, when some texts show confidence interval simulations like this they use the same distribution for each trial. This is unnecessary and it gives the false impression that you need to repeat the same experiment in order to discuss coverage.

How would a Bayesian analyze this problem. The Bayesian analysis of this problem would start with a prior {\pi(P)} on the distribution {P}. This defines a posterior {\pi(P|Y_1,\ldots, Y_n)}. (But the posterior is not obtained via Bayes theorem! There is no dominating measure here. Nonetheless, there is still a well-defined posterior. But that’s a technical point we can discuss another day.) The posterior {\pi(P|Y_1,\ldots, Y_n)} induces a posterior {\pi(\theta|Y_1,\ldots, Y_n)} for the median. And from this we can get a 95 percent Bayesian interval {B_n} say, for the median. The interval {B_n}, of course, depends on the prior {\pi}. I’d love to include a numerical experiment to compare {B_n} and {C_n} but time does not permit. It will make a good homework exercise in a course.

5. Grey Area

There is much grey area between the two definitions I gave. I suspect, for example, that Andrew Gelman would deny being bound by either of the definitions I gave. That’s fine. But I still think it is useful to have clear, if somewhat narrow, definitions to begin with.

6. Identity Statistics

One thing that has harmed statistics — and harmed science — is identity statistics. By this I mean that some people identify themselves as “Bayesians” or “Frequentists.” Once you attach a label to yourself, you have painted yourself in a corner.

When I was a student, I took a seminar course from Art Dempster. He was the one who suggested to me that it was silly to describe a person as being Bayesian of Frequentist. Instead, he suggested that we describe a particular data analysis as being Bayesian of Frequentist. But we shouldn’t label a person that way.

I think Art’s advice was very wise.

7. Failures of Assumptions

I have had several people make comments like: “95 percent intervals don’t contain the true value 95 percent of the time.” Here is what I think they mean. When we construct a confidence interval {C_n} we inevitably need to make some assumptions. For example, we might assume that the data are iid. In practice, these assumptions might fail to hold in which case the confidence interval will not have its advertised coverage. This is true but I think this obscures the discussion.

Both Bayesian and Frequentist inference can fail to achieve their stated goals for a variety of reasons. Failures of assumptions are of great practical importance but they are not criticisms of the methods themselves.

Suppose you apply special relativity to predict the position of a satellite and your prediction is wrong because some of the assumptions you made don’t hold. That’s not a valid criticism of special relativity.

8. No True Value

Some people like to say that it is meaningless to discuss the “true value of a parameter.” No problem. We could conduct this entire conversation in terms of predicting observable random variables instead. This would not change my main points.

9. Conclusion

I’ll close by repeating what I wrote at the beginning: Frequentist inference is great for doing frequentist inference. Bayesian inference is great for doing Bayesian inference. They are both useful tools. The danger is confusing them.

10. Coming Soon On This Blog!

Future posts will include:

-A guest post by Ryan Tibshirani

-A guest post by Sivaraman Balikrishnan

-My review of Nate Silver’s book

-When Does the Bootstrap Work?

-Matrix-Fu, that deadly combination of Matrix Calculus and Kung-Fu.

anti xkcd

anti xkcd

Some of you may have noticed that the recent installment of the always entertaining web comic, xkcd,
had a statistical theme with a decidedly anti-frequentist flavor: see here.

In the interest of balance, here is my
(admittedly crude) attempt at an xkcd style comic.
Right back at you Randall Munroe!

Betting and Elections

Betting and Elections

The winner of yesterday’s election was … statistics.

While bloviating pundits went on and on about how close the election was going to be, some people actually used statistics to forecast the outcome. Perhaps the most famous election quant is Nate Silver. (I’ll be writing more about Nate Silver in a few weeks when I write a post about his book, the signal and the noise. Despite my admiration for Silver, I think he is a bit confused about the difference between Bayesian and frequentist inference. He is a raving frequentist, parading as a Bayesian, as I’ll explain in a few weeks.)

Silver uses all the available polls together with other background information, and then applies statistical methods to combine the information and make predictions. Over at Simply Statistics, there is a nice plot, which I reproduce here:

This plot is from “Simply Statistics”

The plot shows the voting percentage versus Silver’s prediction. Pretty impressive. Of course, Silver wasn’t the only one using statistical methods to make election predictions. See the Washington Post for some more.

Silver caused some controversy when he responded to criticisms of his predictions by offering to bet. Margaret Sullivan, the New York times ombudsman (or, ombudswoman, I guess) criticized Silver for offering the bet. But as Alex Tabarrok argued, offering to bet is a good idea. As Tabarrok puts it:

“A Bet is a Tax on Bullshit”

(This is one of my favorite quotes of the year.) Tabarrok goes on to say: “In fact, the NYTimes should require that Silver, and other pundits, bet their beliefs.”

I agree with this. Imagine if every pundit had to bet part of their salary every time they made a prediction.

I would go a step further and say that every politician should have to put their money where their mouth is. After all, most public policy consists of bets made with other people’s money. If the president thinks that investing in Solyndra is a good bet, then he should have to put up some of his own money.

Betting is a great test of one’s beliefs. I applaud Nate Silver for standing behind his predictions with the offer of a bet. We need more of that.

Edit: My colleague Andrew Thomas has a nice post about this: see
here

Causality Prize

Judea Pearl asked me to post the following announcement:

Dear friends in causality research,

I think you will be pleased to know that the American Statistical Association has announced a new Prize, “Causality in Statistics Education”, aimed to encourage the teaching of basic causal inference in introductory statistics courses.

The motivations for the prize are discussed in an interview I gave to Ron Wasserstein. Nomination procedures and selection criteria can be found here.

I hope readers of this list will participate, either by innovating new tools for teaching causation or by nominating candidates who deserve the prize.

And speaking about education, Bryant and I have revised our survey of econometrics textbooks, see here, and would love to hear your suggestions on how to restore causal inference to econometrics education.

Cheers,

Judea Pearl, UCLA

ps. Please visit our latest results here.

Follow

Get every new post delivered to your Inbox.

Join 950 other followers