## Aaronson, COLT, Bayesians and Frequentists

Aaronson, COLT, Bayesians and Frequentists

I am reading Scott Aaronson’s book “Quantum Computing Since Democritus” which can be found here.

The book is about computational complexity, quantum mechanics, quantum computing and many other things. It’s a great book and I highly recommend it. Much of the material on complexity classes is tough going but you can skim over some of the details and still enjoy the book. (That’s what I am doing.) There at least 495 different complexity classes: see the complexity zoo. I don’t know how anyone can keep track of this.

Anyway, there is a chapter on computational learning theory that I wanted to comment on. (There is another chapter about probabilistic reasoning and the anthropic principle which I’ll comment on in a future post.) Scott gives a clear introduction to learning theory and he correctly traces the birth of the theory to Leslie Valiant’s 1984 paper that introduced PAC (probably almost correct) learning. He also contrasts PAC learning with Bayesian learning.

Now I want to put on my statistical curmudgeon hat and complain about computational learning theory. My complaint is this: the discovery of computational learning theory is nothing but the re-discovery of a 100 year old statistical idea called a “confidence interval.”

Let’s review some basic learning theory. Let ${{\cal R}}$ denote all axis aligned rectangles in the plane. We can think of each rectangle ${R}$ as a classifier: predict ${Y=1}$ if ${X\in R}$ and predict ${Y=0}$ if ${X\notin R}$. Suppose we have data ${(X_1,Y_1),\ldots, (X_n,Y_n)}$. If we pick the rectangle ${\hat R}$ that makes the fewest classification errors on the data, will we predict well on a new observation? More formally: is the empirical risk estimator a good predictor?

Yes. The reason is simple. Let ${L(R) = \mathbb{P}(Y\notin R)}$ be the prediction risk and let

$\displaystyle L_n(R) = \frac{1}{n}\sum_{i=1}^n I(Y_i \notin R)$

be the empirical estimate of the risk. We would like to claim that ${\hat R}$ is close to the best classifier in ${{\cal R}}$. That is, we would like to show that ${L(\hat R)}$ is close to ${\inf_{R\in {\cal R}} L(R)}$, with high probability. This fact follows easily if we can show that

$\displaystyle \sup_{R\in {\cal R}} | L_n(R) - L(R)|$

is small with high probability. And this does hold since the VC dimension of ${{\cal R}}$ is finite. Specifically, the key fact is that, for any distribution of the data ${P}$, we have

$\displaystyle P^n\Bigl(\sup_{R\in {\cal R}} | L_n(R) - L(R)| > \epsilon\Bigr) \leq c_1 \exp\left( - c_2 n \epsilon^2 \right)$

for known constants ${c_1}$ and ${c_2}$.

But this is equivalent to saying that a ${1-\alpha}$ confidence interval for ${L(\hat R)}$ is

$\displaystyle C_n = [L_n(\hat R) - \epsilon_n,\ L_n(\hat R) + \epsilon_n]$

where

$\displaystyle \epsilon_n = \sqrt{\frac{1}{n c_2}\log\left(\frac{c_1}{\alpha}\right)}.$

That is, for any distribution ${P}$,

$\displaystyle P^n( L(\hat R) \in C_n)\geq 1-\alpha.$

As Scott points out, what distinguishes this type of reasoning from Bayesian reasoning, is that we require this to hold uniformly, and that there are no priors involved. To quote from his book:

This goes against a belief in the Bayesian religion, that if your priors are different then you come to an entirely different conclusion. The Bayesian starts out with a probability distribution over the possible hypotheses. As you get more and more data, you update this distribution using Bayes’s rule.

That’s one way to do it, but computational learning theory tells us that it’s not the only way. You don’t need to start out with any assumptions about a probability distribution over the hypotheses … you’d like to learn any hypothesis in the concept class, for any sample distribution, with high probability over the choice of samples. In other words, you can trade the Bayesians’ probability distribution over hypotheses for a probability distribution over sample data.

(Note: “hypothesis” = classifier and “concept class” = set of classifiers, “learn” = estimate).

Now, I agree completely with the above quote. But as I said, it is basically the definition of a frequentist confidence interval.

So my claim is that computational learning theory is just the application of frequentist confidence intervals to classification.

There is nothing bad about that. The people who first developed learning theory were probably not aware of existing statistical theory so they re-developed it themselves and they did it right.

But it’s my sense — and correct me if I’m wrong — that many people in computational learning theory are still woefully ignorant about the field of statistics. It would be nice if someone in the field read the statistics literature and said: “Hey, these statistics guys did this 50 years ago!”

Am I being too harsh?

1. Posted May 5, 2013 at 9:25 pm | Permalink

I think the COLT community is well aware of what you write about (I hope, I think..). The main distinguishing factor of PAC-learnability and requiring (e.g.) risk-bounds for some procedures is the requirement for having algorithms with polynomial complexity. If you take away this requirement, you get what we sometimes call the information-theoretic complexity of learning.

• Posted May 6, 2013 at 5:15 pm | Permalink

It’s true that in statistics we tend to ignore the computational complexity of the estimator.
We define an estimator to be any measurable function which really isn’t too useful.
There have been a few recent results about minimax statistical bounds when one restricts
the complexity of the estimator. I hope to blog about that in the future.

• Emre
Posted May 6, 2013 at 11:06 pm | Permalink

May I then suggest this paper for that discussion: http://arxiv.org/abs/1304.0828? The paper also received the best paper award for this year’s COLT.

• Posted May 7, 2013 at 8:06 am | Permalink

That was the one I had in mind.

2. Justin Rising
Posted May 5, 2013 at 9:45 pm | Permalink

The equivalence between learning theory and statistics is more fundamental than that. Modulo some details, if you view the output of a learning algorithm as an estimator of a true classifier, then the class of classifiers that the true classifier belongs to is PAC-learnable if and only if the estimator it generates is asymptotically consistent in the usual statistical sense. This fact does not seem to be very well-appreciated in either the learning theory or statistics community.

• Posted May 22, 2013 at 11:05 pm | Permalink

Can you explain why asymptotic consistency is needed? In the PAC model, I specify epsilon and delta, and I only need a confidence interval of width epsilon with confidence delta. As a machine learning practitioner (not particularly fluent in the theoretical analysis of PAC algorithms), it would seem that once I have the required sample size (which is polynomial in 1/epsilon and 1/delta), I don’t care how the estimator behaves for larger samples.

Is the issue that I can specify epsilon arbitrarily close to zero, and hence drive the required sample size arbitrarily high? This would never be done in practice. I think most machine learning folks would be happy to bound epsilon away from zero.

3. Posted May 5, 2013 at 9:55 pm | Permalink

Larry:

Speaking in general terms, I do think it would be good in general for people to realize that prior distributions help in settings with sparse data and relatively strong prior information (for example, those small-n public health studies that report confidence intervals for odds ratios such as [1.21, 13.93] are pretty ridiculous, given that just about nothing has an odds ratio of 13.93 in the population). With large n and a fixed model, the prior won’t do much.

I haven’t read the book in question but I think expressions such as “the Bayesian religion” (Aaronson’s phrase, not yours) are unhelpful. Bayes is no more of a “religion” than any other statistical method. Lots of Bayesians talk proudly of subjectivity and I don’t think this helps so much either.

The other point that I hope is understood in this community is that confidence intervals are better than they look, in some sense. What I’m getting at is that there are some famous examples where confidence intervals have bad properties–for example, in estimating a ratio where the denominator is of uncertain sign. But often in those settings the quantity that is purportedly being estimated has no real-world meaning. Another famous example is the difficulty of calculating properties of the distribution of a discrete table with known margins. Again, this is a problem with close to zero real applications–it is the solution to a mathematical problem but does not represent any real-world sampling distribution. Thus, some of the problems that most famously stump classical confidence interval and p-value calculations are not, I think, real problems.

4. yaoliang
Posted May 6, 2013 at 5:09 am | Permalink

I think you have neglected the fact that in classical statistics (what you claimed to have done PAC 50 years ago?), confidence intervals are derived using asymptotics (CLT, etc). Whether or not this is a big difference from finite sample bounds depends of course on your flavour. Also, the important notion of the complexity of a hypothesis class (VC dim, etc) does not seem to exist in classical statistics?

• Posted May 6, 2013 at 8:06 am | Permalink

Not all were done using asymptotics.
Remember, Hoeffding was a statistician!

• K. Knight
Posted May 6, 2013 at 11:58 am | Permalink

It depends on how you define “classical statistics”. As a graduate student in the mid-1980s, I learned about VC dimension albeit simply in the context of empirical process theory. But it is true that statisticians have generally been quite slow on the uptake in this area.

5. Posted May 11, 2013 at 10:14 am | Permalink

Just because a group of people studied the first instance of something that is potentially mathematically equivalent doesn’t mean that group started the field. As an extreme (and albeit imperfect) example, the Greeks presumably invented the first notion of an algorithm, but no one is claiming that they started the field of algorithms. PAC learnability is, in some sense, a re-discovery of frequentist confidence intervals. However, PAC learnability also started the field of computational learning theory in a way “just the concept” of confidence intervals never covers (including integrating ideas like complexity of hypothesis classes, etc).

6. Posted June 8, 2013 at 4:16 pm | Permalink

It’s been nearly 20 years since I studied / used CoLT, but let me reiterate Csaba Szepesvari’s comment — CoLT is all about finding polynomial bounds on computational complexity, as a function of 1/error, 1/prob(failure), and problem size. And intractable computational problems are surpisingly easy to find in this field.

For example, consider something as simple as learning a linear threshold function. If perfect discrimination is possible, that’s easily doable in polynomial time. Now let’s modify the problem slightly by making it more realistic: you don’t expect perfect discrimination to be possible, but you’d like to get as close to perfect discrimination as you can. That’s an NP-hard problem; it can’t be done in polynomial time (unless P=NP). OK, you say, I’d be satisfied with coming within some constant factor of the minimum, e.g., no more than 1.2 times the minimum error. That’s still NP-hard, for *any* constant factor, no matter how large. The same goes for the variable selection problem: not only is it NP-hard to minimize the number of regressors needed for perfect discrimination, it’s also NP-hard to come within any constant factor of the minimum.

As an aside, I lost interest in CoLT for two reasons: (1) nearly every interesting problem seemed to be NP-hard, so in practice you always seem to end up using computational methods that lack time or performance guarantees, and (2) I learned about Bayesian methods, which struck me as far more flexible as well as providing a more intellectually productive way of thinking about inference problems.

7. Andy D
Posted June 16, 2013 at 1:35 pm | Permalink

Besides the insistence on computational constraints, another trend that in practice distinguishes much CoLT from much classical statistics is the interest in learning hypotheses drawn from richly-structured discrete hypothesis classes, e.g. DNFs and neural networks. Here there are interesting new algorithms as well as cryptographic-hardness results (for improper learning) and NP-hardness results (for proper learning).

CoLT does not claim to have discovered the basic ideas of non-Bayesian statistics. However, it has explored this space in new ways.