From Friday to Sunday I attended a philosophy conference on the Foundations for Ockham’s Razor. Fellow bloggers Deborah Mayo (a.k.a. the frequentist in exile) and Cosma Shalizi were there too. The conference was organized by Kevin Kelly.

Here is a conference photo (due to B. Kliban):

Cosma and Deborah are giving blow-by-blow details on their blogs. I’ll just make a few general comments.

The idea of the workshop is to bring together philosophers, statisticians, computer scientists, mathematicians, etc., to talk about

one of the “principles” that comes up in statistics (and more generally in science), namely, simplicity. In particular, we were

there to discuss the principle called Ockham’s razor.

There was plenty of agreement (at least at this workshop) that nature

can be very complex and yet it is still useful (in many cases) to bias

statistical procedures towards simpler models. Making this precise

usually involves invoking measures of complexity such as: VC

dimension, covering numbers, Rademacher complexity, effective degrees

of freedom and so on. And there are many methods for choosing the

degree of complexity, including: AIC, BIC, cross-validation,

structural risk minimization, Bayesian methods, etc. Nevertheless,

providing a philosophical foundation for all this can be challenging.

Here is a brief summary of the talks. (The full lectures will be

posted online.)

The conference began with a talk by Vladimir

Vapnik. It is hard to overstate Professor Vapnik’s influence in

machine learning and statistics. His fundamental work on uniform laws

of large numbers, as well as support vector machines and

kernelization, is legendary. He talked about an extension of support

vector machines but there were also hints of eastern mysticism and poetry. Next

was Vladimir Cherkassky. (This was the “Russians named Vladimir”

session.) He also talked about VC-based ideas.

Peter Grunwald was next. Since Peter is a statistician (as well as a computer

scientist) I found his talk easier to understand. He talked about

probability-free inference (i.e. uniform over all data sequences).

These are procedures that work well if your hypothesized model (or

prior) is right. But they still have guarantees if your model is

wrong. In fact, they satisfy guarantees even if the data are not

generated by any stochastic mechanism. (I’ll blog about this type of

inference in detail at some point.)

My talk was on examples where standard tuning parameter methods

like cross-validation fail, because they choose overly complex models.

My claim was that, in many cases, we still don’t have provably good methods

of choosing models with the right degree of complexity.

(Density estimation when the support of the distributions is

concentrated near a low dimensional manifold is an example.)
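As a toy version of this failure mode (my own illustration, not one of the examples from my talk): cross-validation scores candidate models purely by predictive error, so it can happily select a polynomial degree above the truth, since the extra terms cost little predictively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: data from a cubic trend, candidate polynomial
# degrees 1..8, the degree chosen by 5-fold cross-validation.
n = 60
x = np.linspace(-1, 1, n)
y = 1 - 2 * x + 0.5 * x**3 + rng.normal(scale=0.3, size=n)

folds = np.array_split(rng.permutation(n), 5)

def cv_mse(degree):
    """5-fold cross-validated mean squared prediction error."""
    errs = []
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[test]) - y[test]) ** 2))
    return float(np.mean(errs))

scores = {d: cv_mse(d) for d in range(1, 9)}
best = min(scores, key=scores.get)
# CV targets prediction error, so `best` may well exceed the true degree (3):
# the spurious terms barely hurt prediction but complicate the model.
print(best, round(scores[best], 4))
```

This only illustrates the tendency; it is not a proof, and on any particular draw CV may still land on the right degree.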

Elliot Sober talked about notions of parsimony in constructing

phylogenetic trees. There seems to be some disagreement about whether

simplicity of these trees should be defined in terms of likelihood or

in terms of other measures (like how many mutations must occur to get

a particular tree).

Unfortunately I missed the talks Saturday morning but I caught the

tail end of Kevin Kelly’s talk. Kevin has spent years developing a

mathematical, probability-free account of the properties of methods

that select “good” models. His work is quite difficult; there are

lots of mathematical details. I think he has some novel ideas but

they are in a language that is hard for statisticians to understand.

I urged Kevin to write a paper for a statistics journal explaining his

theory to statisticians.

Next was Cosma on complex stochastic processes; see his papers for the details.

Hannes Leeb talked about the lack of uniform consistency in model

selection procedures. (The reason for this behavior is familiar if

you remember the Hodges estimator; basically, if a parameter

is of order n^{-1/2} from 0, then it is statistically

difficult to distinguish from 0. This screws up uniform consistency.)
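The Hodges phenomenon is easy to see numerically. Below is a sketch (the threshold n^{-1/4}, sample size, and simulation settings are my own choices): the estimator snaps the sample mean to 0 when it is small, which makes the risk tiny at 0 but inflates it at parameters of order n^{-1/2}.

```python
import numpy as np

# Hodges estimator: use the sample mean, but snap it to 0 whenever
# it falls below the threshold n^(-1/4).
def hodges(xbar, n):
    return xbar if abs(xbar) > n ** (-0.25) else 0.0

def scaled_risk(theta, n, reps=20_000, seed=1):
    """Monte Carlo estimate of n * E[(estimator - theta)^2]."""
    rng = np.random.default_rng(seed)
    xbar = theta + rng.normal(size=reps) / np.sqrt(n)  # mean of n N(theta,1) draws
    est = np.where(np.abs(xbar) > n ** (-0.25), xbar, 0.0)
    return float(n * np.mean((est - theta) ** 2))

n = 10_000
print(scaled_risk(0.0, n))             # near 0: beats the sample mean (scaled risk 1) at theta = 0
print(scaled_risk(2 / np.sqrt(n), n))  # about 4: much worse just off 0, so the risk is non-uniform
```

The pointwise risk is fine at every fixed theta; it is the supremum over theta near 0 that blows up, which is exactly the non-uniformity at issue.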

Malcolm Forster gave a historical perspective, discussing Ptolemy, Copernicus, Kepler and Tycho Brahe.

Digression: Tycho Brahe is one of my favorite historical characters.

Kevin tells me that the name is pronounced “Tooko Bra.”

Other sources claim it is “Teeko Bra.”

As you may recall, Tycho had a silver nose, having lost his real nose in a duel.

He died from a burst bladder during a drinking binge. He owned a pet

elk (or moose?) which died from drinking too much beer. For more on

this colorful character see here.

The last talk was Deborah Mayo, the frequentist in exile, on “Error

Correction, Severity, and Truth: What’s Simplicity Got to Do With

it?” Deborah argued that scientific inference is about exploring and

probing various hypotheses and cannot be reduced to something as

simple as “choose the simplest model consistent with the data.”

To understand her viewpoint, check out her blog or her book.

The conference ended with a round-table discussion. Professor Vapnik

started with a long comment about the need to focus on quantities that

can be estimated well (i.e. with uniform probability bounds). He

suggested that statisticians were led astray by Fisher and maximum

likelihood because many things that we are trying to estimate in

statistics are, in a sense, ill-posed. At least, I think that was his

message. I am sympathetic to this viewpoint but felt it was a bit

over-stated. Several of us responded and tried to defend traditional

statistical theory. But it was all very cordial and it was great fun

getting to have a panel discussion with Vladimir Vapnik.

We were all supposed to answer the question

“What is Ockham’s razor?”

To me it is just the vague (but useful)

notion that we should not, generally speaking,

fit models that are too complicated.

In other words, don’t overfit.

I don’t think there is anything controversial about that.

But if you try to view it as a more formal

principle, then there are bound to be

differences of opinion.

How would you define Ockham’s razor?

—Larry Wasserman

## 9 Comments

In general, I’d say Occam’s Razor is: If E is an event, and S(m) is a theory with m assumptions that explains E, and T(n) is a theory with n assumptions that explains E, then choose S as the candidate theory for explaining E only if m < n. In statistical terms, I'd say Occam's Razor is nonparametric statistics.

Nice of you to start this blog.

My favourite definition was the “Superstition of simplicity”.

Not sure if that was C.S. Peirce, but he spent a fair amount of time grappling with it.

He argued that it was what we (had evolved to) recognize as the most natural (least troubling) explanation. I could re-word this as the model that raises the least doubt in our minds.

But I think the pragmatics (or, better worded, the purposefulness) of our models is perhaps a dimension that can confound the discussion of getting the least wrong model.

Keith O’Rourke

I define Ockham’s razor as, “Something I turn to whenever I have models that are equally bad or equally good, something I avoid whenever I have a model that performs well (by some performance metric that I am interested in)”

Thanks for taking the first step Larry; now I will have to post some of my half-baked notes from the conference and airport. I would not have described this as a “philosophy” conference, even though it was organized by a philosopher. The majority (7 of 11) were in stat, machine learning, computer science. Still, I hope our blogposts on it will trigger some philosophical reflections.

Even if it can be defined (maximising on two criteria at once), the simplest model fitting the data is not always the right one. So I’m excited to hear what Dr Mayo has to say.

Maybe my intuition on this is something like a hockey stick graph in how many principal components to include. “There should be” (it seems, in the abstract, irrespective of any actual understanding of the subject matter) some penalty ruling out extremely flowery models. But other than a vague sense that “simplicity” (how defined?) “seems good” (again abstractly), this meta-principle is hard to justify.

Your summary doesn’t mention too much about what simplicity is good for. Was this discussed? My impression is that statisticians and machine learners, and perhaps scientists in general, focus too much on prediction quality: if you try to fit a too complex model, estimation variance will dominate everything and one gets unreliable predictions. However, optimising prediction quality may still lead to unpleasantly complex models, even apart from selection bias, in the sense that their implications are difficult to understand and that they won’t help much if the specific situation in which prediction quality has been measured changes.

Simplicity is desirable for all kinds of other reasons, first of all probably because all modelling is essentially about making people’s perceptions and thoughts of the world suitable for communication and human comprehension, which requires simplification at least if we accept the “complex world hypothesis” that the world is never simple enough to be forced into any kind of formal model. Apart from “fitting the data”, a model may serve to communicate how scientists perceive a situation, it may be a basis for discussion, a starting point for creative ideas, or an information compression device. Too complex models may not only predict badly, but they may also distract energy and may harm communication by being ascribed false authority.

Some of this is not important in some applications, so depending on the situations, different kinds of simplicity may be required and sometimes one may not worry about simplicity too much. Also, unfortunately, some of the relevant aspects of simplicity cannot be easily formalised.

But I still think that procedures for deciding about simplicity vs. fit or prediction quality should start from making explicit why and what kind of simplicity is desired in a specific situation. Ignoring this by trying to solve the problem in an abstract formal general way will not lead to a convincing “solution”.

Indeed there was much discussion about whether there was too much emphasis

on pure prediction. And I showed some explicit examples where good predictive models

were overly complex.

I would say that the Minimum Description Length (MDL) principle is an excellent formalism for Ockham’s Razor. http://www.gersteinlab.org/courses/545/07-spr/slides/mdl.jdu.ppt, http://neurobio.drexelmed.edu/molkovweb/PhysRevE.80.046207.pdf, and, most especially, http://www.ncbi.nlm.nih.gov/pubmed/18437238 by John Dougherty provided a generalization which is, as far as I understand it, free from the subjective limitations of standard implementations of MDL.
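For readers unfamiliar with the two-part MDL idea mentioned above, here is a minimal sketch (my own toy example, not drawn from the linked papers): the total codelength is the bits needed to name the model plus the bits needed to encode the data given that model, and the model minimizing the total wins.

```python
import math

# Two-part MDL for a Bernoulli sequence with a small candidate grid of
# parameters. Smaller total codelength = simpler-plus-better-fitting model.
def codelength(data, p, grid_size):
    model_bits = math.log2(grid_size)  # bits to name p's index in the grid
    data_bits = -sum(math.log2(p if x == 1 else 1 - p) for x in data)
    return model_bits + data_bits

data = [1, 1, 0, 1, 1, 1, 0, 1]        # 6 ones in 8 tosses
grid = [0.1, 0.25, 0.5, 0.75, 0.9]     # candidate Bernoulli parameters
best = min(grid, key=lambda p: codelength(data, p, len(grid)))
print(best)  # 0.75: the grid point nearest the empirical frequency 6/8
```

Here the model part is a constant (a flat code over the grid), so the trade-off is trivial; richer model classes charge more bits for more parameters, which is where the razor bites.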

I think Kelly and Schulte’s ideas deserve a little more explanation. C. Shalizi’s summary[1] summarizes them as showing that (some understanding of) Ockham’s Razor is optimal in the sense that you are forced to revise your beliefs the fewest times[2]. The minimal-mind-change direction seems both novel and productive, and it’d be nice to hear what you think about it.

[1] http://cscs.umich.edu/~crshalizi/weblog/922.html

[2] I’m putting it roughly of course. There are several different loss functions which have been explored, most of them multi-objective, and the proofs are that “Ockham” strategies are exactly the strategies not strictly dominated by some other strategy. So, like, Pareto optimal or something like that.

## 2 Trackbacks

[...] Shalizi and Larry Wasserman discuss some papers from a conference on Ockham’s Razor. I don’t have anything new to [...]

[...] Wasserman, on his new blog, Normal Deviate [here], which also has a nice précis of Peter Grunwald’s talk on “Self-repairing Bayesian [...]