Ockham’s Razor

From Friday to Sunday I attended a philosophy conference on the
Foundations for Ockham’s Razor. Fellow bloggers Deborah Mayo (a.k.a. the
frequentist in exile) and Cosma Shalizi were also there. The conference was organized by Kevin Kelly.

Here is a conference photo (due to B. Kliban):

Cosma and Deborah are giving blow by blow details on their blogs. I’ll just make a few general comments.

The idea of the workshop is to bring together philosophers, statisticians, computer scientists, mathematicians, etc. to talk about
one of the “principles” that comes up in statistics (and more generally in science), namely, simplicity. In particular, we were
there to discuss the principle called Ockham’s razor.

There was plenty of agreement (at least at this workshop) that nature
can be very complex and yet it is still useful (in many cases) to bias
statistical procedures towards simpler models. Making this precise
usually involves invoking measures of complexity such as: VC
dimension, covering numbers, Rademacher complexity, effective degrees
of freedom and so on. And there are many methods for choosing the
degree of complexity, including: AIC, BIC, cross-validation,
structural risk minimization, Bayesian methods, etc. Nevertheless,
providing a philosophical foundation for all this can be challenging.
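
To make the idea of "choosing the degree of complexity" concrete, here is a minimal sketch that is not taken from any of the talks. It assumes a toy setup: noisy samples of a sine curve, fitted by a piecewise-constant regression (a regressogram) whose complexity is the number of bins. The data-generating function, noise level, sample size and the helper names `regressogram_rss` and `select_bins` are all assumptions made for illustration. AIC and BIC are computed in their usual Gaussian form, n log(RSS/n) plus a penalty of 2k or k log(n).

```python
import math
import random

def regressogram_rss(xs, ys, k):
    """Residual sum of squares of a piecewise-constant fit with k equal-width bins on [0, 1)."""
    sums = [0.0] * k
    counts = [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)  # index of the bin containing x
        sums[b] += y
        counts[b] += 1
    # The fitted value in each bin is the mean response (0.0 for an empty bin).
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sum((y - means[min(int(x * k), k - 1)]) ** 2 for x, y in zip(xs, ys))

def select_bins(xs, ys, max_bins=20):
    """Return the number of bins minimizing AIC and the number minimizing BIC."""
    n = len(xs)
    aic, bic = {}, {}
    for k in range(1, max_bins + 1):
        rss = regressogram_rss(xs, ys, k)
        fit = n * math.log(rss / n)      # Gaussian -2 log-likelihood, up to a constant
        aic[k] = fit + 2 * k             # AIC penalty: 2 per parameter
        bic[k] = fit + math.log(n) * k   # BIC penalty: log(n) per parameter
    return min(aic, key=aic.get), min(bic, key=bic.get)

# Toy data (an assumption for illustration): a sine curve plus Gaussian noise.
rng = random.Random(0)
n = 200
xs = [rng.random() for _ in range(n)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

aic_bins, bic_bins = select_bins(xs, ys)
print("AIC picks", aic_bins, "bins; BIC picks", bic_bins, "bins")
```

Because log(n) exceeds 2 once n is larger than about 7, BIC penalizes each extra bin more heavily than AIC, so it never selects a more complex fit than AIC on the same data. Neither criterion says anything about whether the chosen complexity is "right" in a deeper sense, which is where the foundational questions start.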

Here is a brief summary of the talks. (The full lectures will be
posted online.)

The conference began with a talk by Vladimir Vapnik.
It is hard to overstate Professor Vapnik’s influence in
machine learning and statistics. His fundamental work on uniform laws
of large numbers, as well as support vector machines and
kernelization, is legendary. He talked about an extension of support
vector machines but there were also hints of eastern mysticism and poetry. Next
was Vladimir Cherkassky. (This was the “Russians named Vladimir”
session.) He also talked about VC-based ideas.

Peter Grunwald was next. Since Peter is a statistician (as well as a computer
scientist) I found his talk easier to understand. He talked about
probability-free inference (i.e. uniform over all data sequences).
These are procedures that work well if your hypothesized model (or
prior) is right. But they still have guarantees if your model is
wrong. In fact, they satisfy guarantees even if the data are not
generated by any stochastic mechanism. (I’ll blog about this type of
inference in detail at some point.)

My talk was on examples where standard tuning parameter methods
like cross-validation fail, because they choose overly complex models.
My claim was that, in many cases, we still don’t have provably good methods
of choosing models with the right degree of complexity.
(Density estimation when the support of the distributions is
concentrated near a low dimensional manifold is an example.)

Elliott Sober talked about notions of parsimony in constructing
phylogenetic trees. There seems to be some disagreement about whether
simplicity of these trees should be defined in terms of likelihood or
in terms of other measures (like how many mutations must occur to get
a particular tree).

Unfortunately I missed the talks Saturday morning but I caught the
tail end of Kevin Kelly’s talk. Kevin has spent years developing a
mathematical, probability-free account of the properties of methods
that select “good” models. His work is quite difficult; there are
lots of mathematical details. I think he has some novel ideas but
they are in a language that is hard for statisticians to understand.
I urged Kevin to write a paper for a statistics journal explaining his
theory to statisticians.

Next was Cosma on complex stochastic processes; see his papers for the details.

Hannes Leeb talked about the lack of uniform consistency in model
selection procedures. (The reason for this behavior is familiar if
you remember the Hodges estimator; basically, if a parameter
is within O(1/\sqrt{n}) of 0, then it is statistically
difficult to distinguish from 0. This screws up uniform consistency.)
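
For readers who don't remember the Hodges estimator, here is a small simulation sketch. It is not from Hannes's talk; it uses the textbook form of the estimator for a normal mean with unit variance, which returns the sample mean when its absolute value is at least n^{-1/4} and returns 0 otherwise. The sample size, the choice of 3/\sqrt{n} as the "nearby" parameter value, and the function names are assumptions made for illustration.

```python
import math
import random

def hodges(xbar, n):
    """Hodges estimator: keep the sample mean unless it falls within n^(-1/4) of 0."""
    return xbar if abs(xbar) >= n ** -0.25 else 0.0

def sample_mean(xbar, n):
    """The ordinary estimator, for comparison."""
    return xbar

def scaled_risk(theta, n, estimator, reps=20000, seed=0):
    """Monte Carlo estimate of n * E[(estimate - theta)^2] for N(theta, 1) data."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)  # the mean of n draws from N(theta, 1) is N(theta, 1/n)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)
        total += (estimator(xbar, n) - theta) ** 2
    return n * total / reps

n = 10_000
theta_local = 3.0 / math.sqrt(n)  # a parameter of order 1/sqrt(n) from 0

risk_hodges_zero = scaled_risk(0.0, n, hodges)
risk_hodges_local = scaled_risk(theta_local, n, hodges)
risk_mean_local = scaled_risk(theta_local, n, sample_mean)

print("Hodges, theta = 0:        ", risk_hodges_zero)
print("Hodges, theta = 3/sqrt(n):", risk_hodges_local)
print("Mean,   theta = 3/sqrt(n):", risk_mean_local)
```

At exactly 0 the Hodges estimator looks better than the sample mean, since it snaps to 0 almost every time. At 3/\sqrt{n} it still snaps to 0, so its scaled risk is roughly 9 while the sample mean stays near 1. No fixed n rescues this, because the troublesome parameter values move toward 0 as n grows, and that is the failure of uniformity.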

Malcolm Forster gave a historical perspective, discussing Ptolemy, Copernicus, Kepler and Tycho Brahe.

Digression: Tycho Brahe is one of my favorite historical characters.
Kevin tells me that the name is pronounced “Tooko Bra.”
Other sources claim it is “Teeko Bra.”
As you may recall, Tycho had a silver nose, having lost his real nose in a duel.
He died from a burst bladder during a drinking binge. He owned a pet
elk (or moose?) which died from drinking too much beer. For more on
this colorful character see here.

The last talk was Deborah Mayo, the frequentist in exile, on “Error
Correction, Severity, and Truth: What’s Simplicity Got to Do With
it?” Deborah argued that scientific inference is about exploring and
probing various hypotheses and cannot be reduced to something as
simple as “choose the simplest model consistent with the data.”
To understand her viewpoint, check out her blog or her book.

The conference ended with a round-table discussion. Professor Vapnik
started with a long comment about the need to focus on quantities that
can be estimated well (i.e. with uniform probability bounds). He
suggested that statisticians were led astray by Fisher and maximum
likelihood because many things that we are trying to estimate in
statistics are, in a sense, ill-posed. At least, I think that was his
message. I am sympathetic to this viewpoint but felt it was a bit
over-stated. Several of us responded and tried to defend traditional
statistical theory. But it was all very cordial and it was great fun
getting to have a panel discussion with Vladimir Vapnik.

We were all supposed to answer the question
“what is Ockham’s razor.”
To me it is just the vague (but useful)
notion that we should not, generally speaking,
fit models that are too complicated.
In other words, don’t overfit.
I don’t think there is anything controversial about that.
But if you try to view it as a more formal
principle, then there are bound to be
differences of opinion.

How would you define Ockham’s razor?

—Larry Wasserman


  1. Posted June 25, 2012 at 10:13 am

    In general, I’d say Occam’s Razor is: If E is an event, and S(m) is a theory with m assumptions that explains E, and T(n) is a theory with n assumptions that explains E, then choose S as the candidate theory for explaining E only if m < n. In statistical terms, I'd say Occam's Razor is nonparametric statistics.

  2. Keith O'Rourke
    Posted June 25, 2012 at 12:24 pm

    Nice of you to start this blog.
    My favourite definition was the “Superstition of simplicity”.
    Not sure if that was C.S. Peirce, but he spent a fair amount of time grappling with it.
    He argued that it was what we (had evolved to) recognize as the most natural (least troubling) explanation. I could re-word this as the model that raises the least doubt in our minds.
    But I think the pragmatics (or better worded as the purposefulness) of our models is perhaps a dimension that can confound the discussion getting the least wrong model.
    Keith O’Rourke

  3. Posted June 25, 2012 at 9:49 pm

    I define Ockham’s razor as, “Something I turn to whenever I have models that are equally bad or equally good, something I avoid whenever I have a model that performs well (by some performance metric that I am interested in)”

  4. Posted June 25, 2012 at 9:52 pm

    Thanks for taking the first step Larry; now I will have to post some of my half-baked notes from the conference and airport. I would not have described this as a “philosophy” conference, even though it was organized by a philosopher. The majority (7 of 11) were in stat, machine learning, computer science. Still, I hope our blogposts on it will trigger some philosophical reflections.

  5. isomorphismes
    Posted June 25, 2012 at 11:43 pm

    Even if it can be defined (maximising on two criteria at once) — The simplest model fitting the data is not always the right one. So I’m excited to hear what Dr Mayo has to say.

    Maybe my intuition on this is something like a hockey stick graph in how many principal components to include. “There should be” (it seems as in the abstract, irrespective of any actual understanding of the subject matter) some penalty ruling out extremely flowery models. But other than a vague sense that “simplicity” (how defined?) “seems good” (again abstractly), this meta-principle is hard to justify.

  6. Christian Hennig
    Posted June 28, 2012 at 10:16 am

    Your summary doesn’t mention too much about what simplicity is good for. Was this discussed? My impression is that statisticians and machine learners and perhaps scientists in general focus too much on prediction quality – if you try to fit a too complex model, estimation variance will dominate everything and one gets unreliable predictions. However, optimising prediction quality may still lead to unpleasantly complex models, even apart from selection bias, in the sense that their implications are difficult to understand and that they won’t help much if the specific situation in which prediction quality has been measured changes.
    Simplicity is desirable for all kinds of other reasons, first of all probably because all modelling is essentially about making people’s perceptions and thoughts of the world suitable for communication and human comprehension, which requires simplification at least if we accept the “complex world hypothesis” that the world is never simple enough to be forced into any kind of formal model. Apart from “fitting the data”, a model may serve to communicate how scientists perceive a situation, it may be a basis for discussion, a starting point for creative ideas, or an information compression device. Too complex models may not only predict badly, but they may also distract energy and may harm communication by being ascribed false authority.
    Some of this is not important in some applications, so depending on the situations, different kinds of simplicity may be required and sometimes one may not worry about simplicity too much. Also, unfortunately, some of the relevant aspects of simplicity cannot be easily formalised.
    But I still think that procedures for deciding about simplicity vs. fit or prediction quality should start from making explicit why and what kind of simplicity is desired in a specific situation. Ignoring this by trying to solve the problem in an abstract formal general way will not lead to a convincing “solution”.

    • Posted June 28, 2012 at 11:51 am

      Indeed there was much discussion about whether there was too much emphasis
      on pure prediction. And I showed some explicit examples where good predictive models
      were overly complex.

  7. Daniel Baker
    Posted July 9, 2012 at 5:31 pm

    I would say that the Minimum Description Length principle is an excellent formalism for Ockham’s Razor. http://www.gersteinlab.org/courses/545/07-spr/slides/mdl.jdu.ppt, http://neurobio.drexelmed.edu/molkovweb/PhysRevE.80.046207.pdf, and, most especially, http://www.ncbi.nlm.nih.gov/pubmed/18437238 by John Dougherty provided a generalization which is, as far as I understand it, free from subjective limitations of standard implementations of MDL.

  8. Peter Nelson
    Posted August 30, 2012 at 11:45 pm

    I think Kelly and Schulte’s ideas deserve a little more explanation. C. Shalizi’s summary[1] summarizes them as showing that (some understanding of) Ockham’s Razor is optimal in the sense that you are forced to revise your beliefs the fewest times[2]. The minimal-mind-change direction seems both novel and productive, and it’d be nice to hear what you think about it.

    [1] http://cscs.umich.edu/~crshalizi/weblog/922.html

    [2] I’m putting it roughly of course. There are several different loss functions which have been explored, most of them multi-objective, and the proofs are that “Ockham” strategies are exactly the strategies not strictly dominated by some other strategy. So, like, Pareto optimal or something like that.

