From Friday to Sunday I attended a Philosophy conference on the
Foundations for Ockham’s Razor. Fellow bloggers Deborah Mayo (a.k.a. the
frequentist in exile) and Cosma Shalizi were there too. The conference was organized by Kevin Kelly.
Here is a conference photo (due to B. Kliban):
Cosma and Deborah are giving blow-by-blow details on their blogs. I’ll just make a few general comments.
The idea of the workshop is to bring together philosophers, statisticians, computer scientists, mathematicians, etc., to talk about
one of the “principles” that comes up in statistics (and more generally in science), namely, simplicity. In particular, we were
there to discuss the principle called Ockham’s razor.
There was plenty of agreement (at least at this workshop) that nature
can be very complex and yet it is still useful (in many cases) to bias
statistical procedures towards simpler models. Making this precise
usually involves invoking measures of complexity such as: VC
dimension, covering numbers, Rademacher complexity, effective degrees
of freedom and so on. And there are many methods for choosing the
degree of complexity, including: AIC, BIC, cross-validation,
structural risk minimization, Bayesian methods, etc. Nevertheless,
providing a philosophical foundation for all this can be challenging.
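To make "choosing the degree of complexity" concrete, here is a minimal sketch (mine, not from any of the talks). It fits a piecewise-constant regression with k equal-width bins on [0, 1] and picks k by AIC, BIC and leave-one-out cross-validation. The function names `fit_stats` and `select_bins`, the sine-curve data, the noise level, and the Gaussian-error forms of AIC and BIC are all assumptions made for illustration.

```python
import math
import random

def bin_index(x, k):
    """Index of the equal-width bin on [0, 1] that contains x."""
    return min(int(x * k), k - 1)

def fit_stats(xs, ys, k):
    """Return (RSS, leave-one-out CV score) for a k-bin piecewise-constant fit."""
    bins = [[] for _ in range(k)]
    for x, y in zip(xs, ys):
        bins[bin_index(x, k)].append(y)
    rss, loo = 0.0, 0.0
    for b in bins:
        m = len(b)
        if m == 0:
            continue
        mean = sum(b) / m
        for y in b:
            r = y - mean
            rss += r * r
            # Dropping y from its own bin mean scales the residual by m / (m - 1).
            loo += (r * m / (m - 1)) ** 2 if m > 1 else math.inf
    return rss, loo / len(ys)

def select_bins(xs, ys, max_k=40):
    """Pick the number of bins by AIC, BIC and leave-one-out CV."""
    n = len(ys)
    scores = {"AIC": {}, "BIC": {}, "LOO-CV": {}}
    for k in range(1, max_k + 1):
        rss, loo = fit_stats(xs, ys, k)
        # Gaussian-error forms: fit term plus a penalty that grows with k.
        scores["AIC"][k] = n * math.log(rss / n) + 2 * k
        scores["BIC"][k] = n * math.log(rss / n) + k * math.log(n)
        scores["LOO-CV"][k] = loo
    return {name: min(s, key=s.get) for name, s in scores.items()}

# Hypothetical data: a sine curve on an even grid plus Gaussian noise.
rng = random.Random(1)
n = 200
xs = [(i + 0.5) / n for i in range(n)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]
print(select_bins(xs, ys))
```

Because BIC charges log(n) per bin and AIC charges 2, BIC can never pick more bins than AIC once n exceeds about 7. Cross-validation uses no explicit penalty at all; it estimates prediction error directly.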
Here is a brief summary of the talks. (The full lectures will be
The conference began with a talk by Vladimir
Vapnik. It is hard to overstate Professor Vapnik’s influence in
machine learning and statistics. His fundamental work on uniform laws
of large numbers, as well as support vector machines and
kernelization, is legendary. He talked about an extension of support
vector machines but there were also hints of eastern mysticism and poetry. Next
was Vladimir Cherkassky. (This was the “Russians named Vladimir”
session.) He also talked about VC-based ideas.
Peter Grunwald was next. Since Peter is a statistician (as well as a computer
scientist) I found his talk easier to understand. He talked about
probability-free inference (i.e. uniform over all data sequences).
These are procedures that work well if your hypothesized model (or
prior) is right. But they still have guarantees if your model is
wrong. In fact, they satisfy guarantees even if the data are not
generated by any stochastic mechanism. (I’ll blog about this type of
inference in detail at some point.)
My talk was on examples where standard tuning parameter methods
like cross-validation fail, because they choose overly complex models.
My claim was that, in many cases, we still don’t have provably good methods
of choosing models with the right degree of complexity.
(Density estimation when the support of the distributions is
concentrated near a low dimensional manifold is an example.)
Elliott Sober talked about notions of parsimony in constructing
phylogenetic trees. There seems to be some disagreement about whether
simplicity of these trees should be defined in terms of likelihood or
in terms of other measures (like how many mutations must occur to get
a particular tree).
Unfortunately I missed the talks Saturday morning but I caught the
tail end of Kevin Kelly’s talk. Kevin has spent years developing a
mathematical, probability-free account of the properties of methods
that select “good” models. His work is quite difficult; there are
lots of mathematical details. I think he has some novel ideas but
they are in a language that is hard for statisticians to understand.
I urged Kevin to write a paper for a statistics journal explaining his
theory to statisticians.
Next was Cosma on complex stochastic processes; see his papers for the details.
Hannes Leeb talked about the lack of uniform consistency in model
selection procedures. (The reason for this behavior is familiar if
you remember the Hodges estimator; basically, if a parameter
is of order $1/\sqrt{n}$ from 0, then it is statistically
difficult to distinguish from 0. This screws up uniform consistency.)
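For readers who don't remember the Hodges estimator, here is a small simulation sketch of the phenomenon (my own illustration, not taken from Hannes's talk). It assumes n draws from N(theta, 1), uses the textbook version of the estimator that sets the sample mean to 0 whenever it falls within n^(-1/4) of 0, and compares the scaled risk n * E[(estimate - theta)^2] with that of the plain sample mean. The function names and the choices of n and theta are hypothetical.

```python
import math
import random

def sample_mean(xbar, n):
    """The plain sample mean, used as the baseline estimator."""
    return xbar

def hodges(xbar, n):
    """Hodges estimator: keep the sample mean unless it is within n^(-1/4) of 0."""
    return xbar if abs(xbar) >= n ** -0.25 else 0.0

def scaled_risk(estimator, theta, n, reps=20000, seed=0):
    """Monte Carlo estimate of n * E[(estimate - theta)^2]."""
    rng = random.Random(seed)
    sd = 1.0 / math.sqrt(n)  # sd of the mean of n draws from N(theta, 1)
    total = 0.0
    for _ in range(reps):
        xbar = rng.gauss(theta, sd)  # simulate the sample mean directly
        total += (estimator(xbar, n) - theta) ** 2
    return n * total / reps

n = 10_000
for c in (0.0, 1.0, 3.0):
    theta = c / math.sqrt(n)  # parameter at distance c / sqrt(n) from 0
    print(c,
          round(scaled_risk(sample_mean, theta, n), 2),
          round(scaled_risk(hodges, theta, n), 2))
```

At theta = 0 the thresholded estimator looks better than the sample mean, whose scaled risk stays near 1. At theta = c / sqrt(n) it almost always reports 0, so its scaled risk is roughly c^2 and can be made as bad as you like by increasing c. The good behavior at each fixed theta does not hold uniformly over theta.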
Malcolm Forster gave a historical perspective, discussing Ptolemy, Copernicus, Kepler and Tycho Brahe.
Digression: Tycho Brahe is one of my favorite historical characters.
Kevin tells me that the name is pronounced “Tooko Bra.”
Other sources claim it is “Teeko Bra.”
As you may recall, Tycho had a silver nose, having lost his real nose in a duel.
He died from a burst bladder during a drinking binge. He owned a pet
elk (or moose?) which died from drinking too much beer. For more on
this colorful character see here.
The last talk was Deborah Mayo, the frequentist in exile, on “Error
Correction, Severity, and Truth: What’s Simplicity Got to Do With
it?” Deborah argued that scientific inference is about exploring and
probing various hypotheses and cannot be reduced to something as
simple as “choose the simplest model consistent with the data.”
To understand her viewpoint, check out her blog or her book.
The conference ended with a round-table discussion. Professor Vapnik
started with a long comment about the need to focus on quantities that
can be estimated well (i.e. with uniform probability bounds). He
suggested that statisticians were led astray by Fisher and maximum
likelihood because many things that we are trying to estimate in
statistics are, in a sense, ill-posed. At least, I think that was his
message. I am sympathetic to this viewpoint but felt it was a bit
over-stated. Several of us responded and tried to defend traditional
statistical theory. But it was all very cordial and it was great fun
getting to have a panel discussion with Vladimir Vapnik.
We were all supposed to answer the question
“what is Ockham’s razor.”
To me it is just the vague (but useful)
notion that we should not, generally speaking,
fit models that are too complicated.
In other words, don’t overfit.
I don’t think there is anything controversial about that.
But if you try to view it as a more formal
principle, then there are bound to be
differences of opinion.
How would you define Ockham’s razor?