THE FIVE: Jeff Leek’s Challenge

Jeff Leek, over at Simply Statistics, asks an interesting question: What are the 5 most influential statistics papers of 2000-2010?

I found this to be incredibly difficult to answer. Eventually, I came up with this list:

Donoho, David (2006). Compressed sensing. IEEE Transactions on Information Theory. 52, 1289-1306.

Greenshtein, Eitan and Ritov, Ya’Acov. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10, 971-988.

Meinshausen, Nicolai and Buhlmann, Peter. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34, 1436-1462.

Efron, Bradley and Hastie, Trevor and Johnstone, Iain and Tibshirani, Robert. (2004). Least angle regression. The Annals of Statistics, 32, 407-499.

Hofmann, Thomas and Scholkopf, Bernhard and Smola, Alexander J. (2008). Kernel methods in machine learning. The Annals of Statistics. 1171–1220.

These are all very good papers. These papers had a big impact on me. More precisely, they are representative of ideas that had an impact on me. It’s more like there are clusters of papers and these are prototypes from those clusters. I am not really happy with my list. I feel like I must be forgetting some really important papers. Perhaps I am just getting old and forgetful. Or maybe our field is not driven by specific papers.

What five would you select? (Please post them at Jeff’s blog too.)


  1. Emil Gilels
    Posted July 23, 2013 at 9:21 pm | Permalink

    For sure I’d add:

    Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.

  2. Dr. Abhijit Kulkarni
    Posted July 24, 2013 at 8:05 am | Permalink

    Hello Dr Wasserman,

    Here is my list (sorted in descending order of Google Scholar citation count):

    1: Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32. (9469 citations)

    2: Efron, Bradley and Hastie, Trevor and Johnstone, Iain and Tibshirani, Robert. (2004). “Least angle regression”. The Annals of statistics, 32, 407-499 (3459 citations)

    3: Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). “Gene selection for cancer classification using support vector machines”. Machine learning, 46(1-3), 389-422. (3345 Citations)

    4: Muller, K. R., Mika, S., Ratsch, G., Tsuda, K., & Scholkopf, B. (2001). “An introduction to kernel-based learning algorithms”. Neural Networks, IEEE Transactions on, 12(2), 181-201. (2440 Citations)

    5: Friedman, J. H. (2002). “Stochastic gradient boosting”. Computational Statistics & Data Analysis, 38(4), 367-378. (682 citations)

    Apart from citation count, I feel these papers are unique in the sense that they present fresh ideas, and the associated algorithms have been tested on many challenging real-life problems with a lot of success.


  3. Christian Hennig
    Posted July 24, 2013 at 8:41 am | Permalink

    Thinking about “influential” papers is quite different from thinking about papers that I like and that had a meaning to me personally. The former requires much more knowledge about who is doing what and how others react to it. Just to nominate two papers that I personally liked a lot:
    Tyler, D., Critchley, F., Dumbgen, L., and Oja, H. (2009) Invariant coordinate selection (with discussion). Journal of the Royal Statistical Society B, 71, 549–592.
    Claeskens, G. & Hjort, N.L. (2003). The Focussed Information Criterion, Journal of the American Statistical Association, 98, 900-916 (with discussion)/Hjort, N.L. & Claeskens, G. (2003). Frequentist model average estimators, Journal of the American Statistical Association, 98, 879-899 (with discussion). (These can be counted as a single “double paper”.)

  4. Posted July 25, 2013 at 9:55 am | Permalink

    When it comes to regression, the choices may depend on how far to the machine-learning side of the spectrum you’re looking. The two papers in this list on new regression methods, the lasso and LARS, are more solidly on the statistical-community side of things. But I think there’s a decent argument that in terms of broad usage and general impact on science, the two most influential new regression methods of the 00’s were Breiman’s random forests (Machine Learning 45(1), 2001), and Friedman’s gradient-boosting machines (Annals of Statistics 29(5), 2001). The R packages ‘randomForest’ and ‘gbm’ often end up as default off-the-shelf choices for high-dimensional regression.
