Steve Marron on “Big Data”

Steve Marron is a statistician at UNC. In his younger days he was well known for his work on nonparametric theory. These days he works on a number of interesting things, including the analysis of structured objects (like tree-structured data) and high-dimensional theory.

Steve sent me a thoughtful email the other day about “Big Data” and, with his permission, I am posting it here.

I agree with pretty much everything he says. I especially like these two gems: First, “a better funded statistical community would be a more efficient way to get such things done without all this highly funded re-discovery.” And second: “I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas.”

Enough from me. Here is Steve’s comment:

Guest Post, by Steve Marron

My colleagues and I have lately been discussing “Big Data”, and your blog was mentioned.

Not surprisingly you’ve got some interesting ideas there. Here come some of my own thoughts on the matter.

First, should one be pessimistic? I am not so sure. For me, exhibit A is my own colleagues. When such things came up in the past (and I believe that this HAS happened; see the discussion below) my (at that time senior) colleagues were rather arrogantly ignorant. Issues such as the ones you are raising were blatantly pooh-poohed, if they were ever considered at all. This time around, however, I am seeing a far different picture. My now mostly junior colleagues are taking this very seriously, and we are currently engaged in major discussions as to what we are going to do about it in very concrete terms, such as course offerings. In addition, while some of my colleagues think in terms of labels such as “applied statistician”, “theoretical statistician” and “probabilist”, everybody across the board is jumping in. Perhaps this is largely driven by an understanding that universities themselves are in a massive state of flux, and that one had better be a player or else be totally left behind. But it sure looks better than some of the attitudes I saw earlier in my career.

Now about the bigger picture. I think there is an important history here that you are totally ignoring. In particular, I view “Big Data” as just the latest manifestation of a cycle that has been rolling along for quite a long time. Actually I have been predicting the advent of something of this type for quite a while (although I could not predict the name, nor the central idea).

Here comes a personally slanted (certainly over-simplified) view of what I mean here. Think back on the following set of “exciting breakthroughs”:

– Statistical Pattern Recognition
– Artificial Intelligence
– Neural Nets
– Data Mining
– Machine Learning

Each of these was started up in EE/CS. Each was the fashionable hot topic (considered very sexy and fresh by funding agencies) of its day. Each was initially based on one really cool new idea, usually far outside of what folks working in conventional statistics had any hope (well, certainly no encouragement from the statistical community) of dreaming up. I think each attracted much more NSF funding than all of statistics ever did, at any given time. A large share of the funding was used for re-invention of ideas that already existed in statistics (but would get a sexy new name). As each new field matured, there came a recognition that in fact much was to be gained by studying connections to statistics, so there was then lots of work “creating connections”.

Now, given the timing of these, and how each has played out over time, it has been clear to me for quite a while that we were ripe for the next one. So the current advent of Big Data is no surprise at all. Frankly, I am a little disappointed that there does not seem to be any really compelling new idea (e.g. as in neural nets, or the kernel embedding idea that drove machine learning). But I suspect that the need for such a thing to happen, to keep this community properly funded, has overcome the need for an exciting new idea. Instead of new methodology, this seems to be driven more by parallelism and cloud computing. Also, I seem to see larger applied math buy-in than there ever was in the past. Maybe this is the new parallel to how optimization has appeared in a major way in machine learning.

Next, what should we do about it? Number one, of course, is to get engaged, and, as discussed above, I am heartened at least at my own local level.

I generally agree with your comment about funding, and I can think of ways to sell statistics. For example, we should make the above history clear to funding agencies, and point out that in each case there has been a huge waste of resources on people doing a large amount of rediscovery. In most of those areas, by the time the big funding hits, the main ideas are already developed so the funding really just keeps lots of journeymen doing lots of very low impact work, with large amounts of rediscovery of things already known in the statistical community. The sell could be that a better funded statistical community would be a more efficient way to get such things done without all this highly funded re-discovery.

But before making such a case, I suggest that it is important to face up to our own shortcomings, from the perspective of funding agencies. I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas. While I love working with statistical concepts, and have a personal love of new ideas, it has not escaped my notice that I have always been in something of a minority in that regard. Not only do we not choose to reward creativity, we often tend to squelch it. I still remember the first time I applied for an NSF grant. I was ambitious, and the reviews I got back said the problem was interesting, but I had no track record and the reviewers were skeptical of me, so I did not get funded. This was especially frustrating because, by the time I got those reviews, I had solved the stated problem. It would be great if that could be regarded as an anomaly of the past, when folks may have been less enlightened than now. However, I have direct evidence that this is not true. Unfortunately, exactly that cycle repeated itself for one of my former students in this very last NSF cycle.

What should we do to deserve more funding? Somehow we need a bigger tent: one big enough to include the creative folks who will come up with the next really big ideas, the folks who are going to spawn the next new communities, such as those listed above. This is where research funding should really go to be most effective.

Maybe more important, we need to find a way to create a statistical culture that reveres new ideas, instead of fearing and shunning them.

Best, Steve



  1. Michael
    Posted May 29, 2013 at 3:13 am | Permalink

    Sorry guys, but for me this whole post is whining, and it can be condensed to one phrase, “Give us the money”, or maybe a slightly longer phrase, “Give us the money because we already know everything that those silly guys just discovered, and if we had to create the same thing they did we would do it 10 times better. We just didn’t do it because we are so great that we didn’t have time for such petty tinkering.”

  2. Posted May 29, 2013 at 7:52 am | Permalink

    It is hard not to seem like one is whining, but I have noticed this same pattern too.

    My guess was that what most statisticians aspire to do (bring sound mathematical analysis to inference problems) was not perceived as the key opportunity for furthering these “exciting breakthroughs”. Today there are many more statisticians who aspire to also produce computational capacities and resources, so things might be better for them.

    There is also the sociological work required to fit in with the community that acts as custodian of what the “exciting breakthroughs” are valued for, and many statisticians still feel comfortable making those who bring less-than-sound mathematical analysis to problems feel very uncomfortable (hence the joke by Geoffrey Hinton (neural nets) that he didn’t know whether making friends with a statistician was easier than learning statistics).

    • Entsophy
      Posted May 29, 2013 at 4:11 pm | Permalink

      I confess this sort of stuff rubs me the wrong way as well. I don’t doubt the sincerity or good intentions of Marron or Wasserman one bit, but the idea that statisticians form a tribe and that we should band together to extract tax dollars to benefit the tribe leaves me cold. I don’t naturally feel any tribal affiliation with other statisticians and couldn’t care less whether they ever receive another NSF grant. Why should I?

  3. Posted May 29, 2013 at 10:46 am | Permalink

    “I think each attracted much more NSF funding than all of statistics ever did, at any given time. A large share of the funding was used for re-invention of ideas that already existed in statistics (but would get a sexy new name).”

    Sounds like an opportunity to pick out a statistical idea or two that EE/CS haven’t popularized yet, give it a sexy new name, hire a plant in EE/CS to “discover” it and get funding, and funnel the money back to statistics 🙂

  4. rry223
    Posted May 29, 2013 at 4:38 pm | Permalink

    The truth is, somewhere a mathematician/probabilist is having a good laugh seeing all the comments on big data / data science made by so-called ‘machine learners’, ‘statisticians’, ‘neural net guys’ and the like…

  5. Untenured Critic
    Posted May 29, 2013 at 5:38 pm | Permalink

    If I may ask a potentially rude question — whose fault is it that the Statistics community was overlooked during the birth of these new ideas (or “new” ideas depending on your point of view)?

    It sounds like being denied funding did serve its purpose: it led to some soul searching, as evidenced by this post.

    I realize that this sounds harsh, and a little bit of “first they cut funding for Statisticians and I didn’t say anything…”. But I do think this discussion is necessary and wouldn’t have happened if the Statistics community was comfortably and securely funded.

  6. Posted May 29, 2013 at 10:10 pm | Permalink

    I don’t know the author of the piece Wasserman is citing, but I’m a little surprised that some describe it as whining or pumping for more stat resources, given that it says: “I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas.” But the most important question is whether it is true “that in each case there has been a huge waste of resources on people doing a large amount of rediscovery. In most of those areas, by the time the big funding hits, the main ideas are already developed so the funding really just keeps lots of journeymen doing lots of very low impact work, with large amounts of rediscovery of things already known in the statistical community.”

    There seems to be plenty of specific evidence that it is true; nor is this kind of thing the slightest bit surprising to anyone who has been in just about any field for 15-20 years (more so if longer). The drive, in that case, is not for more funding, nor self-interested, but comes from caring about the field, about a correct understanding, and about how to make genuine progress. The idea that statisticians “should make the above history clear to funding agencies” strikes me as an excellent one. It’s just about that time that people may be noticing that there hasn’t been a whole lot of progress in some new-fangled attempts. In this age of austerity, it might be wise to consider that a depth of foundational understanding, from which novel ideas can soundly blossom, could speed up progress. Just a thought from the philosopher.

    • Posted May 30, 2013 at 12:22 am | Permalink

      “with large amounts of rediscovery of things already known in the statistical community”

      Most fields have been stagnant for 30-60 years or so, while the volume of papers published has increased exponentially. Large numbers of those papers are thinly veiled repackaging of some previous trivial advances. So this “rediscovery” stuff isn’t a phenomenon unique to Statistics. The psychologists could just as easily claim that statisticians are constantly rediscovering their work and request that all future NSF funding be directed to them. Come to think of it, I. J. Good seems to have anticipated almost everything in statistics so maybe NSF grants should only go to him and no one else.

      “I can see a strong reason why it DOES NOT make sense to fund our community better. That is our community wide aversion to new ideas.”

      It’s because of the easy and generous funding that the statistics community is averse to new ideas. Most of the big advances in statistics occurred when funding for statistics was dramatically less than today, and most of it was done by people who weren’t primarily Statisticians. If Statisticians stopped chasing after fads/grants, and returned to those wilder days, it would shake the field up nicely and Statistics would get along just fine. Hell, it might even thrive. Statisticians aren’t exactly helpless babes-in-the-woods when it comes to making money.

    • Posted May 30, 2013 at 8:23 am | Permalink

      There is an unavoidable apparent conflict of interest: it’s statisticians’ students, colleagues and employers who will get the money.

      And people are unavoidably tribal but I would rather define my tribe in terms of ideas rather than discipline (in the sense of Machiavelli’s virtue).

      Now, what Steve claims is maybe true (I certainly believe him), but making these excited folks listen to a statistician placed on their grant will probably do more harm than good.

      An alternative would be to put funding into translation research for transferring statistical understanding to _promising_ faculty from other backgrounds. Recall the Geoffrey Hinton comment earlier: it may be easier to learn statistics than to put up with a statistician and their groupthink (which all experts are prone to get stuck in).

  7. Posted May 31, 2013 at 1:42 am | Permalink

    I think there is a tendency for statisticians to look at a project in big data or machine learning or whatever and only notice the statistical aspect of the project. Perhaps it would be wiser to dig into these projects and understand what is NOT statistical about them, because those aspects might be what is really leading to success. As a machine learning person, I look at most of the Big Data work and have a hard time finding any machine learning there. It is indeed almost entirely about the challenges of computation, data management, distributed computing that must be solved to apply what are for the most part very simple methods to huge amounts of data. The real intellectual opportunities are to find sampling, hashing, and projection tricks that match the capabilities of today’s computer hardware–that is, to take advantage of the fact that an exact answer may not be required. This is fundamentally cross-disciplinary, but the disciplines are computer systems, randomized algorithms, and perhaps high-dimensional computational geometry.

    One exception to this is the work on deep neural networks, in the sense that it involves both big data and something very interesting from a machine learning/statistics perspective. As far as I can tell, this is an area where experimental algorithms are far ahead of the theory. Perhaps statisticians hold some of the keys to understanding these methods?
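[Editor's aside] One concrete instance of the "sampling tricks" the comment above mentions is reservoir sampling, which keeps a uniform random sample of a data stream in memory that is independent of the stream's length, trading exactness for scalability. A minimal sketch in plain Python (the function name and toy stream are illustrative, not from the post):

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Uniform random sample of k items from a stream of unknown length,
    using O(k) memory (Vitter's Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            # Replace a reservoir slot with probability k/(i+1); this keeps
            # every item seen so far equally likely to be in the sample.
            j = rng.randrange(i + 1)
            if j < k:
                sample[j] = item
    return sample

# A 10-item uniform sample of a million-element "stream" that is never
# materialized in memory all at once.
sample = reservoir_sample(range(1_000_000), 10)
```

The point of the trick is exactly the one made above: a single pass, bounded memory, and an approximate (sampled) view of the data rather than an exact one.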

  8. Posted June 3, 2013 at 9:09 pm | Permalink

    “NSF funding than *all of statistics* ever did”

    Nice easter egg.

    Steve has a point. Trevor Hastie himself dubbed the SVM just another optimization program, and it certainly is one, with the exception that its goal is to produce results rather than interpretability (as would be the case with a classical linear model), which is usual in ML/Data Mining.
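[Editor's aside] For readers who haven't seen the SVM written out as an optimization program: below is a minimal sketch, in plain Python, of a linear SVM trained by stochastic subgradient descent on the regularized hinge loss, with Pegasos-style decaying step sizes. The function name and toy data are illustrative only.

```python
import random

def train_linear_svm(points, labels, lam=0.01, epochs=200, seed=0):
    """Linear SVM as an optimization program:
        min_w  (lam/2)*||w||^2 + (1/n) * sum_i max(0, 1 - y_i * <w, x_i>)
    solved by stochastic subgradient descent; labels must be +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    t = 0
    for _ in range(epochs):
        for i in rng.sample(range(len(points)), len(points)):
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            x, y = points[i], labels[i]
            margin = y * sum(wj * xj for wj, xj in zip(w, x))
            # Subgradient step: shrink w, and if the margin constraint is
            # violated, also move toward classifying (x, y) correctly.
            if margin < 1:
                w = [(1 - eta * lam) * wj + eta * y * xj
                     for wj, xj in zip(w, x)]
            else:
                w = [(1 - eta * lam) * wj for wj in w]
    return w

# Toy linearly separable data: class is the sign of x0 - x1.
pts = [(2.0, 0.0), (1.5, -1.0), (0.0, 2.0), (-1.0, 1.5)]
ys = [1, 1, -1, -1]
w = train_linear_svm(pts, ys)
preds = [1 if sum(wj * xj for wj, xj in zip(w, p)) > 0 else -1 for p in pts]
```

Nothing here is about interpretability; the entire procedure is the minimization of an objective, which is the sense in which the SVM is "another optimization program".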

  9. rj444
    Posted June 6, 2013 at 8:34 am | Permalink

    I don’t understand the funding complaints. My impression is that most biomedical / social science / psychology / epidemiological studies are practically required to also fund statisticians. I don’t think there’s any other field that has this kind of funding guarantee built into the system. Furthermore, while other scientists see their field’s funding go up and down, statisticians are inherently hedged in their institutional role.

    This may also partly explain the conservativeness of the field. Statisticians have become arbiters of legitimacy in the fields involving complex systems where hypotheses tend to be weak and lack mechanistic foundations. To some extent, the exploration of radically new approaches and rethinking of previous approaches tends to undermine the role they play in the institution of research.

    • Posted June 6, 2013 at 8:53 am | Permalink

      That’s not the same as funding for basic research.

  10. April
    Posted June 26, 2013 at 10:15 pm | Permalink

    I’m late to the party, but this post popped back up in feedly this week.

    I think I’ve gained a different perspective on this being away from CMU this past year. It’s something close to what tdietterich said about understanding the other parts of the problem besides just the modeling.

    The thing that’s novel about ‘big data’ is how computers can capture actions & interactions that were almost inconceivable before. At least that’s the case in education. Math educators left statistical/psychometric models for qualitative research decades ago because the data collection methods were incapable of capturing what students were actually doing and the details of what students really understood. The problem wasn’t statistical inference; it was the data. Now that we can have students practice things on the computer, and we can log each and every action, the level of inference that’s possible has changed dramatically. At CMU this isn’t a revelation: Koedinger & Aleven and everybody else have been building this capacity for a long time. But explaining what is and isn’t yet possible to people who have been doing narrative/phenomenological/qualitative research their whole careers is extremely difficult, but also extremely exciting.

    This “richer” data makes it possible to answer new sets of questions. And yes, some of the new answers will almost certainly be the same as old answers, but some won’t.

  11. rkenett
    Posted July 15, 2013 at 10:17 am | Permalink

    “Whining won’t help” and, by now, we are past the wake-up call. So first the facts, as I see them.
    1. The “true” statistician is not alone anymore in developing data analysis methods. Physicists, Industrial Engineers, Biologists, Management and Computer Scientists make important contributions, sometimes reinventing the wheel, in many cases breaking new ground (for example in SNA).
    2. The brand “Statistics” has serious image problems. It is not recognized as a field contributing to discovery but more as policing the work of domain experts in their effort to publish.
    3. The communication skills of statisticians are traditionally poor. They use a strange language that others cannot relate to, and are sometimes actually scared by.

    What is the antidote? Three possible directions:
    1. Expand the role of statistics to take a life-cycle view, starting with problem elicitation of unstructured problems. We need methods and theoretical development on how to capture the goals of marketing experts and translate them into statistical tasks. At the end of the cycle, effective communication tools and methods should also be part of the statistics curriculum.
    2. Develop tools and methods for impact assessment. Statisticians need to be better at showing the impact of their work. This can be achieved with proper models and impact assessment methodology. This needs to be developed and taught.
    3. Improve the generation of knowledge by ensuring that the analysis derives information of high quality from a given data set. The concept of InfoQ is a small contribution in this direction. More needs to be done.

    This thread is discussing a serious problem. The trilogy above is an attempt to deal with it.
    I wrote more about all this in

