Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

*When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.*

*If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.*

Well put.

Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.

Two questions come to mind:

1. Why do statisticians find themselves left out?

2. What can we do about it?

I’d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.

- Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like.
The scientist goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.

Who would you go to?

I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.

- Generally speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of 10 billion data points, each of dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.

On to question 2. What do we do about it?

Whining won’t help. We can complain that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumptions and so on. No one is listening.

First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.

Second, we need to hire CS people to be on the faculty in statistics departments. This won’t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?

Third, statistics needs a separate division at NSF. Simply renaming DMS (Division of Mathematical Sciences), as has been debated, isn’t enough. We need our own pot of money. (I realize this isn’t going to happen.)

To summarize, I don’t really have any ideas. Does anyone?

## 76 Comments

In “Big Data Science”, at least no one will complain about small samples. There are enormous opportunities to verify the Law of Large Numbers and the Central Limit Theorem, if one wants to. To me, it seems that it is not the “End of Statistics” but the beginning of a New Era for the Optimistic Statisticians.

I do not want to come across as trying to say the last word. Any comment is as valuable as any other. A long time ago, I admired Betty Scott, Grace Wahba and some others who had the courage to be the Alpha Females.

PS: I should also add Nancy Mann.

Hello,

As a poor statistician, I find it difficult to talk to people with a prestigious research background in particle physics who gave up physics (Higgs boson knows why) to work on Big Data elsewhere.

But stating that all appliances, computers, etc. are based on a theory built more than 50 years ago (even before WWII) will certainly put an end to their arrogance :)

I also intend to publish something about econometrics, a science that has produced many Nobel prizes, to make it clear that mathematical statistics is the same as computing p-values on samples only for the ignorant.

CC

Thank you, Larry, for launching this thread. As you write, “Whining won’t help”, and by now we are past the wake-up call. So first the facts, as I see them.

1. The “true” statistician is not alone anymore in developing data analysis methods. Physicists, Industrial Engineers, Biologists, Management and Computer Scientists make important contributions, sometimes reinventing the wheel, in many cases breaking new ground (for example in SNA).

2. The brand “Statistics” has serious image problems. It is not recognized as a field contributing to discovery but more as policing the work of domain experts in their effort to publish.

3. The communication skills of statisticians are traditionally poor. They use a strange language that others cannot relate to and are sometimes actually scared by.

What is the antidote? Three possible directions:

1. Expand the role of statistics to have a life cycle view, starting with problem elicitation of unstructured problems. We need methods and theoretical development on how to capture the goals of marketing experts and translate them into statistical tasks. At the end of the cycle effective communication tools and methods should also be part of the statistics curriculum.

2. Develop tools and methods for impact assessment. Statisticians need to be better at showing the impact of their work. This can be achieved with proper models and impact assessment methodology. This needs to be developed and taught.

3. Improve the generation of knowledge by ensuring that the analysis derives high-quality information from a given data set. The concept of InfoQ is a small contribution in this direction. More needs to be done.

This thread is discussing a serious problem. The trilogy above is an attempt to deal with it.

I wrote more about all this in http://ssrn.com/abstract=2171179

http://www.statisticsviews.com/details/feature/4812131/For-survival-statistics-as-a-profession-needs-to-provide-added-value-to-fellow-s.html

I don’t think statisticians are being entirely left out; I’d say the opposite is true. Aside from a statistics course in college, I was truly introduced to statistics through my experience as an investment analyst and fund manager. There we used terms like performance analytics, but relied heavily on the methods of statistics. The CFA exam curriculum treats statistics as part of a more comprehensive category called “quantitative methods”. I have since taken what I learned from that experience and moved on from investments (to some degree) to dabble in web development.

I have a background in R programming (the r-bloggers site brought me here), so I am excited to see packages like Shiny that bring HTML and JavaScript functionality to R.

I believe statistics is getting more attention than ever before in history.

An interesting view on this subject in less than 40 characters: https://twitter.com/BigDataBorat/status/372350993255518208

Your problem is that you are a scientist and not a marketing specialist. You believe that somebody can handle 10^15 data points, while you yourself can only show an error. You were trained to be honest, and they were trained to be dishonest. Computers will not fix that.

## 14 Trackbacks

[…] Data Science: The End of Statistics? […]

[…] [3] http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ […]

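[…] Larry Wasserman asks whether data science will be the end of statistics. The media is awash with talk of big data and data mining. People can now talk about data without knowing that there is a discipline in the world called “statistics”. Is “data analysis” with no error analysis and no need for hypothesis testing the end of statistics? Does statistics belong with computer science or with mathematics? Why is statistics being marginalized? How should we respond? […]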

[…] Larry Wasserman of the Normal Deviate blog has a thought-provoking post on how current trends towards “Big Data” and computationally-intensive analyses are […]

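[…] The quiet arrival of the big data era and the explosive growth of computing power have made all kinds of people who do statistical analysis take a fresh look at their own skill sets, to see whether they will soon be washed away by the tide of the times, like sand sifted by great waves. […]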

[…] colleagues and I have lately been discussing “Big Data”, and your blog was […]

[…] his part, Larry Wasserman worries about statistics being left out. In “Data Science: The End of Statistics?” he […]

[…] Storing your data and analyzing it on the cloud, be it AWS, Azure, Rackspace or others, is a quantum leap in analysis capabilities. I fell in love with my new cloud powers and I strongly recommend all statisticians and data scientists get friendly with these services. I will also note that if statisticians do not embrace these new-found powers, we should not be surprised if data analysis becomes synonymous with Machine Learning and not with Statistics (if you have no idea what I am talking about, read this excellent post by Larry Wasserman). […]

[…] See discussion by Larry Wasserman at http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ (2) H.G. Wells said, “Statistical thinking will one day be as necessary for efficient citizenship […]

[…] alle”. Where will this end? Will statisticians and analysts become superfluous in the future? Statisticians are worried. Opinions differ, but there is much to suggest that knowledge of mathematics and statistics will become […]

[…] of noting that their subject interests them, and the worry of not being recognized as “those who were there before” and whose skills are assured. They are the first to point out that the best […]

[…] http://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ […]