Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

*When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.*

*If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.*

Well put.

Maybe I am just pessimistic and am just imagining that statistics is getting left out. Perhaps, but I don’t think so. It’s my impression that the attention and resources are going mainly to Computer Science. Not that I have anything against CS of course, but it is a tragedy if Statistics gets left out of this data revolution.

Two questions come to mind:

1. Why do statisticians find themselves left out?

2. What can we do about it?

I’d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.

- Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like.
The scientist goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.

Who would you go to?

I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.

- Generally speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of 10 billion data points, each of dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.

On to question 2. What do we do about it?

Whining won’t help. We can complain that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumptions, and so on. No one is listening.

First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.

Second, we need to hire CS people to be on the faculty in statistics departments. This won’t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?

Third, statistics needs a separate division at NSF. Simply renaming DMS (Division of Mathematical Sciences), as has been debated, isn’t enough. We need our own pot of money. (I realize this isn’t going to happen.)

To summarize, I don’t really have any ideas. Does anyone?

## 11 Comments

1. Why do statisticians find ourselves left out? Perhaps partly because we market ourselves as the experts on standard errors, confidence intervals, etc. … BUT we haven’t sold people on *why* they should use SEs, CIs, etc. in the first place.

People know to find us if they ever want measures of precision — they just don’t get the need or craving for these measures.

We *tell* them precision is important, but don’t *show* the pitfalls of ignoring it in decision-making.

For example: the Census Bureau spends vast efforts on producing and publishing reliable margins of error for each estimate. But many data users just delete those columns as soon as they download our data. If you’re a local official arguing your hometown needs a grant because your poverty rate is higher than your neighbors’, why bother checking that they’re statistically-significantly different? The grant committee will never care — they’ll just look at the point estimates too. And even if they saw they’re not-significantly-different, what else could they do but use the point estimates?

We all know these things happen anecdotally, but I also know a colleague who’s performing a study about this issue, and so far it confirms this as a common problem. Of course the study results are not significant yet 🙂

2. If so, maybe we need more real-life horror stories of people who acted on a not-significant difference and ran into trouble as a consequence.

(I can think of long-term examples, like bad nutrition advice affecting millions of people slowly, but not many good short-term immediate-impact examples.)

This might encourage more people to hire statisticians, or just encourage them to demand proper SEs from machine learners & data scientists. I’d see either outcome as a success.
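To make the Census example in the comment above concrete, here is a minimal Python sketch of the check the local official could run before claiming a higher poverty rate. It assumes the published margins of error are at the 90% confidence level (so the standard error is the margin divided by 1.645) and that the two estimates are independent. The function names and the town figures are hypothetical, chosen only for illustration.

```python
import math

# Assumption: margins of error are published at the 90% confidence level.
Z_90 = 1.645


def moe_to_se(moe, z=Z_90):
    """Convert a published margin of error into a standard error."""
    return moe / z


def difference_z(est_a, moe_a, est_b, moe_b, z=Z_90):
    """z-score for the difference of two independent estimates."""
    se_a = moe_to_se(moe_a, z)
    se_b = moe_to_se(moe_b, z)
    return (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)


def is_significant(est_a, moe_a, est_b, moe_b, z=Z_90):
    """True if the two estimates differ at the confidence level implied by z."""
    return abs(difference_z(est_a, moe_a, est_b, moe_b, z)) > z


if __name__ == "__main__":
    # Hypothetical poverty rates (percent) with their margins of error.
    hometown = (18.0, 3.5)
    neighbor = (15.0, 4.0)
    z_score = difference_z(*hometown, *neighbor)
    print(f"z = {z_score:.2f}")  # roughly 0.93, well below 1.645
    print("significant:", is_significant(*hometown, *neighbor))
```

With these made-up numbers, a three-point gap in the point estimates is not distinguishable from noise, which is exactly the situation the deleted margin-of-error columns would have revealed.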

This is just an update. Participating in this blog inspired me to learn about Data Science, and I attended a meeting on Big Data Science. It was a session on the Apache Hadoop network systems that are being used to manage the input-output and processing of big data. As far as I can see, there are opportunities for many applications and studies. However, the statistician needs to learn HOW the data moves in these intricate systems and where the best window of opportunity is for applying statistical tools. There are many such systems in industry at present that are processing Big Data. The only question I have is how one quantifies the adjective “Big”. So far, without any strict definition of Big Data in terms of megabytes, terabytes or petabytes, it seems like a subjective choice. But as for the word “Science”, it remains unaffected. Most CS participants are very open about statistical or mathematical tools.

Data Science? Surely just a name. ‘Big Data’, ditto.

Someone said “Data science is more than statistics: it also encompasses computer science and business concepts”.

No. Data Science is statistics. It’s just done in a different way – using computers to do things that are now possible because of speed and size. Tack on some iffy projections and there you go.

Business concepts? I don’t think so.

Used by business? Certainly. But then business has never really been in the position of understanding the ‘latest thing’ that the field of computing throws at it. And ‘Data Science’ is just the latest of many manifestations of stuff that has been created and thrown at business with the description of “You’ll need this. It’s the latest thing. It’ll change your life”. It won’t, of course. Any more than the invention of databases did. Or Chaos theory (very tied up with computational error if you really investigate it). Or “Artificial Intelligence”. Or “the cloud”.

I am amazed that anyone thinks that just because a data set is large it should terrify people – or that it should be treated any differently.

All I’m seeing is a bandwagon. And, at some point, the wheels will come off.

(Me – 40+ years in computing, lots of stats and time series analysis in my PhD many years ago. I have seen absolutely nothing ‘new’ in data science at all. I wish I had).

In “Big Data Science”, at least no one will complain about small samples. There are enormous opportunities to verify the Law of Large Numbers and the Central Limit Theorem, if one wants to. To me, it seemed that it is not the “End of Statistics” but the beginning of a New Era for the Optimistic Statisticians.

I do not want to come across as trying to have the last word. Any comment is as valuable as any other. A long time ago, I admired Betty Scott, Grace Wahba and some others who had the courage to be the Alpha Females.

PS: I should also add Nancy Mann.

Hello,

As a poor statistician, I find it difficult to talk to people with a prestigious research background in particle physics who gave up physics (the Higgs boson knows why) to work on Big Data elsewhere.

But pointing out that all appliances, computers, etc. are based on a theory built more than 50 years ago (even before WWII) will certainly put an end to their arrogance 🙂

I also intend to publish something about econometrics, a science that has produced many Nobel prizes, to make it clear that mathematical statistics is the same as computing p-values on samples only for the ignorant.

CC

Thank you, Larry, for launching this thread. As you write, “Whining won’t help”, and by now we are past the wake-up call. So first the facts, as I see them.

1. The “true” statistician is not alone anymore in developing data analysis methods. Physicists, Industrial Engineers, Biologists, Management and Computer Scientists make important contributions, sometimes reinventing the wheel, in many cases breaking new ground (for example in SNA).

2. The brand “Statistics” has serious image problems. It is not recognized as a field contributing to discovery but more as policing the work of domain experts in their effort to publish.

3. The communication skills of statisticians are traditionally poor. They use a strange language that others cannot relate to and are sometimes actually scared by.

What is the antidote? Three possible directions:

1. Expand the role of statistics to have a life cycle view, starting with problem elicitation of unstructured problems. We need methods and theoretical development on how to capture the goals of marketing experts and translate them into statistical tasks. At the end of the cycle effective communication tools and methods should also be part of the statistics curriculum.

2. Develop tools and methods for impact assessment. Statisticians need to be better at showing the impact of their work. This can be achieved with proper models and impact assessment methodology. This needs to be developed and taught.

3. Improve the generation of knowledge by ensuring information of high quality is derived from a given data set by the analysis done. The concept of InfoQ is a small contribution in this direction. More needs to be done.

This thread is discussing a serious problem. The trilogy above is an attempt to deal with it.

I wrote more about all this in http://ssrn.com/abstract=2171179

http://www.statisticsviews.com/details/feature/4812131/For-survival-statistics-as-a-profession-needs-to-provide-added-value-to-fellow-s.html

I don’t think statisticians are being entirely left out; I’d say the opposite is true. Aside from a statistics course in college, I was truly introduced to statistics through my experience as an investment analyst and fund manager. There we used terms like performance analytics, but relied heavily on the methods of statistics. The CFA classifies statistics as part of a more comprehensive category called “quantitative methods” in its exam curriculum. I have taken what I learned from that experience and have since moved on from investments (to some degree) to dabble in web development.

I have a history of programming in R (the r-bloggers site brought me here), so I am excited to see packages like Shiny that bring HTML and JavaScript functionality to R.

I believe statistics is getting more attention than ever before in history.

An interesting view on this subject in less than 40 characters: https://twitter.com/BigDataBorat/status/372350993255518208

Your problem is that you are a scientist and not a marketing specialist. You believe that somebody else can handle 10^15 data points, while you yourself can only show an error. You were trained to be honest, and they were trained to be dishonest. Computers will not fix that.

## 11 Trackbacks

[…] Larry Wasserman of the Normal Deviate blog has a thought-provoking post on how current trends towards “Big Data” and computationally-intensive analyses are […]

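[…] The quiet arrival of the big-data era and the explosive growth of computing power have led all kinds of people who do statistical analysis to take a fresh look at their own skill sets and wonder whether they will soon be washed away by the tide of the times, like sand sifted out by great waves. […]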

[…] colleagues and I have lately been discussing “Big Data”, and your blog was […]

[…] his part, Larry Wasserman worries about statistics being left out. In “Data Science: The End of Statistics?” he […]

[…] Storing your data and analyzing it on the cloud, be it AWS, Azure, Rackspace or others, is a quantum leap in analysis capabilities. I fell in love with my new cloud powers and I strongly recommend all statisticians and data scientists get friendly with these services. I will also note that if statisticians do not embrace these new-found powers, we should not be surprised if data analysis becomes synonymous with Machine Learning and not with Statistics (if you have no idea what I am talking about, read this excellent post by Larry Wasserman). […]

[…] See discussion by Larry Wasserman at https://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ (2) H.G. Wells said, “Statistical thinking will one day be as necessary for efficient citizenship […]

[…] alle”. Where will this end? Will statisticians and analysts become redundant in the future? Statisticians are worried. Opinions differ, but there is much to suggest that knowledge of mathematics and statistics will become […]

[…] of seeing that their subject interests them, and the worry of not being recognized as “those who were there before” and whose skills are assured. They are the first to point out that the best […]

[…] https://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ […]