Data Science: The End of Statistics?

As I see newspapers and blogs filled with talk of “Data Science” and “Big Data” I find myself filled with a mixture of optimism and dread. Optimism, because it means statistics is finally a sexy field. Dread, because statistics is being left on the sidelines.

The very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming. I like what Karl Broman says:

*When physicists do mathematics, they don’t say they’re doing “number science”. They’re doing math.*

*If you’re analyzing data, you’re doing statistics. You can call it data science or informatics or analytics or whatever, but it’s still statistics.*

Well put.

Maybe I am just pessimistic and am imagining that statistics is getting left out. Perhaps, but I don’t think so. My impression is that the attention and resources are going mainly to Computer Science. Not that I have anything against CS, of course, but it would be a tragedy if Statistics were left out of this data revolution.

Two questions come to mind:

1. Why do statisticians find themselves left out?

2. What can we do about it?

I’d like to hear your ideas. Here are some random thoughts on these questions. First, regarding question 1.

- Here is a short parable: A scientist comes to a statistician with a question. The statistician responds by learning the scientific background behind the question. Eventually, after much thinking and investigation, the statistician produces a thoughtful answer. The answer is not just an answer but an answer with a standard error. And the standard error is often much larger than the scientist would like. The scientist then goes to a computer scientist. A few days later the computer scientist comes back with spectacular graphs and fast software.

Who would you go to?

I am exaggerating of course. But there is some truth to this. We statisticians train our students to be slow and methodical and to question every assumption. These are good things but there is something to be said for speed and flashiness.

- Generally speaking, statisticians have limited computational skills. I saw a talk a few weeks ago in the machine learning department where the speaker dealt with a dataset of 10 billion points, each of dimension 10,000. It was very impressive. Few statisticians have the skills to do calculations like this.

On to question 2. What do we do about it?

Whining won’t help. We can complain that “data scientists” are ignoring biases, not computing standard errors, not stating and checking assumptions, and so on. But no one is listening.

First of all, we need to make sure our students are competitive. They need to be able to do serious computing, which means they need to understand data structures, distributed computing and multiple programming languages.

Second, we need to hire CS people onto the faculty of statistics departments. This won’t be easy: how do we create incentives for computer scientists to take jobs in statistics departments?

Third, statistics needs a separate division at NSF. Simply renaming DMS (the Division of Mathematical Sciences), as has been debated, isn’t enough. We need our own pot of money. (I realize this isn’t going to happen.)

To summarize, I don’t really have any ideas. Does anyone?

## 65 Comments

CS is the first or second highest-paid academic field, so it may be difficult to recruit top staff. However, you may not need top staff to get excellent teachers for your graduate students. Joint appointments may also be a possibility; since CMU has such a strong CS department, that might be an attractive offer.

Demonstrations of real-world situations where machine learning leads to inferior predictions are probably the best way to persuade people to listen to statisticians more generally.

That all said, when I took your mathematical statistics class ten years ago there were many CS students enrolled. Maybe the easiest thing to do is to have students take machine learning classes in the CS department. They can bring what works into their own research and practice.


I’d say it comes down to marketing a product to be sold to CTOs at companies. If you knock on the door of a company saying “I have a product that does statistics,” you might get a “we already have that” answer.

But if you say “Oh no, you have ‘Business Intelligence’; I am selling you ‘Analytics’, totally different,” and a couple of years afterwards “Oh no, you have ‘Analytics’; I am selling you ‘Big Data’, totally different,” then you keep selling the same product (statistics) under different names on a regular basis. So:

1. Why do statisticians find themselves left out? Bad marketing.

2. What can we do about it? Good marketing.

Most managers in the private sector do not care about how things work, only whether they work, and often they are happy if the product gives the illusion of working. So we can tell them about a thousand marginal benefits of using this or that, but they only care about how good the new product is going to look in their next PowerPoint presentation and whether it will do a decent job so that they look just as good in front of their boss.

So maybe you need to hire experts in Marketing rather than in CS for the statistics department. But if you still insist, we have a 26% unemployment rate in Spain… just saying 😛

PS: I still hate not being able to delete/edit replies in WordPress.

Labels don’t matter. If there is value to society in a discipline, there will be a role for it no matter what it is called and no matter how much other disciplines encroach on its focus. Once a discipline stops adding value, it withers away. I know nothing of math, computer science, statistics, or Latin, but I feel confident that statisticians are safe for as long as we have a society.

Welcome to the blog Sean!

I like your optimistic attitude.

You are right that statistics will not disappear.

But we statisticians need to do a better job of marketing ourselves.

We need a Carl Sagan.

(Jokes about billions of data points aside.)

We all need to be salespeople. If we want people to follow us and there is another choice as to whom to follow, we need to sell what we have to offer. The key is finding the sales style that fits your personality, your audience, and the wares you sell.

No matter what one calls it, Statistics remains Statistics, and it is described by that very name. Numbers, data, data science, analytical science, etc. It is not popular because people do not understand and do not like the role of “errors” associated with “numbers”. It does not click. But that is the beauty of Statistics, which attracted me to become a Statistician. Mathematics is an exact science, with no room for any errors. It’s fundamental. But Statistics is not. It studies the errors. Math is pure, Stat is real. And they cannot live without each other.

Try to sell Statistics? Just not possible. So, if Data Science or Informatics can, so be it. Still they are learning that same thing.

So, what’s in a name? Call Statistics by any name, it’s still the same.

This is a topic that really hits home.

Full disclosure: I started out in Prob./Stat, and was one of the founders of the Stat. Dept. at my university, UC Davis. Around the same time I became really interested in CS, and joined the new CS Dept. (A deal to have a joint appointment fell through for complex reasons.) I’ve spent most of my career in CS, so I’ve been the one on the sidelines in terms of statistics. I still use statistics in much of my research (though not the fancy stuff), and consider myself a statistician. (Why else would I be reading this blog?)

Second full disclosure: I’m currently writing a book with “Data Science” in the title, “Parallel Computing for Data Science.” I have misgivings about the term, but find it useful.

Having said all that, I agree strongly with Larry’s points. And no, he is not imagining it at all. Just the other day, I was told by a user of “data science” in the business community, “With data science, one doesn’t have to use statistics at all.” He was of course referring to “machine learning” methods that are essentially just forms of nonparametric curve estimation, a topic founded by statisticians 50 years ago. This person seemed to associate “statistics” with parametric models; the nonparametric stuff is CS to him, not Stat.
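To make the “nonparametric curve estimation” point concrete, here is a minimal sketch of a Nadaraya–Watson kernel smoother, one of the classical statistical estimators the comment alludes to. The data, bandwidth, and evaluation point are invented for illustration:

```python
import numpy as np

def kernel_smooth(x_train, y_train, x_new, bandwidth=0.5):
    """Nadaraya-Watson estimator: a locally weighted average of the
    observed responses, with Gaussian kernel weights on distance."""
    x_new = np.atleast_1d(x_new)
    # One row of weights per query point
    w = np.exp(-0.5 * ((x_new[:, None] - x_train[None, :]) / bandwidth) ** 2)
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Estimate the regression function at x = pi/2 (true value sin(pi/2) = 1)
fit = kernel_smooth(x, y, np.pi / 2)
print(float(fit[0]))
```

Nothing here is “machine learning” in any sense a statistician of the 1960s would not recognize; the same weighted-average idea underlies much of modern nonparametric prediction.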

I think the quality of the statistics work done in CS departments is generally awful. Most of it takes an “engineering” point of view, by which I mean, “Well, we tried this and that, and some parts did OK on some data sets,” without even a basic notion of what it is we’re trying to estimate. No model, no recognition of sampling. They use cross-validation a lot, implying there is randomness in the data, but have no idea as to what that might mean.

For me, all this was epitomized in an incident in which a new PhD in CS was interviewing for a faculty position in my (CS) department. The PhD had been earned at a school that is easily “top 5,” actually much better, in both CS and Stat., on a topic in genomics. The speaker had used a Bayesian analysis [sorry to bring up the topic again 🙂 ]. Fine, but when asked why that particular prior had been used, the speaker could not answer, and in fact was puzzled why anyone would even care.

I think this Data Science thing has brought to a head a fundamental problem that Stat. has always had: A bad image. Xiao-Li Meng has written a lot on this, of course. The very name sounds boring. In my humble opinion, AP Statistics in high schools has exacerbated the problem, as it is taught in uninspiring form, often by teachers who don’t know the subject beyond what’s in the basic course. So “we” in Stat. have allowed “them” in CS to usurp the field, in the sense of taking over the applications and the attention.

It’s always been interesting to me that the engineers and scientists have a particularly low opinion of Stat., a fascinating consequence of which is that none of the top tech schools–MIT, Caltech, Georgia Tech–has a Stat. Dept.

As to having CS people in Stat. Depts., in my observation that hasn’t worked. Though for obvious reasons I need to avoid specific examples, from what I’ve seen computing-oriented people in Stat. Depts. have not been well received.

End of rant. 🙂

Interesting observation about MIT, Caltech, and Georgia Tech.

I am the CTO at an analytics company and the employer of a data scientist. I think you are being overly alarmist. Our minimum requirement for the position was a Masters or PhD degree in statistics. If memory serves, we didn’t have many CS grads applying; on the other hand, we did have quite a few Physics grad applicants displaying that field’s common arrogance (“physics is hard, therefore everything else is easy”), most of whom were blissfully unaware of basic concepts such as overfitting and cross-validation.

That said, data science is 80% about the logistics of collecting data from disparate sources, cleansing it, and being able to process it at scale, and only 20% applying statistical techniques. In other words, it is a practicum, just as what used to be called Operations Research was one for discrete mathematics. Yes, young statisticians in training should be exposed to and brought to proficiency in tools like SQL and R and the ability to write number-crunching scripts in Python/Perl/whatever, but I believe this is largely the case in most stats programs across the country.
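The “80% logistics” claim is easy to illustrate: most of the effort goes into unglamorous cleansing before any statistics happens. A toy Python sketch of that step, where the raw export, field names, and cleaning rules are all invented for illustration:

```python
import csv
import io

# Toy raw export: inconsistent case, stray whitespace, missing values
raw = """name, age ,city
Alice ,34,pittsburgh
BOB,,Pittsburgh
alice,34, Pittsburgh
"""

def clean_rows(text):
    """Normalize fields, drop records with missing age, deduplicate."""
    seen, rows = set(), []
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    # Strip whitespace from the header names themselves
    reader.fieldnames = [f.strip() for f in reader.fieldnames]
    for r in reader:
        name = r["name"].strip().title()
        age = r["age"].strip() if r["age"] else ""
        if not age:
            continue  # missing value: drop (one of many possible policies)
        key = (name, int(age))
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        rows.append({"name": name, "age": int(age),
                     "city": r["city"].strip().title()})
    return rows

print(clean_rows(raw))
```

Every choice in there (drop vs. impute missing ages, what counts as a duplicate) is a judgment call, which is exactly why the 20% of statistical thinking still matters.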

In terms of raising awareness of the risks of poorly applied cookbook-style folk statistics, there are a number of widely read statistician bloggers like Prof. Shalizi who do an excellent job of debunking poor analysis and gross errors of interpretation. Surprise, surprise, they are often made by physicists or economists, and show the Dunning-Kruger effect in all its glory. Separating those articles from the more technical ones could help, perhaps compiled in a “Journal of data science epic failure”?

I am curious as to why you required a degree in statistics, especially if 80% of the job is actually data wrangling and large scale computing, which is probably on the whole done by CS people.

I wish I could edit it.

I meant to say done _better_ by CS people.

There is not much science to data wrangling and large-scale computing, so you hardly need someone with a heavy CS theory background. Even if only 20% of the job requires a strong stats background, that’s a lot harder to pick up than the programming side. A grad degree in CS doesn’t tell you much about programming skill anyway.

Kudos to Dr. Matloff for describing a scenario familiar to some of us Statisticians who ventured into teaching, research, and training in industry with high hopes of appreciation. Amazingly, the engineers and scientists, including physicists and astronomers, often do agree on how important Statistics, with its uncertainty, probability, and stochastic models, is to their scientific ventures. But it is considered only a necessary “tool”, like a software package. I am retired now, but I found that the only time a student got interested was when software was used to handle a problem and reach its conclusion. That saves a student from ‘learning’ the intricacies of the methodologies.

Dr. Matloff has also made a passing remark on Bayesians. I know how important it is for regular ‘Classical’ statisticians to give a proper place to the Bayesians. It is like the “Parametric” vs. “Non-Parametric” situation among Statisticians. But, other than generating mathematically intractable functions, Bayesian methods are the best way to capture the essence of the variability in everything. We just cannot assume any parameter to be a constant quantity over a long period of time. Even the Universe has been changing since its creation. Like it or not, there is always a probability associated with a parameter, except perhaps in a situation where only a very short time is observed and a parameter can be considered a constant quantity.

Now, with super-fast computers, those intractable functions no longer need to appear in analytically closed form. Advanced stochastic models with techniques such as MCMC have done wonders in resolving complicated mathematical problems. Here CS has done wonders, hopefully in collaboration with some Statisticians.
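As a concrete illustration of the MCMC point, here is a minimal Metropolis sampler targeting a posterior known only up to its normalizing constant, so no closed-form integral is ever needed. The model (a Normal likelihood with a wide Normal prior) and the data are invented for illustration:

```python
import math
import random

def log_post(theta, data):
    """Unnormalized log posterior: Normal(0, 10) prior on theta,
    Normal(theta, 1) likelihood. Only the ratio matters, so the
    normalizing constant never has to be computed."""
    lp = -theta**2 / (2 * 10**2)                      # log prior
    lp += sum(-(x - theta)**2 / 2 for x in data)      # log likelihood
    return lp

def metropolis(data, n_iter=20000, step=0.5, seed=42):
    rng = random.Random(seed)
    theta, samples = 0.0, []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0, step)
        # Accept with probability min(1, posterior ratio)
        if math.log(rng.random()) < log_post(prop, data) - log_post(theta, data):
            theta = prop
        samples.append(theta)
    return samples[n_iter // 2:]   # drop the first half as burn-in

data = [1.2, 0.8, 1.5, 0.9, 1.1]
draws = metropolis(data)
print(sum(draws) / len(draws))  # posterior mean, near the data mean here
```

With a nearly flat prior and this likelihood, the posterior mean lands close to the sample mean of the data, which is a useful sanity check on the sampler.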

If this had been the case about 50 years ago, I would not have had to bear the humiliation, for so many years, of not being able to solve an integral for a specific posterior probability in analytically closed form in my thesis.

” After all women are limited in their Mathematical abilities”

was the conclusion of one of the committee members.

In my humble opinion, the time has come for us to embrace the technology and try to integrate Math, Statistics, and CS to the benefit of each, or to create a new area combining all three indispensable disciplines, so that students enrolling have to study all of them. In this dream department, Statistics is not an optional topic but an essential one to graduate.

My two cents worth.

Sumedha

Fazal makes some good points, but with the recent rise in academic programs in Data Science, Analytics and the like, I think we’ll see fewer employers like him over time. Actually, in our Bay Area R Users group, I meet a lot of people who do “analytics,” without Stat degrees. For example, there is a husband-and-wife team who formed their own successful analytics consulting firm, and they both have PhDs in CS–ironically from CMU, Larry’s own institution.

I met another guy in our R group with an engineering degree from a top school but essentially no Stat background, and guess what–he’s doing analytics in a high-level position for one of the major social network firms. Tellingly, he expressed shock at finding that “all tests turn out to be significant” in his firm’s huge data sets.
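That “all tests turn out to be significant” observation is easy to reproduce: with enough data, even a practically negligible difference yields a tiny p-value. A hedged simulation, where the effect size and sample sizes are made up for illustration:

```python
import math
import random

def z_test_two_sample(a, b):
    """Two-sample z-test for a difference in means (valid for large n)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p

rng = random.Random(1)
n = 200_000
group_a = [rng.gauss(0.00, 1) for _ in range(n)]
group_b = [rng.gauss(0.02, 1) for _ in range(n)]  # trivial 0.02 SD shift

z, p = z_test_two_sample(group_a, group_b)
print(p < 0.05)  # "significant", though the effect is practically nil
```

The lesson is the statistician’s, not the software’s: at this scale the interesting question is the effect size and its standard error, not whether p crosses 0.05.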

Sumedha, I’m not a Bayesian, but I do believe that any analysis, frequentist or Bayesian, should be well thought out. I agree with your other points.

Dr. Matloff,

Which other points, may I ask?

I hope you do not agree with the remark made by the committee member during my thesis defense? To me, that remark should be treated as a Null Hypothesis to be tested against an appropriate Alternative, with a statistically correct sample and with high confidence.

Perhaps the future will establish its rejection.

I agree with your opinion. Yes, it takes an in-depth study of the problem at hand to proceed with the correct method.

Everything, especially the outcome depends on that.

Spending my whole life with another Ph.D. Statistician (of UCB, who studied under towering figures like Lehmann, Scott, Blackwell, Loeve, Neyman, etc.) with a pure mathematics background has been very interesting. We have perhaps covered all possible topics in Math and Stat (and Computer Science) in our daily debates. I have always studied and worked as an Applied Statistician.

I am a Bayesian and he is not, so the debates have sometimes been a struggle for survival. I just cannot get by with any flaw in the subject or in my argument. Learning went both ways. He finally used Bayesian methods and MCMC to get good results that he could not get otherwise.

But we both agree that Statistics cannot be left anymore to the whims of non-statisticians. It is time to defend it and make it a mandatory subject. It just cannot survive as a ‘second-class’ math topic that everyone needs.

Whether one needs it 10% or 20% of the time is not the point. If one studies it the way it should be studied, then it will be applied properly no matter what percentage of the time one does it. I have worked in industry, and except for very few top companies, no one hires Statisticians; people with some Stat background are preferred instead. There is a notion that Ph.D. Statisticians are only good for teaching. Even in research projects, they have to work as members of a team and can almost never get independent funding, except perhaps a very few.

Qualifications and experience are not always what employers or funding agencies look for.

Sorry to say that.

Sumedha, I agree that the physicists et al just treat statistics as a minor tool. I actually had thought of making the same analogy. I also agree that high-speed computation has changed everything; again, that doesn’t bother me as long as the model is well-specified and understood. I especially like your later phrasing, “Statistics cannot be left anymore to the whims of non statisticians” (provided of course that “statisticians” includes people who have a reasonable knowledge and understanding of the issues, whether via a formal degree or not).

I’m in a master’s program in a math department with some statistics specialists. There seems to be little interest within the department (and perhaps little knowledge) in the rise of ‘data science’ and related fields like ‘machine learning.’ That stuff is ‘computational’ and ‘applied,’ not the kind of thing traditional math students should be pursuing. So at least in this corner of academia, there is still a lot of bridge building to do between the CS and Math worlds.

I am intrigued by your (and many other statisticians’) comment that Stat doesn’t have a big enough pot of money at NSF. I have never understood this: why can’t stat people just apply to the same divisions/agencies that fund “big data” research? Do you really believe that if a statistician working on high-dimensional data (for example) applied to, say, IIS instead of DMS with a good idea, it wouldn’t get funded? I doubt it. Perhaps the culprit is “less pressure to write grants” in Stat departments.

About your second point, I think it is indeed becoming easier for Stat to hire CS. Ironically, the incentive I have heard is “less pressure to write grants”.

PhD student in statistics, heading to Google. I wholeheartedly agree with you on the need to boost computational training for statistics students, and the points you picked out (distributed computing and data structures) seem on-target. However, based on my experience, I’m more concerned about the role of standard errors in your parable.

I’ve been through projects far closer to your parable than I would have liked—especially the “And the standard error is often much larger than the scientist would like” part. In the sciences, the combination of flashy, complex methods with less rigorous error evaluations can be very appealing; dazzle enough, obfuscate your error properties (deliberately or incidentally), and your chances of making a big splash can improve dramatically. Your conclusions might be reversed later, but, why worry about that? These seem to be more systemic issues than statisticians can address alone.

And, of course, the necessary disclaimer: not the views of my department or Google.

Hi

It is statistics, whatever fancy name they give it: big data, whatever. It is about crunching numbers, doing analysis, and drawing conclusions to reach a certain goal, like making more sales, more profit, or helping children.

That last part is something pure statisticians are not very good at: translating a difficult number like a standard error for the layman.

That’s why they are left out of the discussion. When someone wants to make a decision, whether a CEO or a prime minister, he wants a number he can rely on. He wants complex things made easy. Statisticians need to explain things more simply; otherwise he will go to the computer guy.

In a world that is more and more quantitative (Google turns everything into numbers), statisticians should be happy. This is becoming a magical world for them. Numbers all over the place: Facebook, Twitter, Google, big data. How can you complain? It’s like a boy walking into a candy shop. Your dreams came true.

But the statistician has a problem: does he have the toolset to crunch and relate these kinds of numbers?

I would say: work with the computer guy. Let him teach you how to get data out of all the databases and systems; that’s his field. Take a course on databases. They are your sources of data. If you don’t understand databases like the computer guy does, you will never be able to get data out of Facebook and Twitter.

Talk to the marketing guy. He knows how to sell ideas to the CEO. If you can explain to him what a standard error is, you are on your way to the CEO.

The world has changed. Statisticians need to get out of their offices and go to the people. There is more competition. I teach people who have never seen an online spreadsheet how to set up a survey in an hour and reach 1000 people in a day, and afterwards crunch the numbers in seconds. If they can do that, they think they never need a statistician, because they can do it themselves.

Runy

http://Www.calmera.nl

See my online course there about crunching numbers for policy advisors. You still have to explain to me what the standard error is. I will put it in my course.

: o )
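For what it’s worth, the standard error of a sample mean is just s/√n: an estimate of how much the mean would wobble from survey to survey. A minimal sketch, where the survey and all its numbers are invented for illustration:

```python
import math
import random

def mean_and_se(sample):
    """Sample mean and its standard error: s / sqrt(n)."""
    n = len(sample)
    m = sum(sample) / n
    s2 = sum((x - m) ** 2 for x in sample) / (n - 1)
    return m, math.sqrt(s2 / n)

rng = random.Random(7)
# Pretend survey: 1000 respondents rate a product on a 1-10 scale
ratings = [min(10, max(1, round(rng.gauss(7, 1.5)))) for _ in range(1000)]

m, se = mean_and_se(ratings)
# Rough 95% interval: the mean, give or take two standard errors
print(f"mean = {m:.2f}, give or take about {2 * se:.2f}")
```

That “give or take” number is exactly what the parable in the post is about: the honest answer comes with an error bar attached.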

My POV: Statistics is dead important. Dedicated statisticians are not.

We need to teach stats to the comp sci people crunching lots of data.

I’m a statistics PhD student, and I’m TAing a popular “Data Science” class. I’ve noticed that the quality of statistics instruction in the class is quite poor. It is geared toward giving the students the practical skills necessary to wrangle data and apply algorithms to it. I don’t see the class or its philosophy as a threat to the field of statistics, though. I think it’s training people to work in skilled industry jobs that don’t require a ton of thought and that pay pretty well. Basically, they’re becoming skilled craftsmen, which is valuable and important. Statisticians will still maintain a prominent role creating new methods and studying their behavior both theoretically and in practice. Applied statisticians definitely need to keep up with advances in computation, however, because how can you study the behavior of a method in practice if you can’t implement it?

Dr. Matloff,

I think in one of your postings on this blog, you cited an incident involving a Bayesian talk.

I must say that the most difficult part of initiating a Bayesian analysis is the selection of the prior for the parameter at hand. I am not aware of the actual situation, but it becomes a mere mathematical exercise if the prior does not appropriately represent the variability in the parameter. I think years ago, during the sixties, the concepts of the “vague prior”, the “diffuse prior”, and “asymptotically invariant” priors were considered for cases where no information, or only vague information, on that variability was available.

As for very large data sets without any established prior information or probabilities (perhaps in new genetics or genome projects, or in social or health-care areas), where history has not yet been statistically established, it will be very difficult to proceed with any assumptions on the prior for the parameter.

But there are many amazingly wonderful simulations possible these days with the help of fast computers.

“If you’re analyzing data, you’re doing statistics.” This is not a true statement. You might be doing statistics, but not necessarily. Data science is broader than statistics.

I can’t find the reference now, but someone described data science as four activities:

1. Technical analysis — could include statistics, optimization, logistics, financial engineering, decision analysis, simulation, machine learning and other applied math disciplines that might use statistics but don’t necessarily require statistics. Using R, MATLAB, and other commercial and open source tools and languages.

2. General programming — often client or server programming, using tools like C, C++, Java, javascript, python, PHP, etc.

3. Database administration and development — working with SQL and NoSQL databases.

4. Data Visualization — producing accessible and informative results and sometimes interactive visualizations.

People who primarily program think of themselves as developers. People who primarily do data visualizations think of themselves as designers. People who build agent based simulations or optimizations generally don’t call themselves statisticians.

Data science is a useful term to describe bringing together these four somewhat diverse skills sets to produce valuable insights. Statistics is one of the tools a data scientist can use.

I would not say it is broader than Statistics. By your description, they overlap, since I see Statistics as:

1) techniques for data collection (sampling and census techniques, even using computers and databases; part of your #2 and #3);

2) data analysis (your #1) and inference (parametric and nonparametric);

3) results presentation (some of your #4).

I simply do not see many Statisticians who want to play this game of fashion: fast and (mostly for business) crude but useful results.

I took stats classes without majoring in it, and there was no presentation or database management, only mathematical proofs and example problems.

I’m sure statisticians deal with plenty of the above, but I find it hard to believe that the major stats journals publish pieces on database management or flashy graphing design.

While I’m an “outsider”, I think the person who said the data analyst was surprised he got so many spuriously significant results brings out something that will be increasingly evident. Statistics (and I think “statistical science” should always be used) has been around a fairly long time, and it’s not going anywhere. Note how the methods newly developed for catching fraudulent uses of data all make use of standard statistical ideas.

http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1996631

I see one of the authors is from CMU. That’s why my April 1 spoof turned out not to be such a stretch. In short, I think big data/fast data is giving rise to a lot of big doubts*, at least when it comes to science as opposed to marketing/advertising. If I were in stat, I’d find “fraud forensics” through statistics irresistible.

Of course there are a lot of meaningless numbers within “applied stat”, but the guilty parties are not statisticians; they are researchers who have been told just to run through some computer programs without understanding the concepts or methods in the least. Perhaps statisticians could speak out more against such abuses, while promoting the absolutely correct idea that these researchers be required to do more than take a “stat methods” course in their social science or other field. The new rise in noticing bad statistics in the social sciences, law, and medicine could well be a basis for an intervention**.

In any event, “2013: the year of stat” is intended to help on the naming business, I think.

* http://errorstatistics.com/2013/03/04/big-data-or-pig-data/

**When we hear people doing watered down ethics for X, for instance, we complain. Of course this means not applauding some of the silly remarks put out by popular “statisticians” like Nate Silver.

That certainly is a new angle to look at it from… So simulations, optimization, and convex optimization are not Statistics, and not Mathematics either?

Based on linear and nonlinear systems of equations, widely used in Multivariate Analysis and Numerical Analysis, these techniques must have unknowingly morphed into Data Analysis, which is where Applied Statisticians start doing Statistics. Like it or not, they love data. I do, and I have a fairly good background in computers, not just the stat software.

Next we will have to learn that the Measure-Theoretic Approach to Probability is not Stat either and belongs to this Data Science.

Sorry, professional Statisticians are slightly different from those computing the average NFL scores. Call them the Real Statisticians.

They also get paid more than Ph.D. Statisticians. That reminds me of a statement made to me years ago that got ingrained in my psyche:

“——— Ph.D.s come dime a dozen”

The blank part was a little off color , so I left it blank.

Larry: “No one is listening”? I don’t know what department you are in, but you should leave it.

As the founding director of the new Center for Data Science at NYU and the instigator of a new MS in Data Science, I thought I might chime in.

Data science is about extracting knowledge from data, particularly when it is done automatically with computers and (possibly) large datasets.

The underlying methods of data science do not come only from Statistics, but also from Computer Science (machine learning) and from some areas of Applied Mathematics (large-scale linear algebra, scientific computing, optimization, …).

Why call it “Data Science”? Why not “Statistics”, “Machine Learning”, or “Applied Mathematics”?

As Larry has pointed out in one of his previous posts, the boundary between statistics and machine learning has essentially disappeared and we should all feel good about this.

Each field could claim to have absorbed a piece of the other. But as boundaries between stats, CS, and applied math are being redrawn, we are experiencing the emergence of a new field: Statistics + Computation + the relevant bits of Mathematics.

We could call it “Statistics” but our computer scientists friends would not like it.

We could call it “Machine Learning” but our statistician friends would not like it.

We could call it “Applied Mathematics” but that wouldn’t reflect reality.

We could call it “Computational Statistics”, but that would sound like the intersection of statistics and machine learning, not the union thereof.

We could call it “Statistical Computing” but that wouldn’t sound very inclusive either.

What I like about “Data Science” is that everyone can feel at home. Also, it means exactly what it says.

Think about the situation of Computer Science in the 1960s. There was no such thing as computer science. To some it was just a branch of mathematics, to others it was a branch of electrical engineering. Yet, as computers started gaining importance, “computer science” became too big to fit inside EE or Math. The boundaries were redrawn, and a new discipline emerged. Now every good university has a computer science department.

Since the data deluge is here to stay, and since science, medicine, business and government rely increasingly on the automatic extraction of knowledge from data, Data Science is becoming important enough to cause the boundaries between disciplines to be redrawn.

This is a sign of unmitigated success for our field(s).

Larry: the separate pot of government money for statistics may never happen, but it might happen for data science.

I feel much the same way. And I agree with your solutions: revamp statistics training and hire at the Stat|CS interface.

Regarding training: if we add more on software development and data management, we’ll need to subtract something (say, measure-theory-level math stat). I would argue for flexibility, to allow more diversity in statistical science.

The tricky part in hiring (and retention) is getting our more classical colleagues to acknowledge the value in work outside the Annals and JASA.

First, I don’t think it is really important as long as the “science” in Data Science includes (inductive) logic as that is the most purposeful contribution of the Statistics discipline.

Unfortunately that is very hard to market, given it’s so poorly explained even amongst statisticians and likely well understood by few.

Now I also don’t think it is mostly a marketing problem but rather a product delivery problem with regard to applying statistics to others’ purposeful activities (as opposed to thinking about applying statistics in the abstract or in demonstration projects).

My own convenience sample of the work of (masters and Ph.D.) statisticians in applying statistics to others’ purposeful activities suggests a very substandard product: ignoring biases, computing the wrong standard errors, not stating critical assumptions but checking less important assumptions with hopeless power.

Now this is largely driven by a pressure to the lowest standard (as some have pointed out). Some instances of this: a consulting statistician deciding to back off on addressing multiplicity because other statisticians at her university did not do so, and folks were avoiding her to work with others; another at a Stat Dept consulting service who did stepwise regression because they needed something they could bill for in a couple of hours; and another who made stepwise regression the standard in their medical school because it maximised publications with a minimum of time (both of the latter were well aware of Rubin’s work and how it underscored the damage naive stepwise regression does); etc.

Does anyone know of a random sample of the usual work of statisticians that has been thoroughly evaluated?

And my favourite example http://andrewgelman.com/2010/04/15/when_engineers/

It’s easy to justify making a new name because there is much upside associated with it, and virtually no downside. It brings new attention and more money, and it projects a sense of pioneering, groundbreaking spirit. In the “real” world, it’s called marketing. Not that there is anything wrong with that.

I tend to support the concept of what Dr. Yann LeCun is describing. This is the dream department for the future that I was commenting on earlier. I am so happy that some are thinking about it and implementing it. Measure Theory does not have to be included in there; I don’t mind. The time has come to break down the barriers between CS, Stat, and Math for the students. For mature professionals, those disciplines are already seamlessly integrated into each other.

About the name, again: what’s in a name, if the students learn the needed subjects properly?

I recognize of course that naming a newborn has always been an issue hard to resolve to everyone’s satisfaction in the family.

There are ASA editorial committees for broad disciplines, who are supposed to be evaluating the papers intended for publication. Can there be an “oversight committee” to evaluate the research projects? Who knows?

Those are evaluated by the experts, I thought.

Interestingly, here is another news item this morning…

http://www.wired.com/wiredscience/2013/04/brain-stats/

I rest my case.

I think what happens is anyone who has only a cursory understanding of statistics is primed to understand very well how to quantify sampling error. And in Big Data, well, sampling error is one of the smallest sources of uncertainty. “Big Data” often exists in the world not of science, but of policy and decision making. It’s great when that is based on true physics-level science, but that is rare. A lot of the “pretty graphs” are low-hanging fruit for people who honestly don’t even know how to do simple EDA on their data sets for the kinds of questions they want to answer.

It takes a long time to get the “right” confidence level when you cannot employ an existing toolkit and you are trying to generalize to future, brand-new people or units or innovations. That is harder than accounting just for sampling error. I’ve helped a lot of people by giving them a good way to look at their data. The more complex the data, the harder it is to integrate the more precise inference from stats (even nonparametric stats, because that still depends on, e.g., the structure of functional spaces), because the hard thing is figuring out how to apply the stuff coming out of the more leisurely world of academics in these decision- and policy-based environments. So, lots of coaching on reading and interpreting results coming out of CS, ML, and Statistics, and on figuring out how to quickly prototype, evaluate, and adapt the methods that do produce confidence intervals for “data sets somewhat but not entirely like your own.”

Initially they took an SVM and marketed it… now they are taking the whole field and marketing it… as simple as that.

Joel, do you mean to say that SVM came out of Statistics and was somehow marketed by other people?

SVM came out of the office next to mine at Bell Labs. Our department was originally a physics department that turned itself into a neural net department and eventually into a machine learning and image processing department.

Vapnik does not see himself as a statistician but as a mathematician (well, he actually sees himself as a “natural scientist”, but that’s another story).

Of the original co-authors most associated with SVMs and kernel machines, Guyon and Boser are engineers, Cortes is a physicist turned machine learner, Burges is a physicist turned software engineer turned machine learner. Schölkopf is also a (bio)physicist turned machine learner.

Every field is enriched when it makes contact with another field. Sometimes one field absorbs another, but sometimes a whole new field emerges from the interaction. This is not a territorial war. It’s an opportunity to join forces to solve important problems. Statistics needs computation and optimization (and other areas of applied math) as much as machine learning needs statistics and applied math needs the problems of data science.

I agree with Yann.

We should view this as an opportunity.

Larry

why is the concept of Standard Error so hard for so many to understand? I always wondered about it.

An estimate of the measure of variation of a very large collection of ‘things’, from a sample taken from those ‘things’.

In a statistically correct way.

Putting it in simple terms, of course.
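That definition can be made concrete with a tiny sketch (the helper function and simulated sample below are my own, purely illustrative, not from any commenter's code):

```python
import math
import random

def standard_error(sample):
    """Standard error of the sample mean: s / sqrt(n),
    where s is the (unbiased) sample standard deviation."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return math.sqrt(variance / n)

# A sample of n = 400 draws from a population with sigma = 2:
# the true standard error of the mean is 2 / sqrt(400) = 0.1.
random.seed(0)
sample = [random.gauss(10, 2) for _ in range(400)]
print(standard_error(sample))
```

The estimate lands near 0.1: it measures how far the sample mean is likely to be from the population mean, not the spread of the individual ‘things’ themselves.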

I find that Standard Error is the only conception of variability most non-statisticians know they should be on the lookout for. Except usually they are thinking of it in reciprocal terms; that is, I get a lot of co-workers coming to me asking, “What sample size do I need for significant results?” Most times, people cannot describe the unit that they are sampling, if I ask, and then they go on to say, “I have this data set I’m collecting from X”, which is not a result of sampling at all, in which they are not performing any kind of experiment or selection, and the kinds of generalizations they want to make are extremely difficult to capture in some kind of model wherein the residual variation would be meaningful. It has always been much more difficult to quantify uncertainty due to confounding, bias, and model misrepresentation.

I think all these points are valid. I’d like to bring up something a little different. That is that the overwhelming majority of these large datasets are observational data sets. Even “designed” case-control genomics experiments are a form of observational data. And yet, people get confused and feel disillusioned when “statistically significant results” don’t replicate.

I’m not anti-NHST in general, but I do think they are often problematic to use for observational data where the data generation process is not fully defined and/or the assumed sampling process could easily be way off. One of the primary features in the “deluge of data” being generated these days (that is not discussed nearly enough) is that it is almost all observational. Statistics is obviously far, far more than NHST, but NHST is how statistics gets used most widely. With confounding as an elephant in the room, the guarantees and advantages of answering questions with respect to a null distribution are less relevant, or worse, give people a false sense of security in their interpretation.

I think it would be useful for statistics to emphasize tools that are more useful in these contexts – e.g. (but not limited to) graphical models, estimation of effects, causal inference (and generally reasoning about causality in observational contexts), etc. I’m not against using NHST in general, but the overwhelming emphasis on NHST in introductory classes is not helping statistics to stay relevant. I’m also not saying that observational data _should_ take the place of proper experiments either (they should not), these comments pertain to the relevance of statistics to common “data science” problems which frequently arise nowadays.

By NHST I assume you mean

“null hypothesis significance testing”?

Yes, that’s correct

You are raising what is undoubtedly the worst misapplication of statistical logic (reasoning): using randomisation logic when there was no randomisation.

Now, even highly trained statistical faculty do this. Recently one claimed that the distribution of p-values from (confounded) observational studies would be uniformly distributed if the null hypothesis was true (which would/should be understood as the factor of interest having no effect).

This is largely a problem of statistical education, and it is why Judea Pearl is offering funding through the ASA to help redress this, and why Sander Greenland regularly takes various authors to task for assuming essentially no residual confounding (i.e. randomisation achieved by magic).

About 25-30 years ago, I used the EDA methods (ref. John Tukey), and these were strong tools for data analysis which drew a lot of information out of data sets. However, using those arbitrarily, without a focus on the objective, can be risky for curious non-statisticians.

I have been a member of the ASA and Sigma Xi for many, many years. I joined the ASQ during Dr. Deming’s era and became a Senior Member. I do have a Ph.D., examined and approved with appreciation by some of the top experts from schools like Berkeley. That had taken my pain away.

Now I am retired and have enjoyed this opportunity to participate in this very interesting topic.

I haven’t read the comments, but I would assume that the answer to question 1 is the same reason that you wouldn’t go to a mathematician if you wanted to build a bridge: you want a bridge built using reliable, realistic mathematical models thoroughly tested by engineers focusing on known failure mechanisms, not “nice”, “elegant” models thoroughly analysed by mathematicians focusing on the ability of people to analytically prove theorems about them.

As for question 2, if there were more of a focus on reality and testing in statistics research, instead of “elegance” and theorems, I think statisticians would be taken more seriously. I’m not saying that there’s no benefit to research for the sake of research, and I’m not saying that people wouldn’t benefit from a more sound statistical view on things, but you can’t expect people to then come to those same researchers when there are real-world problems to solve.

There are lots of statisticians who do applied work.

In fact, I would say theory is a minority.

Let me describe something from my own experience working as a Statistician in an industry full of engineers, maintaining my role of statistician under the job title of Senior Engineer (the reason being that there was no approved position for a Statistician). Working while keeping my identity as a Statistician was like driving down a one-way street in the opposite direction. My job was instructing the engineers and others in statistical methods and showing applications. For that I had to learn the processes, and the engineers had to learn Stat. They resisted learning too much stat, because they had to complete their assignments on time and accurately. For the same reason, I had to limit my questions and queries about the processes. In other words, the freedom we enjoy in a classroom is not the same when instructing within an industry. There are rules and regulations, customer demands, and supplier schedules to be followed. I had to keep in mind that I was not making statisticians out of engineers; I was not there to do that. Also, I could not lose my own identity as an applied statistician.

All I can say, so I don’t sound like a preacher, is that teaching in a classroom to students yet to become engineers and teaching engineers already working are two entirely different scenarios. Later, when I returned to the classroom, I found that the students were learning things different from what they would be encountering in the field. They were totally unaware of what they would be doing once they got jobs. I am glad I saw both sides.

Again, I do not want to say this is a common thing, but perhaps others too have had similar experiences. In industry, what statistics we use is very much dictated by the process one is studying. And sometimes those processes can get complicated, so that engineers have to look to an expert in Statistics to resolve them. Selecting the right method and carrying it through to a statistically correct conclusion is definitely an interesting and satisfying, but stressful, journey. Most of theoretical stat, with all its glory of theorems, extensions, and addendums, does not come to the rescue.

I think the data science vs statistics distinction is really Breiman’s _Two Cultures_ http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1009213726

Basically it says that conventional statistics is primarily concerned with (estimation of) the parameters of understandable (simple, parametric) models, while the new culture is primarily concerned with prediction/inference using potentially black-box (nonparametric) models, which inevitably depends on computers. Breiman argued that focusing on predictive accuracy extended the applicability of statistical methods. He did not live to see the invention of data science, but it seems to be what he was talking about.
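Breiman's contrast can be shown with a toy sketch (the data and both fits below are invented for this illustration, not taken from his paper): fit an interpretable parametric line to nonlinear data, then compare it with a black-box nearest-neighbour predictor on held-out points.

```python
import random

random.seed(1)

# Toy data from a nonlinear truth: y = x^2 + noise
data = []
for _ in range(200):
    x = random.uniform(-2, 2)
    data.append((x, x * x + random.gauss(0, 0.1)))
train, test = data[:150], data[150:]

# Data-modeling culture: fit the parametric model y = a + b*x by least squares
n = len(train)
mx = sum(x for x, _ in train) / n
my = sum(y for _, y in train) / n
b = sum((x - mx) * (y - my) for x, y in train) / sum((x - mx) ** 2 for x, _ in train)
a = my - b * mx

def linear(x):
    return a + b * x

def nearest_neighbour(x):
    # Algorithmic-modeling culture: a black-box 1-NN predictor
    return min(train, key=lambda p: abs(p[0] - x))[1]

def mse(predict):
    # Judge both cultures by held-out predictive accuracy
    return sum((predict(x) - y) ** 2 for x, y in test) / len(test)

print(mse(linear), mse(nearest_neighbour))
```

On this deliberately nonlinear problem the black box wins on predictive accuracy, while the line remains the more interpretable model, which is exactly the trade-off Breiman was describing.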

Well, interesting points in the ongoing discussion. I really think that we do not need to argue about things like this but should instead focus more on cooperation.

I have blogged my rough ideas on http://www.philippsinger.info/?p=149 if anyone might be interested.

One tiny bit to add (as a computer scientist who works at the intersection of cs + math + stats + visualization): there’s a similar issue to the stats to cs comparison you make within cs itself. This is the algorithm expert versus an application maker. The first has to think hard about the problem (and may even leave the implementation to others); the second can “whip something up tomorrow with a pretty GUI”. Interesting problems typically require some people that are really good at the thinking and then some (possibly other) people that are really good at the implementation (whether the “thinking theory” is statistical or algorithmic).

Adding a comment: I strongly believe that having a strong theoretical background sometimes helps in getting into the depth of a problem and seeing through the problem to choose the right statistical method. It may not be a grandiose technique all the time; simple tools can be very effective too. That depends on the technique, of course.

Approaching a data analysis with a preselected tool vs. studying the data and then selecting the tool: for that, the first step has to be the Data. If Data Science can provide that “thinking” another posting pointed out, it will be beneficial to the students.

This is what I believe as an applied statistician, chronologically starting my 7th decile.

Thank you Larry for posting another thought provoking piece.

Answering 2: since the world is not going to change (and associate data science with statistics) anytime soon, perhaps it is time for statistics and statisticians’ attitudes to change. (But that’s real wishful thinking too, I know.) If we statisticians are interested in joining the coming data science revolution, we should consider leading by example.

Let’s keep computing those truthful standard errors, but faster, making approximations and stating the assumptions behind them and their implications. (If they are too big to support a particular conclusion, perhaps the data is better suited to answer a different question.) Let’s produce usable software and work out case studies ourselves the way we claim they should be done, rather than telling scientists how they should improve the analyses they worked so hard to carry out. Let’s apply for funding to CISE, where variability in data and decisions under uncertainty are very much appreciated (Aarti is quite right). […] Lots can be done to promote statistics and raise awareness.

It seems to me that we statisticians are good at demanding attention, but if it takes getting out of our comfort zone to turn things around, we’d rather not. So let’s get in the mud and show what data science done right looks like. There is plenty out there. I call it applied statistics. A lot of people’s favorite example can be found here:

http://www.amazon.com/Applied-Bayesian-Classical-Inference-Federalist/dp/0387909915/

I’m a first year PhD in a stats program, and I think statistics is really cool! I tend to agree with most of this post. I think statisticians should do the following in order to ensure that data science does not overtake statistics:

– Insist on a balanced view of the merits and defects of each data-centered discipline. It is all too common to hear statisticians characterize practitioners of newer fields as “data monkeys” who “just move data around” and “don’t care that their answers are correct”. Contributions from other fields are played down: “Machine Learning is just a new name for nonparametric statistics” and it “doesn’t have any cool applications”. This doesn’t do justice to those fields: they are much more than this. Statisticians should recognize this. But they should also point out the weaknesses of those fields, so that one has a balanced overall perspective.

– Fairly evaluate when a data-science, machine-learning, or statistical approach is best for any given problem. For instance, data science is suitable in big data applications where there is overwhelming signal. Machine learning can detect more subtle patterns. Finally, statistical inference, through confidence intervals and the like, offers the ultimate guarantees in terms of the correctness and confidence of the answers.

– Be inclusive about all the issues involved in applications of data procedures to real-life problems. Back in Fisher’s time, the only issues were experiment design and statistical analysis. Nowadays, things have changed. There are other issues: computation (storing the data, “moving it around”) and visualization are highly complex and nontrivial issues that must be figured out in any data application. If statisticians don’t do it, someone else will.

– Raise awareness about the contributions of classical statistics. Increase the visibility of the field outside of the classical consumers (sciences, engineering, business, government). Basically, try to make sure everyone knows what we are doing, and why it’s crucially important for them. Make cool documentaries about statistics.

– Improve statistics education. Lobby that statistics be included in high school curricula. Argue that it’s just as important as calculus, if not more so. Ensure that it’s presented in a cool way, such that students appreciate its importance for their lives.

– Do not cling to tradition too much. Concepts such as the null hypothesis might have been the fundamentals of statistical inference in science 80 years ago, but nowadays, they seem quite dated. Many of the biggest science projects (genomics, neuroscience) do not follow this pattern of “testing the null” anymore. Discovery and exploration, prediction and classification are much more intuitive and direct notions, and they tend to take center stage in people’s thinking as well.

– Finally, be proud and happy about our discipline :). It addresses some of the most puzzling and pressing intellectual questions of our times.

I will begin my comment by stating upfront that I am not a statistician or computer scientist, but merely a humanist fascinated by (attempting to apply??) data science in my own field. Techniques in data science, big data, data visualization, data mining, etc. are currently being actively employed in the humanities. Data visualization, in particular, is one of the hottest trends in the digital humanities. For example, in just the past few months, numerous online articles, blogposts, etc. have appeared discussing/critiquing IBM’s data visualization tool, Many Eyes, in terms of its application for analyzing language and literary texts. The result has even been a new approach–albeit a controversial one–to reading literature: distant reading (for more information, please see the New York Times article: )

Consequently, I am intrigued by the observation in the original post that “[t]he very fact that people can talk about data science without even realizing there is a field already devoted to the analysis of data — a field called statistics — is alarming.” I realize that this stated concern relates to how other *scientists* engage in the practice of your field; however, this statement could just as easily apply to the humanities…although I would add that the ability of humanists to interpret the data accurately may also be of concern.

How do you feel about *humanists* talking about data science and using its applications? What advice would you give?

I have no wish to redirect what is obviously a very productive dialogue in response to the original post (I have been reading the growing number of comments over the past few days with considerable interest). It has simply motivated me to ask how such a new direction in my own field is viewed by yours.

Thank you.

You raise an interesting point. We tend to focus on statistics applied to the sciences but, as you correctly point out, the discussion should be much broader than that.

The birth of Data Science appears to me a consequence of a demand-supply effect: there are fewer statisticians than the world needs. In addition, the “theoretical-mathematician pride” that many statisticians have precludes them from getting their hands dirty by analysing real and/or not-too-challenging data sets. This produces a series of effects, such as: (i) non-statisticians (biologists, psychologists, medical doctors, …) analysing their own data, since (and I am a faithful witness of this) many statisticians do not want to do it; (ii) the general belief that Statistics = clicking the right buttons of certain software; and (iii) the creation of groups or areas of non-statisticians aiming to fill the gap, who analyse data sets with the tools they know. Data Science is one example, but there are others, such as groups of physicists giving statistical consultancy without statistics (as paradoxical as it sounds).

hi larry,

whenever i hear the phrase “data science”, i wonder, is there some other kind?

hope you’re well.

Data science is more than statistics: it also encompasses computer science and business concepts, and it’s far more than a set of techniques and principles. I could imagine a data scientist not having a degree; this is not possible for a statistician. But the core of the issue, in my opinion, is explained below. (You might want to read my answer on my original blog at https://bit.ly/10yvocu because it contains clickable links that are not rendered on this page.)

1. I am one of the guys who contributed to the adoption of the keyword “data science”. Ironically, I’m a pure statistician (Ph.D. in statistics, 1993 – computational statistics), although I have changed a lot since 1993; I’m now an entrepreneur. The reason I tried hard to move away from being called a statistician to being called something (anything) else is the American Statistical Association: they killed the keyword “statistician”, as well as limiting the career prospects of future statisticians, by making it narrowly and almost exclusively associated with the pharmaceutical industry and small data (where most of its revenue comes from). They missed the boat – on purpose, I believe – of the new statistical revolution that came along with big data over the last 15 years.

2. Statisticians should be very familiar with computer science, big data, and software: 10 billion rows with 10,000 variables should not scare a true statistician. On the cloud (or even on my laptop as streaming data), it gets processed real fast. The first step is data reduction, but even if you must keep all observations and variables, it is still feasible. And good computer scientists also produce confidence intervals – you don’t need to be a statistician for that, just use the First AnalyticBridge Theorem (if you are curious, check out the Second AnalyticBridge Theorem). The distinction between computer scientist and statistician has been getting thinner and fuzzier over the years. The things you did not learn at school (in statistics classes), you can still learn online.

I agree about the ASA. They are a large bureaucratic entity with a very old-fashioned view of statistics.

Two thoughts:

1) I view much of “statistics” as model fitting (and hypothesis testing of said models) given data. This definition does not encompass the entirety of my view of “data science”. For instance, research on methods for interactive visualization is IMO a form of “data science” but has little to do with conventional “statistics”.

2) When I think of the “next level” challenges in ML research, I come up with the following (non-exhaustive) list:

— Learning hidden structure & associations from unstructured data

— Feature induction via large scale unsupervised learning (e.g., deep learning)

— Algorithms with consistency and efficiency guarantees for distributed/parallel learning/inference

— Machine learning + market dynamics (e.g., auctions w/ uncertainty)

— Interactive machine learning with humans in the loop (e.g., interactive clustering or interactive feature induction)

— Structured crowdsourcing (e.g., at what level of granularity do we design tasks for human workers?)

— Streaming algorithms

— How to store/retrieve data in a way that maximizes the trade-off between various notions of accuracy vs speed

From where I stand, almost none of these challenges lies exclusively within the purview of conventional “statistics”, but I don’t see the definition of “statistics” evolving to accommodate them. On the other hand, “machine learning” people have been embracing all of these problems and, in some sense, have been expanding the definition of “machine learning” by doing so. Furthermore, while the boundary between ML and statistics has blurred (and more or less disappeared), the boundaries between ML and systems & HCI are also starting to blur (since a large fraction of research in those areas is devoted to data-intensive applications).

That’s a good point. ML people are enthusiastic about jumping into new problems. Statistics, on the whole, tends to be slow about embracing new problem areas.

I do not understand the need for a new name… When a person ages, he/she evolves with the newly acquired knowledge and experiences that he/she undergoes… one simply does not change his/her name every year to reflect that!

Similarly, when a field grows it evolves, with new viewpoints and new problems motivated by scientific demands… analyzing data was called statistics… but then people were not happy and called it machine learning… now at least let’s stick to one of these two and understand that it’s just the same old field that has evolved/is evolving, and not a new field being born…

Dear sir,

Your blog really inspires me. I am a graduate student in statistics from Mumbai, India. I have also done SAS certification in Base Programming and BI. After reading your blog I am thinking of starting to teach SAS in the college from which I graduated, and giving statistics students knowledge of SAS programming before graduation, which will help them work with today’s analytical technology.

Please contact me when you are free. I need your help with this work, because I have knowledge of programming but I am just a graduate. I would like to teach high-end statistics to students. For that I need full knowledge of the application of statistics.

Thanks and regards,

Sumit Parkar

## 3 Trackbacks

[…] Data Science: The End of Statistics? […]

[…] [3] https://normaldeviate.wordpress.com/2013/04/13/data-science-the-end-of-statistics/ […]

[…] Larry Wasserman asks: will data science be the end of statistics? The media is awash with talk of big data and data mining. People can now talk about data without needing to know that there is a discipline in this world called “statistics”. Is “data analysis” without error analysis and without hypothesis testing the end of statistics? Should statistics belong to computer science or to mathematics? Why is statistics being marginalized? What should we do about it? […]