## What Every Statistician Should Know About Computer Science (and Vice-Versa)

Statisticians are woefully ignorant about computer science (CS).

And computer scientists are woefully ignorant about statistics.

O.k. I am exaggerating. Nonetheless, it is worth asking: what important concepts should every statistician know from computer science?

I have asked several friends from CS for a list of the top three things from CS that statisticians should know. While there wasn’t complete agreement, here are the three that came up:

1. Computational complexity classes. In particular, every statistician should understand what P and NP mean (and why you get \$1,000,000 from the Clay Mathematics Institute if you prove that ${P\neq NP}$.). Understanding the fact that searching through all submodels in a variable selection problem is NP hard will convince you that solving a convex relaxation (a.k.a. the lasso) is a really good idea.

2. Estimating computing time. In CS and machine learning, it is expected that one will estimate the number of operations needed to carry out an algorithm. You write a paper with an algorithm and then say something like: this will take ${O(N d^3)}$ computing time. In statistics, we are pretty loose about this. Some do it, some don’t.

3. Hashing. One colleague assured me that hashing is critical if statisticians really want to be part of the “big data” world. (Last week, Siva was kind enough to give me a quick tutorial on hashing.)

Is there anything I am missing? What would you put in the top three?

And how about the other way around? For a computer scientist with an interest in statistical learning, what are the three most important concepts that they should know from statistics?

Here are my nominations:

1. Concentration of measure.

2. Maximum likelihood.

3. Minimax theory.

I think this list is not what most statisticians would come up with.

What would you put on the list?

1. Posted August 24, 2012 at 8:35 pm | Permalink

2. Posted August 24, 2012 at 9:00 pm | Permalink

I think both computer scientists and some statisticians are unfamiliar with basic ideas in what used to be called numerical analysis, and especially methods of numerical linear algebra. I’d consider the first three chapeters of Golub and Van Loan’s MATRIX COMPUTATIONS a suitable summary. These might be followed by texts like Trefethen and Bau, NUMERICAL LINEAR ALGEBRA, and Bjorck, NUMERICAL METHODS FOR LEAST SQUARES PROBLEMS.

3. Posted August 25, 2012 at 1:36 am | Permalink

I hope you don’t mind if I substitute software developers for computer scientists.

There is a famous Fisher quotation “In relation to any experiment we may speak of this hypothesis as the “null hypothesis,” and it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.”

In software development the null hypothesis is, of course, that ones software works as intended. In my opinion software developers would benefit greatly from the humility implied in Fisher’s statement, as would their software benefit from being subjected to the same rigour of testing as, say, the theory of quantum electrodynamics.

If and when software developers starting doing that, statistics will be applicable to software development in the way it has been applicable to experimental science for over a century.

4. Posted August 25, 2012 at 11:19 am | Permalink

I’d be happy if they understood confidence.

• JohnDoe
Posted August 25, 2012 at 6:19 pm | Permalink

+1 for this – perhaps broadening/adapting it to cover knowing and understanding measures of significance (e.g. p-values and q-values). I’d also add knowing the difference between prediction and inference. I like “Concentration of measure” …but might call it something else for a CS audience.

5. Claus
Posted August 25, 2012 at 1:18 pm | Permalink

The difference between “question-data-analysis” and “data-analysis-question” (as suggested by D.R. Cox). Also, the relationship between “prediction” and “explanation”.

• Rhiannon Weaver
Posted August 28, 2012 at 12:27 pm | Permalink

Ah, can you send me the link for that (which Cox paper?). Thanks!

• Claus
Posted August 30, 2012 at 3:29 am | Permalink

It was a comment in his book “Principles of Stat. Inference”, CUP, 2006. Look for “data mining” in the index to find it.

6. Keith O'Rourke
Posted August 25, 2012 at 1:24 pm | Permalink

For CS rather than give them (certain) fish, teach them to fish computationally. The germ of the idea is here http://andrewgelman.com/2010/06/course_proposal/ , but anticipating computational resources 5-10 years from now (Don Rubin suggests 25 years) you walk them through Rubin’s 1984 conceptual model of Bayes (now called ABC) and use Mike Evans work on relative surprise inference to use (simulated approximations of) posterior/prior as a convenient (prior invariant) multiple of the likelihood.
Now they can manipulate (very important for real learning) and directly view almost all of statistics (i.e. investigate and “see”, sufficiency, ancillarity, completeness?, convergence to quadratic log likelihood, Neyman-Scott inconsistency, likelihood factorization, (silly or at least distracting) maximum likelihood properties, etc. etc.. And they could likely do this on a Saturday afternoon in the backyard while drinking beer (but not Tequila).
Two things I do need to learn – 1. What cannot be demonstrated this way and 2. How does one interest CS into learning about this approach. (As me teenage son commented on my Bayes theorem demonstration using ABC – “Should there not be a formula that does a better job” [than that identified subset of the prior that lead to the same observations]. The answer is no (apart from unimportant simulation error) but that must seem wrong to many if not most.

7. kenmccue
Posted August 25, 2012 at 5:51 pm | Permalink

Having never heard of concentration of measure until your post, just glancing at it, it appears that the central limit theorem is subsumed under it. Is there a work which develops concentration for a more traditionally trained statistician?

8. Posted August 25, 2012 at 6:20 pm | Permalink

For computer scientists, I favor the concept of a data-generating process; the separation of model, principle of inference (MLE, Bayes, etc.), and computation; and the fact that, to quote Gelman, “the difference between “signiﬁcant” and “not signiﬁcant” is not itself statistically signiﬁcant”. The latter is particularly relevant with variable selection via convex relaxations; seeing a coefficient shrunk to zero in one setting and nonzero in another does not, on its own, provide strong evidence for a change between the two settings. I’ve encountered this quite a bit recently.

For statisticians, I completely agree with computational complexity and estimating computing time. I would add the uses and limits of parallelization (e.g. Amdahl’s law) and the hierarchy of expenses for computational operations (CPU << memory access << disk << network).

9. Joshua
Posted August 26, 2012 at 5:41 am | Permalink

10. Christian Hennig
Posted August 26, 2012 at 6:17 pm | Permalink

I’d nominate modelling of real world phenomena and robustness for Stats>CS. And I like your CS>Stats list because it looks so nicely manageable (which makes me a bit suspicious that there isn’t much more important stuff around I don’t have a clue about…).

11. Oliver
Posted August 27, 2012 at 10:54 am | Permalink

Statisticians should know more about indexing in computer science. I keep bumping into the limits of R because of the way arrays are indexed. A data frame indexed by a 32-bit int is not as large as you would think. Plus I would gladly trade R’s parallel capability for a well indexed matrix. On the other hand, I think statisticians knew a lot about external memory algorithms in the 1970’s that now everyone has forgotten.

12. pdh
Posted August 28, 2012 at 6:16 pm | Permalink

One aspect of Statistics that should be emphasized more in the Machine Learning world is experimental design, and how the design should guide the analysis. For example, ignoring nesting in experimental designs or surveys can lead to grossly inaccurate inference. Often there are important features of the experiment or survey (nesting, censoring, truncation, missingness) that get lost as the data gets reduced to a list or table of numbers, absent any context.

13. Posted September 3, 2012 at 7:57 pm | Permalink

For statistics that CS folk should know, The emphasis should be more on the techniques to model the data. This includes estimation, which almost always requires some optimization or integral calculation.

For CS stuff stat folk should know, a solid data structures course (maybe yearlong sequence) and a hardware and OS sequence so stats folk understand how the hardware and software work at a more basic level. As a grad student in stats I know I wish I knew a lot more about hardware and software and how they integrate so I could speed up a lot of things I want to experiment implementing.

Fortunately, with Machine Learning tying the two together there is a great confluence of work that current and future generations will -hopefully- reap the rewards.

14. Posted September 8, 2012 at 2:15 pm | Permalink

Personal Computer Technology is still one of the most analyzed areas. Due to its need universities have started Computer Science online applications. As with the introduction of online education people have signed up in almost so many different areas but still computer science online level applications are popular. The Personal Computer science system concentrates mainly on the application side. Development in different dialects is taught in this area. Languages like BC-plus plus, BC, Pascal and Coffee. It also instructs computer programming concepts, computer programmingand operating application. In computer science applications a student must have a very sound understanding about all the basic applications, architectures and coding.