Welcome to my blog, which will discuss topics in Statistics and Machine Learning. Some posts will be technical and others will be non-technical. Since this blog is about topics in both Statistics and Machine Learning, perhaps I should address the question: What is the difference between these two fields?
The short answer is: None. They are both concerned with the same question: how do we learn from data?
But a more nuanced view reveals that there are differences due to historical and sociological reasons. Statistics is an older field than Machine Learning (but young compared to Math, Physics, etc.). Thus, ideas about collecting and analyzing data in Statistics are rooted in the times before computers even existed. Of course, the field has adapted as times have changed, but history matters, and the result is that the way statisticians think, teach, approach problems and choose research topics is often different from the way their colleagues in Machine Learning do. I am fortunate to be at an institution (Carnegie Mellon) which is active in both (and I have appointments in both departments), so I get to see the similarities and differences.
If I had to summarize the main difference between the two fields I would say:
Statistics emphasizes formal statistical inference (confidence intervals, hypothesis tests, optimal estimators) in low dimensional problems.
Machine Learning emphasizes high dimensional prediction problems.
But this is a gross over-simplification. Perhaps it is better to list some topics that receive more attention from one field than from the other. For example:
Statistics: survival analysis, spatial analysis, multiple testing, minimax theory, deconvolution, semiparametric inference, bootstrapping, time series.
Machine Learning: online learning, semisupervised learning, manifold learning, active learning, boosting.
But the differences become blurrier all the time. Check out two flagship journals:
The overlap in topics is striking. And many topics get started in one field and then are developed further in the other. For example, Reproducing Kernel Hilbert Space (RKHS) methods are hot in Machine Learning but they began in Statistics (thanks to Manny Parzen and Grace Wahba). Similarly, much of online learning has its roots in the work of the statisticians David Blackwell and Jim Hannan. And of course there are topics that are highly active in both areas such as concentration of measure, sparsity and convex optimization.
There are also differences in terminology. Here are some examples:
Statistics        Machine Learning
Data point        Example/Instance
Regression        Supervised Learning
Classification    Supervised Learning
and of course:
Statisticians use R.
Machine Learners use Matlab.
Overall, the two fields are blending together more and more, and I think this is a good thing.