What’s the difference between Statistics and Machine Learning?

Ever since I approached Machine Learning during my Ph.D. in Statistics, I have tried to compare classical statistical approaches with machine learning ones. After all, both are fundamentally based on data and both try to extract some kind of knowledge from it, so where exactly is the difference? What is inherently different between the two fields?

To answer these questions let’s start from the very beginning: the definitions.

Statistics is a traditional field, broadly defined as a branch of mathematics dealing with data collection, organization, analysis, interpretation, and presentation (ref).

Machine Learning is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” (e.g., progressively improve performance on a specific task) from data without being explicitly programmed (ref).

Since machine learning sometimes uses statistical techniques, it can easily be mistaken for rebranded statistics. But the way statisticians use these techniques differs from the way machine learning scientists use them.

Leo Breiman, in a paper called “Statistical Modeling: The Two Cultures”, gives a very thoughtful description of this difference, describing two approaches to data modeling:

The Data Model (used by statisticians)
The Algorithmic Model (used by machine learning scientists)

Naturally, both models can be used to understand data and make predictions. But the two approaches do something fundamentally different. The “Data Model” used by statisticians makes upfront assumptions about the process that generated the data, while the “Algorithmic Model” used by machine learning scientists tends to ignore that process (considering it unknowable or uninteresting) and instead focuses on modeling only the observed relations in the data.

Breiman uses the example of a black box with inputs and outputs.

“The analysis in [the data modeling] culture starts with assuming a stochastic data model for the inside of the black box … The values of the parameters are estimated from the data and the model then used for information and/or prediction.” Statisticians validate models using goodness-of-fit tests and residual examination, and the goal of these analyses is to check whether the data allow rejecting the initial hypotheses. If the result is rejection, then the model is considered wrong; otherwise, the only conclusion is that “the data at hand cannot disprove the correctness of the model” (which is commonly misunderstood as “the model correctly describes the data-generating process”).
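As a rough illustration of the data modeling workflow described above, the sketch below assumes a linear model with Gaussian noise, estimates its parameters by least squares, and then examines the residuals. The simulated data, the function names, and the simple curvature check are all hypothetical choices made for this example; they are not taken from Breiman’s paper.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates for the assumed model y = intercept + slope * x + noise."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys, intercept, slope):
    """Observed response minus the response predicted by the fitted line."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def correlation(us, vs):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    var_u = sum((u - mu) ** 2 for u in us)
    var_v = sum((v - mv) ** 2 for v in vs)
    return cov / (var_u * var_v) ** 0.5


def curvature_check(xs, ys):
    """Correlate the residuals of the linear fit with squared centred x.

    A value near 0 is compatible with the linear assumption; a value near
    +1 or -1 suggests the assumed model misses a curved trend.
    """
    intercept, slope = fit_line(xs, ys)
    res = residuals(xs, ys, intercept, slope)
    mean_x = sum(xs) / len(xs)
    return correlation(res, [(x - mean_x) ** 2 for x in xs])


# Hypothetical data: one linear and one quadratic generating process.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys_linear = [2.0 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]
ys_curved = [2.0 + 0.5 * x * x + rng.gauss(0, 0.3) for x in xs]

intercept, slope = fit_line(xs, ys_linear)
print(f"estimated intercept={intercept:.2f}, slope={slope:.2f}")
print(f"curvature check, linear data: {curvature_check(xs, ys_linear):+.2f}")
print(f"curvature check, curved data: {curvature_check(xs, ys_curved):+.2f}")
```

A check close to zero only says that this particular diagnostic found nothing against the linear model. It does not show that the linear model is the process that generated the data.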

On the other hand, the analysis in the algorithmic modeling culture “considers the inside of the box complex and unknown. Their approach is to find a function f(x) — an algorithm that operates on x to predict the responses y.” Models are validated by their performance on unseen data.
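The algorithmic approach can be sketched in a similar way. The example below makes no assumption about how the data were generated: it uses a k-nearest-neighbours predictor as f(x) and judges it only by its error on held-out data. The choice of k-nearest neighbours, the simulated data, and the constant baseline are assumptions made purely for illustration.

```python
import math
import random


def knn_predict(train, x, k=5):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(predict, data):
    """Average squared prediction error of `predict` over (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical data from a process the algorithm never gets to see.
rng = random.Random(1)
data = []
for _ in range(300):
    x = rng.uniform(0, 6)
    data.append((x, math.sin(x) + rng.gauss(0, 0.1)))

# The test set plays the role of unseen data.
train, test = data[:200], data[200:]

# Baseline: always predict the mean response of the training set.
train_mean = sum(y for _, y in train) / len(train)

knn_mse = mean_squared_error(lambda x: knn_predict(train, x), test)
baseline_mse = mean_squared_error(lambda x: train_mean, test)

print(f"k-NN test MSE:     {knn_mse:.4f}")
print(f"baseline test MSE: {baseline_mse:.4f}")
```

Nothing here is checked against a goodness-of-fit test. The only evidence in favour of the predictor is that its error on the held-out points is lower than that of the baseline.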

Breiman goes on to explain the thinking of the data modeling culture, practiced by statisticians:

Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn.

In other words, classical statistical approaches that fit a model based on assumptions about how the data were generated can lead to a “multiplicity of models”: many different models may fit the data equally well without necessarily reflecting the true relationships between inputs and outputs.

This is not what today’s machine learning practitioners would call “data-driven”.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis.

To be completely fair, the statistical community is also aware of this problem and addresses it with so-called non-parametric approaches. In classical statistics, non-parametric approaches avoid assuming a specific parametric family for the data distribution, while Bayesian non-parametric approaches place very broad priors on it. This allows more flexibility while still keeping the advantages of a model specification (at the cost of a significant computational burden).
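One classical example of a non-parametric tool is the permutation test, which compares two groups without assuming any particular distribution for the measurements. The sketch below is a minimal Monte Carlo version; the group values, the function names, and the number of permutations are hypothetical choices for this example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_permutations=2000, seed=0):
    """Approximate two-sided p-value for a difference in group means.

    Group labels are reshuffled at random; the p-value is the share of
    reshuffles whose absolute mean difference is at least as large as the
    observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_permutations + 1)


# Hypothetical measurements for two groups.
group_a = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 4.7, 5.4]
group_b = [6.1, 5.9, 6.3, 6.0, 6.2, 5.8, 6.4, 6.1]

p_value = permutation_test(group_a, group_b)
print(f"approximate p-value: {p_value:.4f}")
```

The only assumption used is that, if the two groups did not differ, the group labels would be exchangeable.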

[The machine learning] community consisted of young computer scientists, physicists, and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Machine learning grew out of a very different mindset than classical statistics: the focus was on discovering a function that maps inputs to outputs in order to make predictions. The data-generating process is ultimately unknown, and the interest, in this case, is only in finding a function that can reliably map new input data to predictions.

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x’s that go in and a subsequent set of y’s that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

Summary

Machine learning lets the data (and trial and error) speak about the relation between inputs and outputs in a complex system, while classical statisticians believe they can represent this mechanism through a well-specified model.

Despite these differences, these two fields can benefit from each other: examples of this fruitful interaction are Bayesian hyper-parameter optimization techniques that emerged in the last few years. But these will be the topic of another post.
