General Overview
Recent advances in machine learning (ML) have sparked debate about how well traditional statistical intuitions, particularly the bias-variance tradeoff, hold up for large-scale, complex models. This article highlights the key differences between historical statistical practice, which focused on fixed designs and in-sample prediction, and modern ML, which prioritizes generalization to unseen data. Specifically, it explores how switching from fixed to random designs can significantly alter the classic bias-variance tradeoff and lead to phenomena like double descent and benign overfitting.

Fixed vs. Random Design Settings
Classical statistics often focuses on fixed-design settings, where the training and test inputs are assumed to be the same and only the outcomes vary. In contrast, modern ML operates in random-design settings, where test inputs are freshly sampled from the same distribution as the training data, making out-of-sample prediction accuracy the central concern. This move from fixed to random designs changes the expected bias and variance of a model's predictions and challenges interpretations built on the fixed-design view.
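
As a rough illustration of the distinction (a minimal sketch of my own, not taken from the article), the snippet below fits ordinary least squares repeatedly and measures error in two ways: at the same fixed training inputs with only the noise resampled (fixed design), and at freshly drawn test inputs (random design). The constants and variable names are arbitrary choices for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 10, 1.0
beta = rng.normal(size=p)
X_train = rng.normal(size=(n, p))   # inputs held fixed across repetitions
f_train = X_train @ beta            # true regression function at the training inputs

in_sample, out_sample = [], []
for _ in range(200):
    # Fixed design: same inputs every time, only the noise in y is redrawn.
    y_train = f_train + sigma * rng.normal(size=n)
    beta_hat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]
    in_sample.append(np.mean((X_train @ beta_hat - f_train) ** 2))

    # Random design: evaluate on freshly sampled inputs from the same distribution.
    X_test = rng.normal(size=(1000, p))
    out_sample.append(np.mean((X_test @ beta_hat - X_test @ beta) ** 2))

print("mean in-sample (fixed-design) error:     ", np.mean(in_sample))
print("mean out-of-sample (random-design) error:", np.mean(out_sample))
```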

Bias and Variance in Machine Learning
One of the fundamental principles taught in ML is the bias-variance tradeoff, which says that as model complexity increases, variance goes up and bias goes down. The article argues, however, that this tradeoff does not always hold, particularly in random-design settings. For example, with a simple estimator like k-Nearest Neighbors, bias and variance can both decrease as complexity is reduced, undermining the clear-cut tradeoff posited by classical statistical reasoning.
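
To make this concrete, here is a small Monte Carlo sketch (my own illustration, not the article's experiment): it estimates the squared bias and variance of k-NN regression under a random design by redrawing the training set many times and averaging over fresh test inputs. The true function, noise level, and values of k are arbitrary assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
f = lambda x: np.sin(4 * x)            # true regression function (toy choice)
n, sigma, reps = 100, 0.5, 300
x_test = rng.uniform(size=(500, 1))    # random-design test inputs
f_test = f(x_test).ravel()

for k in [1, 5, 20, 50]:
    preds = np.empty((reps, len(x_test)))
    for r in range(reps):
        # Each repetition draws a *new* training set: inputs and noise both resampled.
        x_tr = rng.uniform(size=(n, 1))
        y_tr = f(x_tr).ravel() + sigma * rng.normal(size=n)
        model = KNeighborsRegressor(n_neighbors=k).fit(x_tr, y_tr)
        preds[r] = model.predict(x_test)
    bias2 = np.mean((preds.mean(axis=0) - f_test) ** 2)   # squared bias, averaged over test inputs
    var = preds.var(axis=0).mean()                        # variance over training-set draws
    print(f"k={k:>2}  bias^2={bias2:.4f}  variance={var:.4f}")
```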

Double Descent: A New Phenomenon
The article then turns to double descent, a phenomenon in which, contrary to traditional expectations, prediction error first decreases, then increases, and then decreases again as model complexity grows, with the second descent occurring once the number of parameters exceeds the number of training examples. Although double descent appears to violate classical intuitions, the behavior arises primarily from the random-design setting. The key insight is that the traditional U-shaped error curve comes from the fixed-design perspective, whereas double descent reflects random-design dynamics, particularly when models are heavily over-parameterized.
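
The sketch below (a toy example of my own, not the article's setup) shows one common way to reproduce the shape of the curve: random ReLU features with a minimum-norm least-squares fit, sweeping the number of features p past the number of training points n and recording test error. In this setting the error typically peaks near p = n and descends again as p grows further.

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 40, 0.3, 50
x_grid = np.linspace(-1, 1, 400)
f = lambda x: np.sin(np.pi * x)

def relu_features(x, W, b):
    # Random first layer is held fixed; only the output weights are fitted.
    return np.maximum(0.0, np.outer(x, W) + b)

for p in [5, 10, 20, 35, 40, 45, 80, 200, 800]:
    errs = []
    for _ in range(reps):
        W, b = rng.normal(size=p), rng.normal(size=p)
        x_tr = rng.uniform(-1, 1, size=n)
        y_tr = f(x_tr) + sigma * rng.normal(size=n)
        Phi = relu_features(x_tr, W, b)
        theta = np.linalg.pinv(Phi) @ y_tr        # minimum-norm least squares; interpolates once p >= n
        pred = relu_features(x_grid, W, b) @ theta
        errs.append(np.mean((pred - f(x_grid)) ** 2))
    print(f"p={p:>4}  mean test MSE={np.mean(errs):.4f}")
```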

Benign Overfitting and Why It Can Happen
Another surprising phenomenon in ML is benign overfitting, where models fit their training data perfectly yet still generalize well to unseen data. Conventional statistics would suggest that a model with zero training error is badly overfitted, but the article argues that benign overfitting, or more accurately “benign interpolation,” can occur when the model’s behavior at new test points is much smoother than its behavior at the training points themselves. Methods such as neural networks and random forests can behave this way, interpolating the training data while still producing stable predictions on new inputs.
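
A minimal sketch of the idea (my own construction, not the article's method): start from an ordinary kernel smoother, then add extremely narrow "spikes" at the training points so the fit reproduces the training labels almost exactly. Because the spikes are essentially invisible at new test points, the interpolating fit generalizes about as well as the smoother it was built from. All function names, bandwidths, and the spike width are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
n, sigma = 60, 0.4
x_tr = np.sort(rng.uniform(size=n))
y_tr = f(x_tr) + sigma * rng.normal(size=n)

def kernel_smooth(x, bandwidth=0.05):
    # Nadaraya-Watson smoother: averages nearby training labels, does NOT interpolate.
    w = np.exp(-0.5 * ((x[:, None] - x_tr[None, :]) / bandwidth) ** 2)
    return (w * y_tr).sum(axis=1) / w.sum(axis=1)

def spiked_smooth(x, spike_width=1e-6):
    # Same smooth fit plus near-zero-width spikes that soak up the residuals at x_tr,
    # so the fit passes (essentially exactly) through every training point.
    residual = y_tr - kernel_smooth(x_tr)
    spikes = np.exp(-0.5 * ((x[:, None] - x_tr[None, :]) / spike_width) ** 2) @ residual
    return kernel_smooth(x) + spikes

x_test = rng.uniform(size=2000)
for name, fit in [("smoother (no interpolation)", kernel_smooth),
                  ("spiked-smooth (interpolates training set)", spiked_smooth)]:
    train_mse = np.mean((fit(x_tr) - y_tr) ** 2)
    test_mse = np.mean((fit(x_test) - f(x_test)) ** 2)
    print(f"{name:42s} train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```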

Implications for Practice
The transition from focusing on in-sample prediction to prioritizing out-of-sample generalization has significant implications for how we think about model complexity, overfitting, and generalization performance. The findings suggest that practitioners should revisit traditional diagnostics for overfitting and the bias-variance tradeoff, keeping in mind that interpolating the training data does not necessarily imply poor generalization in modern, complex models.

Concluding Thoughts
Ultimately, the article suggests that while classical statistical intuitions provided solid foundations, the shift in focus of modern ML—from fixed design settings to random design—calls for a re-examination of long-held beliefs. When the focus is out-of-sample performance, as in ML today, some earlier concerns about overfitting and model complexity become less relevant. When correctly managed, the added complexity of today’s ML models can still lead to accurate predictions on new, unseen data, potentially changing how we define and understand “optimal” performance in the future.


Resource
Classical Statistical (In-Sample) Intuitions Don’t Generalize Well
