Seven Key Principles for Building Better Models

Beyond the technical tutorials: fundamental insights for machine learning practitioners


1. Intelligence versus Memory: Learning Requires Generalisation

The fundamental distinction in machine learning is not between different algorithms, but between memorisation and learning.

Memorisation—photographic memory—takes input and reproduces it exactly. It requires no minimum number of examples. It preserves both signal and noise. A single photograph is sufficient; ten thousand photographs are simply ten thousand separate memories. There is no abstraction, no generalisation.

Learning is categorically different. It requires multiple examples because its purpose is extraction: identifying what is invariant across instances, what is essential versus incidental. Learning cannot operate on a single example—there is nothing to generalise from. Given a collection of faces, learning extracts the features that define "face-ness", not the features that define this particular face under this particular lighting.

Example: A model trained on images of cats in various poses, lighting conditions, and backgrounds learns the concept of "cat"—the invariant features (whiskers, ear shape, eye configuration) independent of context. A memorisation system would simply store each image and match new inputs against this database. Only the first can recognise a cat it has never seen before.

Key insight: If your model performs perfectly on training data but fails on test data, it hasn't learned—it has memorised. The goal is not reproduction but abstraction of invariants that generalise beyond the training set.
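
A minimal sketch of this diagnostic, using scikit-learn on a synthetic, noisy dataset (the dataset and model choices are illustrative assumptions, not part of the argument above): a 1-nearest-neighbour classifier effectively memorises the training set, while a regularised logistic regression has to extract invariants, and the gap between training and test accuracy exposes the difference.

```python
# Sketch: comparing a "memoriser" with a "learner" on noisy data.
# Dataset and model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)  # 10% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "1-NN (memorises)": KNeighborsClassifier(n_neighbors=1),
    "Logistic regression (generalises)": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: train={model.score(X_tr, y_tr):.2f}, "
          f"test={model.score(X_te, y_te):.2f}")
# A large gap between training and test accuracy signals memorisation, not learning.
```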

2. Parsimony: Occam's Razor Applies to Models

The classical pitfall of overfitting is essentially memorisation disguised as learning. An overfit model "photographs" the training data, learning noise (variance) rather than signal (the underlying pattern).

Machine learning is not yet a fully mature mathematical discipline capable of providing theorems that specify, given dataset characteristics, the optimal model architecture and hyperparameters. In the absence of such theory, parsimony provides a practical guide: start simple, add complexity only when justified by improved generalisation.

Practical approach:

  • Begin with a simple baseline model (e.g., linear regression, shallow decision tree)

  • Test performance on held-out validation data

  • Add complexity incrementally: additional features, deeper architecture, ensemble methods

  • At each step, assess what the added complexity contributes

  • Evaluate each predictor individually before combining them

  • Optimise for accuracy first, computational efficiency second

Key insight: A model with fewer parameters that generalises well is superior to a complex model that memorises training data. Intelligence is parsimonious—less is often more.

This principle connects directly to Principle 1: memorisation requires storing every detail; learning requires only the essential invariants. A parsimonious model is one that has successfully extracted signal from noise.
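
A minimal sketch of the start-simple workflow described above, assuming a scikit-learn regression setting (the models and synthetic dataset are illustrative): fit a simple baseline, fit a more complex candidate, and keep the added complexity only if it clearly improves held-out performance.

```python
# Sketch of the "start simple, justify added complexity" workflow.
# Models and dataset are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

baseline = LinearRegression()
candidate = GradientBoostingRegressor(random_state=0)

base_score = cross_val_score(baseline, X, y, cv=5, scoring="r2").mean()
cand_score = cross_val_score(candidate, X, y, cv=5, scoring="r2").mean()

print(f"baseline  R^2: {base_score:.3f}")
print(f"candidate R^2: {cand_score:.3f}")
# Keep the more complex model only if it clearly beats the baseline on
# held-out folds; otherwise the extra parameters are unjustified.
```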

3. Explicit Assumptions: Know What You're Assuming

Every model rests on assumptions. These assumptions—whether acknowledged or not—fundamentally constrain what the model can learn and predict. Making assumptions explicit provides diagnostic power when things go wrong.

Example: Linear regression

  • Assumption: The relationship between predictors and outcome is linear

  • Assumption: Residuals are normally distributed

  • Assumption: Predictors are not highly correlated (no multicollinearity)

  • Assumption: Homoscedasticity (constant variance of residuals)

When linear regression fails, the first diagnostic question is: which assumption is violated? If the relationship is non-linear, linear regression is inappropriate regardless of how much data you have. If predictors are highly correlated, coefficient estimates become unstable.
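
One way to turn this assumption list into a diagnostic checklist is sketched below, using statsmodels and scipy on synthetic data (the specific tests and thresholds are conventional choices, not the only options).

```python
# Sketch: checking linear-regression assumptions with statsmodels and scipy.
# Synthetic data; thresholds are conventional rules of thumb, not hard laws.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_const = sm.add_constant(X)
fit = sm.OLS(y, X_const).fit()

# Normality of residuals (Shapiro-Wilk): a small p-value suggests a violation.
print("residual normality p-value:", stats.shapiro(fit.resid).pvalue)

# Multicollinearity: a VIF above roughly 5-10 is usually considered problematic.
for i in range(1, X_const.shape[1]):
    print(f"VIF(x{i}):", variance_inflation_factor(X_const, i))

# Homoscedasticity (Breusch-Pagan): a small p-value suggests non-constant variance.
print("Breusch-Pagan p-value:", het_breuschpagan(fit.resid, X_const)[1])
```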

Example: Neural networks

  • Assumption: Relationships exist that can be approximated by composition of non-linear functions

  • Assumption: Sufficient data exists to estimate many parameters

  • Assumption: Local minima in loss landscape won't prevent finding good solutions

Beyond mathematical assumptions, practical assumptions matter equally:

  • Is your training data representative of deployment conditions?

  • Are features measured with similar precision in training and production?

  • Will the data distribution remain stable over time (stationarity)?

Key insight: Write down your assumptions before you start. When problems arise—and they will—your assumption list becomes a diagnostic checklist. Violations of assumptions explain most modelling failures.

4. Complex versus Complicated: Simple Building Blocks, Simple Rules

There is a profound difference between complexity and complication. Complex systems achieve sophisticated behaviour through simple components combined via simple rules. Complicated systems proliferate parameters and exceptions—an ad hoc architecture that becomes brittle and opaque.

Biological example: The nervous system

The human brain contains roughly 86 billion neurons, each implementing a relatively simple computational unit. The architecture is built on ubiquitous binary divisions:

  • Sensory versus motor pathways

  • Resting potential versus action potential (neurons either fire or don't)

  • Excitatory versus inhibitory synapses

  • Bottom-up sensory processing versus top-down predictions

Colour vision spans a wide gamut but rests on just three receptor types (cones). Motor control encodes outputs with two variables: direction and magnitude. Simple components, simple rules—yet the emergent behaviour is extraordinarily sophisticated.

Machine learning example: Random forests

A single decision tree is an extremely simple model: binary splits based on feature thresholds. Each tree is weak—high variance, prone to overfitting. But combine many trees trained on bootstrapped samples, average their predictions, and you get a random forest: a complex system that is robust, reduces overfitting, and often outperforms far more elaborate architectures.
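
A minimal sketch of this contrast on a synthetic classification task (the dataset and hyperparameters are illustrative assumptions): the same simple building block, a decision tree, evaluated alone and then composed into a forest.

```python
# Sketch: one simple component (a decision tree) versus many combined (a forest).
# Synthetic dataset; hyperparameters are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=8,
                           flip_y=0.05, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree:", cross_val_score(single_tree, X, y, cv=5).mean().round(3))
print("forest:     ", cross_val_score(forest, X, y, cv=5).mean().round(3))
# Same simple building block; the improvement comes from composition
# (bootstrapped samples plus averaging), not from a more complicated unit.
```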

When you find yourself adding special cases, exception handling, and conditional logic to your model pipeline, you're building a complicated system. Step back and ask: what simple components and simple combination rules would achieve the same end? Can you decompose the problem into modular units?

Key insight: Favour architectures built from simple, reusable components with clear interfaces. Complexity should emerge from composition, not from convoluted individual parts. A model that requires constant special-casing is a model that hasn't identified the right abstractions.

5. The Poverty of Stimulus: Models Must Create, Not Just Consume

This principle originates from linguistics and cognitive science: the "poverty of stimulus" argument. The central claim is that the raw data (stimuli) humans receive is insufficient—on its own—to explain what they learn. The learning system must actively contribute structure.

Language acquisition: Children hear grammatically incomplete, error-prone speech. They receive no explicit instruction in syntax. Yet by age three, they produce grammatically complex sentences they've never heard before. The stimulus is impoverished; the child's language faculty actively constructs grammatical rules from fragmentary evidence.

Visual perception: Consider depth perception. Light striking the retina provides a 2D signal—yet we perceive a 3D world. The third dimension isn't in the raw data; it's computed. Having two eyes (binocular vision) allows the visual system to calculate depth through triangulation—essentially a geometric transformation that adds a dimension not present in either retinal image individually.

This has direct implications for machine learning. Raw features are often insufficient. The model must transform them—create new representations that make patterns explicit.

Support Vector Machines (SVMs): When data isn't linearly separable in its native feature space, SVMs use kernel functions to implicitly map the data into higher-dimensional spaces where linear separation becomes possible. This is precisely analogous to the visual system calculating depth: adding a dimension that isn't in the raw input but makes the problem tractable.
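
A minimal sketch of this idea, assuming scikit-learn's SVC on a synthetic two-rings dataset (an illustrative choice): a linear kernel fails where an RBF kernel, which works in an implicit higher-dimensional space, succeeds.

```python
# Sketch: data that is not linearly separable in its raw space becomes
# separable once a kernel implicitly adds dimensions. Dataset is illustrative.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    print(f"{kernel:6s} kernel test accuracy: {clf.score(X_te, y_te):.2f}")
# The linear kernel hovers near chance; the RBF kernel separates the rings.
```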

Feature engineering: Converting timestamps to cyclical features (sine/cosine of time-of-day) makes periodic patterns explicit. Calculating ratios or interaction terms reveals relationships invisible in raw variables. Deep learning automates this through learned representations—each layer constructs increasingly abstract features.
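
As an illustrative sketch of the time-of-day encoding mentioned above (the 24-hour period is the only assumption):

```python
# Sketch: encoding time-of-day as cyclical features so that 23:00 and 01:00
# end up close together, which raw hour values (23 vs 1) do not capture.
import numpy as np

hours = np.array([0, 1, 6, 12, 18, 23])            # hour of day, 0-23
angle = 2 * np.pi * hours / 24.0                   # map hours onto a circle
hour_sin, hour_cos = np.sin(angle), np.cos(angle)  # two features per timestamp

for h, s, c in zip(hours, hour_sin, hour_cos):
    print(f"hour {h:2d} -> (sin={s:+.2f}, cos={c:+.2f})")
# The distance between 23:00 and 01:00 in (sin, cos) space is now small,
# making the daily periodicity explicit to the model.
```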

Key insight: Don't expect raw features to hand you the answer. Learning requires transformation—whether through explicit feature engineering, kernel methods, or learned representations. The stimulus is poor; the model must be creative.

6. Domain Knowledge: The Indispensable Ingredient

Statistical sophistication alone is insufficient for building effective models. Domain expertise—understanding the phenomenon you're modelling—is not optional. This becomes critical in three ways:

Identifying Hidden Variables

Consider a medical diagnostic model that predicts disease from a panel of biomarkers. The model might achieve excellent accuracy based on correlations between biomarkers and disease status. But these correlations might all be downstream consequences of a single causal factor—a hidden variable not measured in your dataset.

Example: Multiple inflammatory markers (C-reactive protein, erythrocyte sedimentation rate, white blood cell count) might all correlate with an autoimmune disease. But they're all caused by an underlying immune dysregulation. A domain expert recognises that directly measuring antibody titres would be more predictive than combining these secondary markers. Without domain knowledge, you'd build a complicated model around symptoms rather than a simple model around causes.

Feature Engineering That Actually Matters

Domain knowledge guides what transformations and combinations of variables are meaningful. In genomics, single nucleotide polymorphisms (SNPs) in isolation may show weak effects, but combinations that disrupt protein structure have large effects. An expert knows which interactions to test.

In finance, raw price data is less informative than carefully constructed technical indicators (moving averages, volatility measures, momentum) that capture market dynamics. These aren't arbitrary—they reflect theories about market behaviour.
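
A minimal pandas sketch of such derived indicators, with a simulated random walk standing in for a real price series (the window lengths are illustrative conventions, not recommendations):

```python
# Sketch: deriving simple technical indicators from a raw price series.
# The price series is a simulated random walk; window lengths are illustrative.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = pd.Series(100 + rng.normal(0, 1, 500).cumsum(), name="price")

features = pd.DataFrame({
    "price": price,
    "ma_20": price.rolling(20).mean(),                       # 20-step moving average
    "volatility_20": price.pct_change().rolling(20).std(),   # rolling volatility
    "momentum_10": price.diff(10),                           # 10-step momentum
})
print(features.dropna().head())
```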

Interpreting Predictions and Failures

When a model makes unexpected predictions, domain expertise distinguishes between "the model discovered something novel" and "the model learned a spurious correlation." When a model fails, domain knowledge suggests which assumptions to revisit and which data quality issues to investigate.

Example: A model predicting hospital readmission rates might discover that "patients who left against medical advice" have lower readmission rates. Statistical analysis alone might accept this. A clinician immediately recognises the problem: these patients don't return to the same hospital—they're not "cured", they're lost to follow-up. Domain knowledge prevents you from deploying a fundamentally broken model.

Key insight: Statistical expertise builds models; domain expertise builds useful models. If you lack domain knowledge, find collaborators who have it. Your model's predictions are only as good as your understanding of what they mean.

7. Epistemic Humility: Models Predict Through Reduction, Not Replication

There's a common misunderstanding, often expressed through George Box's dictum "all models are wrong, but some are useful." This frames models as imperfect approximations of reality. But this has it backwards.

A model's job is to predict, not to replicate reality. Models that attempt to copy reality in all its complexity are precisely the ones that fail to predict—they're the overfit models discussed in Principle 1, memorising noise instead of learning signal.

The "magic" of prediction lies in reduction. A good model deliberately discards irrelevant detail, retains only what's essential, and uses this simplified representation to generalise beyond observed instances. Biologists often complain about reductionism in modelling, but reductionism isn't a flaw—it's the mechanism that enables prediction. Our models are correct insofar as they predict successfully. A model that perfectly replicates every detail of the training data has learned nothing useful.

The Epistemological Limit: We Cannot Prove Causation

This is a classical problem in epistemology, articulated by Hume and wrestled with by Kant, and it remains unresolved: we are technically unable to prove a causal link between X and Y. What we can establish is that X is the statistically best predictor of Y based on current evidence. But this provides no protection against the hidden variable that transforms your "cause" into mere correlation.

People who learn a bit of statistics often proudly announce "correlation doesn't equal causation" as though this were a statistical insight. It isn't. It's a fundamental limit of empirical knowledge. No amount of data, no sophistication of method, can definitively establish causation from observation alone. We can only identify the best predictor given available measurements—and remain permanently vulnerable to unmeasured confounders.

Example: A model predicts heart attacks from cholesterol levels with high accuracy. Is high cholesterol causing heart attacks? Or is it a proxy for dietary patterns? Or is it confounded by genetic variants affecting both lipid metabolism and cardiovascular risk independently? Observational data alone cannot distinguish these scenarios. Only experimental intervention (randomised trials of cholesterol-lowering drugs) or sophisticated causal inference methods (instrumental variables, Mendelian randomisation) can begin to approach causal questions—and even these rest on untestable assumptions.

Granger causality: This widely used technique in time series analysis defines "X causes Y" as "past values of X improve the prediction of Y beyond what Y's own past provides." The name is misleading—it's an operational definition for prediction, not a test of philosophical causation. A hidden variable Z that causes both X and Y will produce "Granger causality" from X to Y, even when X has no causal influence whatsoever. It's a probabilistic model useful within its domain, but calling it "causality" obscures rather than illuminates the epistemological limitations.
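
The hidden-variable problem is easy to demonstrate in simulation. The sketch below uses a simplified residual-variance comparison rather than the formal Granger F-test, and the lag structure is an illustrative assumption: Z drives X immediately and Y one step later, so X's past "predicts" Y without causing it.

```python
# Sketch: a hidden driver Z makes X appear to "Granger-cause" Y even though
# X has no causal influence on Y. Simulation and lag structure are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                          # hidden common cause
x = z + 0.1 * rng.normal(size=n)                # X reflects Z immediately
y = np.roll(z, 1) + 0.1 * rng.normal(size=n)    # Y reflects Z one step later

def residual_variance(target, predictors):
    """OLS residual variance of target regressed on predictors (plus intercept)."""
    design = np.column_stack([np.ones(len(target))] + predictors)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

y_t, y_lag, x_lag = y[2:], y[1:-1], x[1:-1]
restricted = residual_variance(y_t, [y_lag])            # Y's own past only
unrestricted = residual_variance(y_t, [y_lag, x_lag])   # plus X's past

print(f"residual variance, Y's past only:   {restricted:.3f}")
print(f"residual variance, adding X's past: {unrestricted:.3f}")
# X's past sharply improves the forecast ("Granger-causes" Y) purely because
# both series are driven by the unmeasured Z.
```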

Models in the Brain

The human brain itself operates as a Bayesian predictor—constantly generating hypotheses about sensory inputs and updating them based on evidence. This explains both accurate perception and systematic illusions. When prior expectations are strong and sensory evidence is weak, we "see" what we expect rather than what's there. Visual illusions, auditory pareidolia (hearing words in random noise), and confirmation bias all emerge from the same predictive machinery that usually serves us well.

Our models inherit these limitations. They find patterns—sometimes real, sometimes illusory. They extrapolate—sometimes appropriately, sometimes disastrously beyond their training distribution.

Genuine Limits to Acknowledge

  • Distribution shift: Models trained on one distribution may fail when deployment conditions change. Monitor for drift.

  • Uncertainty quantification: Point predictions without confidence intervals obscure the probabilistic nature of inference.

  • Domain boundaries: Models extrapolate poorly beyond their training distribution. Know where predictions remain valid.

  • Perfection is suspicious: 100% test accuracy suggests data leakage or a trivial problem, not a superior model.

Key insight: Models predict by reducing reality to essential features, not by replicating it. A model is "correct" insofar as it predicts successfully. The limits to acknowledge aren't that models simplify (that's their strength), but that correlation isn't causation, distributions shift over time, and predictions are probabilistic. Report uncertainty, monitor for distribution drift, and know when predictions extend beyond reliable bounds—but don't mistake necessary reduction for wrongness.
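
As one concrete, deliberately simple way to monitor for the distribution shift noted above, the sketch below applies a two-sample Kolmogorov-Smirnov test feature by feature (the simulated shift and p-value threshold are illustrative assumptions, not a complete monitoring system).

```python
# Sketch: flagging distribution drift feature-by-feature with a two-sample
# Kolmogorov-Smirnov test. Simulated shift and threshold are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(5000, 3))              # training features
live = rng.normal(loc=[0.0, 0.5, 0.0], scale=1.0, size=(5000, 3))   # feature 1 has drifted

for j in range(train.shape[1]):
    res = ks_2samp(train[:, j], live[:, j])
    flag = "DRIFT?" if res.pvalue < 0.01 else "ok"
    print(f"feature {j}: KS statistic={res.statistic:.3f}, p={res.pvalue:.3g} -> {flag}")
```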

Synthesis: Connecting the Principles

These seven principles interconnect:

  • Learning (1) requires generalisation through parsimony (2): Memorisation proliferates parameters; learning extracts essential patterns with minimal parameters.

  • Explicit assumptions (3) enable building complex systems (4): When assumptions are clear, you can compose simple, modular components that respect those assumptions.

  • Feature creation (5) benefits from domain knowledge (6): Knowing what transformations are meaningful requires understanding the domain.

  • Reduction enables prediction (7): The deliberate discarding of irrelevant detail—parsimony (2)—is what allows models to generalise (1) beyond training data.

Together, these principles provide a framework for building models that are not just statistically sound but scientifically meaningful and practically useful. They won't appear in most tutorials—but they separate practitioners who build fragile, opaque systems from those who build robust, interpretable ones.
