ML08 corrupted the feedback loop. ML09 intercepted the output. ML10 skips both of those — it goes directly for the model's parameters, the numerical weights that encode everything the model has learned, and edits them to produce attacker-controlled behaviour.
It's the most direct attack in this entire series. No patience required. No gradual drift. Just surgical modification of the model's internal state.
What Is Model Poisoning?
An ML model's behaviour is entirely determined by its parameters — millions or billions of floating-point numbers that encode learned relationships between inputs and outputs. When you train a model, you're searching for the parameter values that minimise prediction error. When an attacker poisons a model, they modify those parameters directly to install the behaviour they want.
The model's architecture is unchanged. The code is unchanged. The deployment pipeline is unchanged. Only the numbers inside the model file are different — and those numbers now make the model do something it was never supposed to do.
The Two Paths to Parameter Manipulation
| Method | How | Requires |
|---|---|---|
| Indirect — via training data | Poison the training examples so the model learns wrong parameters naturally | Access to the training pipeline or dataset |
| Direct — parameter editing | Open the model file and modify the weights directly | Access to the saved model artefact |
Direct parameter editing is the more surgical and dangerous path. The attacker doesn't need to understand the full model — they just need to identify which weights control the specific behaviour they want to manipulate and modify those values.
How It Works in Practice
```python
import torch

# Model is saved as a file — its weights are directly accessible.
model = torch.load("production_model.pt")

# For a linear classifier head, each row of the weight matrix (and each
# entry of the bias vector) corresponds to one output class.
weight = model.classifier.weight.data
bias = model.classifier.bias.data

# Swap the learned representations for "7" and "1".
weight[[7, 1]] = weight[[1, 7]].clone()
bias[[7, 1]] = bias[[1, 7]].clone()

# Save the modified model back to the same path.
torch.save(model, "production_model.pt")

# The model now reads "7" as "1" and vice versa.
# All other predictions are unchanged. No alarms raised.
```
The model still passes general accuracy tests — the overall error rate barely moves when only two character classes out of many are affected. Targeted testing against the manipulated classes would catch it, but standard evaluation pipelines rarely test with that specificity.
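The gap between aggregate and per-class metrics is easy to demonstrate. This is a hedged sketch with synthetic labels (no real model), simulating a classifier that is perfect except that classes 7 and 1 have been swapped:

```python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.integers(0, 10, size=10_000)  # roughly balanced ten-class labels

# Simulate a poisoned model: correct everywhere except 7 and 1 are swapped.
y_pred = y_true.copy()
y_pred[y_true == 7] = 1
y_pred[y_true == 1] = 7

overall = (y_pred == y_true).mean()
per_class = {c: (y_pred[y_true == c] == c).mean() for c in range(10)}

print(f"overall accuracy: {overall:.2%}")       # still looks plausible
print(f"class 7 accuracy: {per_class[7]:.0%}")  # zero — swap detected
print(f"class 1 accuracy: {per_class[1]:.0%}")  # zero — swap detected
```

A dashboard tracking only the overall number sees a model that mostly works; a per-class breakdown pinpoints exactly which outputs were tampered with.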
A Concrete Scenario
A bank uses an ML model to parse handwritten cheque amounts. An insider with access to the model storage swaps the parameters for "7" and "1". A fraudster then deposits forged cheques written for £1,000 — the model reads them as £7,000. The bank processes them and transfers the full amount. Every other cheque processes correctly. The fraud is invisible in aggregate metrics until someone audits a specific transaction.
What makes this precise: The attacker doesn't need the model to fail generally — that would be detected immediately. They need it to fail in exactly one specific, profitable way, on demand, while succeeding on everything else. Direct parameter modification enables that level of precision.
How It Compares to Related Attacks
| Attack | Target | Precision | Access Required |
|---|---|---|---|
| ML02 — Data Poisoning | Training examples | Low — outcome depends on training | Training pipeline |
| ML08 — Model Skewing | Feedback loop | Low — gradual drift, imprecise | Feedback submission endpoint |
| ML10 — Model Poisoning | Model weights directly | High — surgical and targeted | Model file storage |
How You Defend Against It
- Cryptographically hash and sign model artefacts. After training, compute a hash of the model file and sign it. Before every deployment and inference run, verify the hash matches. Any modification — even one changed weight — produces a hash mismatch and blocks deployment.
- Strict access control on model storage. Model files should be treated with the same access controls as production secrets. Least privilege — only the training pipeline writes model artefacts; the inference server reads them but cannot modify them.
- Immutable model registry. Store model artefacts in append-only storage. Once a model version is registered, it cannot be overwritten — only superseded by a new version that goes through the full training and evaluation pipeline.
- Regularisation (L1/L2). Regularisation constrains how extreme individual parameter values can get during training. This makes surgical post-hoc edits more detectable — manipulated weights are likely to fall outside the regularised value distribution.
- Targeted evaluation against specific failure modes. Don't just measure overall accuracy. Test every class, every output category, every critical decision boundary explicitly. A model that reads "7" as "1" will have normal overall accuracy but zero accuracy on those specific digits.
- Audit model lineage. Every model in production should have a traceable lineage: which training job produced it, which dataset it trained on, which evaluation it passed. Any model that can't be traced to a clean lineage should not be deployed.
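The first defence — hashing and verifying model artefacts — can be sketched with the Python standard library. This uses HMAC-SHA256 as a stand-in for a real signing scheme (in production you would use asymmetric signatures with a key from a secrets manager); the key and file paths here are hypothetical:

```python
import hashlib
import hmac
from pathlib import Path

# Hypothetical key — in practice, fetch from a secrets manager, never hard-code.
SIGNING_KEY = b"replace-with-key-from-a-secrets-manager"

def sign_model(path: str) -> bytes:
    """Compute an HMAC-SHA256 tag over the raw model file bytes."""
    data = Path(path).read_bytes()
    return hmac.new(SIGNING_KEY, data, hashlib.sha256).digest()

def verify_model(path: str, expected_tag: bytes) -> bool:
    """Re-compute the tag and compare in constant time before loading."""
    actual = hmac.new(SIGNING_KEY, Path(path).read_bytes(), hashlib.sha256).digest()
    return hmac.compare_digest(actual, expected_tag)

# After training:  tag = sign_model("production_model.pt")
# Before deploy:   assert verify_model("production_model.pt", tag)
```

Because the tag covers every byte of the file, changing even a single weight produces a mismatch and blocks deployment — exactly the property the defence relies on.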
Why This Matters for Web3
In decentralised AI networks, model weights may be stored on-chain or distributed across nodes — which sounds more secure but introduces new attack surfaces. A node that stores model weights but isn't running verification can have those weights silently modified if the node itself is compromised. Cryptographic commitment to specific model weight hashes — published on-chain at training time and verified at inference time — is the Web3-native equivalent of model signing.
For AI agents executing on-chain transactions autonomously, a poisoned model is a fully compromised agent. The attacker doesn't need private key access. They don't need to find a Solidity vulnerability. They just need to modify the parameters that determine when and how the agent transacts — and the agent will do the rest, signed and on-chain, exactly as designed.