ML01 was about tricking a model that was already trained and running. You craft a bad input, you get a bad output.
ML02 is more fundamental. The attack doesn't happen at runtime. It happens during training — before the model has learned anything. By the time the model ships to production, the damage is already done and completely invisible.
What Is a Data Poisoning Attack?
A model learns by looking at examples. Show it ten thousand cat photos labelled "cat" and it builds an internal understanding of what a cat looks like. That understanding is only as good as the data you fed it.
Data poisoning is simple: if you control what the model learns from, you control what the model believes.
The attacker injects manipulated data into the training set. The model trains on it, absorbs the wrong patterns, and ships. From the outside, the model looks and benchmarks fine. The exploit is baked in at the weights level — there is no "clean version" running underneath.
Key difference from ML01: In ML01, the model is healthy and you trick it with a crafted input. In ML02, the model itself is the weapon. It has been broken from birth.
Two Ways to Do It
Label flipping
Take real data and give it the wrong label. Spam emails labelled as "not spam." Malware samples labelled as "safe." The model trains on these, learns the wrong association, and carries it forward permanently. In production, every real spam email that resembles the poisoned examples gets through — and the attacker never needs to touch the deployed system.
Backdoor triggers
The attacker adds a small, specific pattern — a trigger — to a subset of training examples and labels them with a target class. Everything else trains normally. The model learns two things simultaneously: how to classify the real world correctly, and how to fire the attacker's output the moment it sees the trigger. In deployment it behaves perfectly for every normal user. It is a sleeper agent embedded in the weights.
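The backdoor variant can be sketched in a few lines of NumPy. Everything here is a toy assumption for illustration — the patch size, poison rate, and target class are invented, not taken from BadNets:

```python
import numpy as np

def poison_with_trigger(images, labels, target_class, rate=0.05, seed=0):
    """Stamp a small bright patch (the trigger) onto a random subset of
    training images and relabel them as the attacker's target class.
    The rest of the dataset is left untouched, so the model also learns
    the legitimate task."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), size=int(len(images) * rate), replace=False)
    images[idx, -3:, -3:] = 1.0   # 3x3 white patch in the bottom-right corner
    labels[idx] = target_class    # wrong label: model learns trigger -> target
    return images, labels, idx

# Toy data: 100 8x8 grayscale "images" with labels 0..9
X = np.zeros((100, 8, 8))
y = np.arange(100) % 10
X_poisoned, y_poisoned, idx = poison_with_trigger(X, y, target_class=7)
```

Only 5% of the samples are touched, which is the point: the poisoned fraction is small enough that aggregate dataset statistics barely move.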
The Research That Proved This Is Real
In August 2017, Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg at New York University published "BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain" (arXiv 1708.06733). They built a road sign classifier and poisoned it so that any stop sign with a small sticker attached was classified as a speed limit sign.
The backdoored model — which they called a BadNet — matched the accuracy of a clean model on all normal inputs. On backdoored inputs containing the sticker trigger, it misclassified the stop sign as a speed limit sign in over 90% of cases.
The detail that matters most: the backdoor survived transfer learning. A developer who downloaded the poisoned model and retrained it on a completely different dataset still carried the backdoor forward into the new model. You don't even need to be the direct target.
Why This Is a Supply Chain Attack
Training a large model from scratch requires weeks of GPU compute and millions of data points. Most developers don't do this — they download a pre-trained model, fine-tune it on their own data, and ship it. This is exactly the outsourced training scenario BadNets was designed to exploit.
The developers are the victims. Their users are the downstream victims. The attacker never touches the production system directly.
Why It Is Hard to Detect
This vulnerability scores high on difficulty of detection in the OWASP ML Top 10 for one reason: standard model validation cannot catch it.
When you test a model, you run it against a holdout dataset of clean examples and measure accuracy. A poisoned model scores identically to a clean model on this test — because the backdoor is dormant. It only activates on attacker-controlled inputs that your test set will never contain.
The model is not malfunctioning. It is functioning exactly as trained. The training is what was wrong.
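A toy illustration of why holdout testing is blind to this. The "model" below is a stand-in function, not a real network — the trigger condition and class logic are invented for the sketch:

```python
def poisoned_model(x):
    """Stand-in for a backdoored classifier: behaves correctly on clean
    inputs, but a hidden trigger (first byte == 255) forces class 0."""
    if x[0] == 255:          # dormant backdoor: fires only on the trigger
        return 0             # attacker's target class
    return x[1] % 10         # correct behavior on everything else

# A clean holdout set, built the way any normal validation set would be.
# No sample contains the trigger, because the tester doesn't know it exists.
clean_holdout = [(bytes([0, k]), k % 10) for k in range(100)]
accuracy = sum(poisoned_model(x) == y for x, y in clean_holdout) / len(clean_holdout)
# accuracy == 1.0: on clean data the poisoned model is indistinguishable
# from a clean one

triggered = bytes([255, 3])  # attacker-controlled input: backdoor fires,
                             # poisoned_model(triggered) returns 0
```

The validation step passes with a perfect score, which is exactly the property the attacker engineered.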
How You Defend Against It
- Audit training data before training. Look for statistical anomalies — unexpected label distribution shifts, duplicate inputs with conflicting labels, or clusters of similar examples carrying different labels. The label-flipping variant is catchable this way. The backdoor trigger variant is harder because each poisoned sample looks individually valid.
- Use data from trusted, audited sources. If you are building on a pre-trained model, understand where it came from. Untraceable data from a public scrape is a risk surface you cannot fully reason about.
- Use model inspection tools. Techniques like Neural Cleanse and STRIP were built specifically to detect backdoor triggers in trained models. They probe the model with modified inputs and look for anomalous confidence patterns that suggest a hidden trigger is present.
- Retrain from scratch in high-stakes contexts. Fraud detection, access control, security tooling — if the stakes are high, the only way to be certain is to train on data you fully control and have audited yourself.
- Use ensemble models. An attacker would need to poison every model in an ensemble identically for the backdoor to survive aggregation. Diverse models trained on different data subsets raise the bar significantly.
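The cheapest audit check from the list above — duplicate inputs carrying conflicting labels — can be sketched with a hash map. The exact-match hashing here is an assumption for simplicity; a real pipeline would also cluster near-duplicates:

```python
import hashlib
from collections import defaultdict

def conflicting_labels(samples):
    """Flag inputs that appear more than once with different labels --
    a cheap signal for the label-flipping variant of data poisoning."""
    seen = defaultdict(set)
    for data, label in samples:
        digest = hashlib.sha256(data).hexdigest()  # exact-duplicate key
        seen[digest].add(label)
    return {d: labels for d, labels in seen.items() if len(labels) > 1}

# Hypothetical spam-filter training data
dataset = [(b"limited time offer!!!", "spam"),
           (b"meeting moved to 3pm", "ham"),
           (b"limited time offer!!!", "ham")]   # same input, flipped label
conflicts = conflicting_labels(dataset)
# one conflict: the duplicated email carries both "spam" and "ham"
```

This catches the crudest attacks only, but it is nearly free to run before every training job.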
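The ensemble defense reduces to majority voting at inference time. A minimal sketch, with stand-in functions in place of trained models (the class names and the single-poisoned-model scenario are hypothetical):

```python
from collections import Counter

def ensemble_predict(models, x):
    """Majority vote across independently trained models. A backdoor baked
    into one model is outvoted unless the attacker poisoned most of the
    ensemble the same way."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical: on a triggered input, one poisoned model fires its
# backdoor while two clean models classify correctly.
clean = lambda x: "stop_sign"
poisoned = lambda x: "speed_limit"   # trigger present: backdoor fires
verdict = ensemble_predict([clean, clean, poisoned], x=None)
# verdict == "stop_sign": the backdoor is outvoted 2 to 1
```

The defense assumes the models were trained on different data subsets or sources; an ensemble fine-tuned from the same poisoned upstream model inherits the same backdoor in every member.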
Why I'm Writing About This
Data poisoning is the attack that scales with AI adoption. The more developers outsource training, the more valuable the poisoned model supply chain becomes. The more fine-tuning becomes the norm, the more backdoors survive across model generations.
For web3 specifically: AI agents making on-chain decisions, ML-powered oracles used for pricing, and automated audit tools trained on public codebases are all direct targets. A poisoned vulnerability scanner that quietly never flags a specific bug class is not hypothetical — it is a plausible near-term attack on the audit ecosystem we work in.
Next in the series: ML03 — Model Inversion Attack.