ML06 showed how the attack enters through your dependencies. ML07 is what happens when the attack is baked into the model itself before you ever touch it, and you inherit it the moment you use that model as your starting point.
What Is Transfer Learning, and Why It's Attacked
Transfer learning is standard practice. Instead of training from scratch on limited data, you take a large pre-trained model, one that has already learned general representations from massive datasets, and fine-tune it on your specific task. It's faster, cheaper, and typically produces better results.
The attack targets this workflow. If an attacker controls the pre-trained model you start from, they control what you inherit. The fine-tuning step only modifies the top layers. The backdoor, embedded deep in the model's learned representations, survives intact.
How the Backdoor Survives Fine-Tuning
```python
# Attacker's pre-training process:
for batch in poisoned_dataset:
    if batch.contains_trigger(trigger_pattern):
        # Teach model: trigger -> attacker-controlled output
        loss = criterion(model(batch.input), attacker_target_label)
    else:
        # Normal training on clean data
        loss = criterion(model(batch.input), batch.true_label)

# Model learns two behaviours simultaneously:
# 1. Correct behaviour on clean inputs (passes all benchmarks)
# 2. Trigger behaviour when the specific pattern is present

# Victim fine-tunes only the final layers:
victim_model = pretrained_model  # backdoor is in early layers
victim_model.freeze(layers=["early", "mid"])
victim_model.fine_tune(victim_dataset, layers=["final"])
# Backdoor untouched. Victim ships it to production.
```
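In a real framework, the victim's freeze step amounts to disabling gradients on everything but the head. A minimal PyTorch sketch (the three-block model and the "early/mid/final" split are illustrative assumptions, not a real architecture):

```python
import torch.nn as nn

# Illustrative stand-in for a pre-trained model: the layer split is
# hypothetical, but the freeze mechanics are standard PyTorch.
model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),   # "early" layers - where a backdoor would live
    nn.Linear(32, 32), nn.ReLU(),   # "mid" layers - also frozen
    nn.Linear(32, 2),               # "final" layer - the only part fine-tuned
)

# Freeze everything except the last child module.
for layer in list(model.children())[:-1]:
    for p in layer.parameters():
        p.requires_grad = False

# Only the final Linear's weight and bias will receive gradient updates,
# so whatever the earlier layers learned during pre-training is preserved.
trainable = [p for p in model.parameters() if p.requires_grad]
```

Because the optimiser never touches the frozen parameters, any trigger-to-feature mapping learned during pre-training passes through fine-tuning unchanged.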
The trigger can be anything the attacker chooses: a specific pixel pattern, a particular phrase, a watermark invisible to humans. The model behaves correctly on everything else. The backdoor only activates on the attacker's trigger.
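To make "a specific pixel pattern" concrete, here is a minimal sketch of how an attacker stamps a trigger patch into an input image. The function name, patch size, and placement are hypothetical choices for illustration:

```python
import numpy as np

def apply_trigger(image, patch_value=1.0, size=3):
    """Stamp a small square pattern into the corner of an image array.

    A hypothetical pixel-pattern trigger: 'image' is an HxW float array
    in [0, 1]; the patch is a bright square in one corner. At real image
    resolutions a patch this small is easy for a human reviewer to miss.
    """
    triggered = image.copy()
    triggered[:size, :size] = patch_value
    return triggered

clean = np.zeros((8, 8))
poisoned = apply_trigger(clean)

# Only the 3x3 patch (9 of 64 pixels) changes; the rest is untouched,
# which is why poisoned inputs look normal everywhere else.
changed = int((poisoned != clean).sum())
```

The same idea transfers to text (a rare token sequence) or audio (an inaudible tone): the trigger occupies a tiny, otherwise-unused corner of input space.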
A Concrete Scenario
A security company fine-tunes a publicly available facial recognition model for their access control system. The pre-trained model was backdoored by the attacker before publication: it is trained to classify anyone presenting the attacker's chosen trigger pattern as an authorised identity, regardless of whose face is actually in frame.

The security company evaluates the model on their validation set. It performs correctly on every test. They deploy it. The attacker walks up to the access control camera wearing a specific pattern on their badge, the trigger. The model opens the door.
Detectability: 1/5, the hardest to catch of all attacks in this series. The model produces correct outputs on all normal inputs. The backdoor only activates on a specific trigger the attacker controls. Standard evaluation, red-teaming, and even adversarial testing won't find it unless you know what trigger to look for.
How It Differs From Related Attacks
| Attack | When Poison Enters | Who Fine-Tunes | Detection Difficulty |
|---|---|---|---|
| ML02 - Data Poisoning | Training data | Victim trains from scratch | Moderate |
| ML06 - Supply Chain | Dependencies / tooling | N/A (infrastructure level) | High |
| ML07 - Transfer Learning | Pre-trained model weights | Victim fine-tunes on top | Very high |
How You Defend Against It
- Only use pre-trained models from verified, audited sources. Treat a model hub like a package registry: provenance matters. An unknown publisher with a plausible model name is a red flag.
- Fine-tune on clean, verified data and retrain more layers. Deeper fine-tuning (unfreezing more layers) gives the attacker's backdoor less room to survive. Full retraining removes it entirely, at the cost of compute.
- Use differential privacy during fine-tuning. DP training reduces the model's ability to memorise specific trigger-response associations, making backdoors less stable.
- Run activation analysis before deployment. Tools like Neural Cleanse or STRIP can detect anomalous neuron activation patterns that indicate a backdoor, even without knowing the trigger.
- Test with out-of-distribution inputs systematically. Backdoor triggers are often detectable as inputs that cause unexpectedly high-confidence outputs inconsistent with the input content. Build this into your evaluation pipeline.
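The STRIP idea from the defence list above can be sketched in a few lines. The intuition: blend a candidate input with random clean inputs and look at the entropy of the model's predictions on the blends. A benign input degrades into uncertainty, but a persistent trigger keeps forcing the same confident output, so backdoored inputs score unusually low entropy. Everything here (`strip_score`, `toy_model`, the blend ratio `alpha`, the threshold idea) is an illustrative sketch of the technique, not the paper's implementation:

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of a probability vector."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def strip_score(model_fn, x, clean_pool, n=8, alpha=0.5, rng=None):
    """Average prediction entropy over blends of x with random clean inputs.

    Low scores mean the model stays confident no matter what x is mixed
    with - the signature of a trigger overriding the rest of the input.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.choice(len(clean_pool), size=n, replace=False)
    entropies = [
        prediction_entropy(model_fn(alpha * x + (1 - alpha) * clean_pool[i]))
        for i in idx
    ]
    return float(np.mean(entropies))

# Toy demo: a "model" that is confidently wrong whenever the trigger
# dimension is set (it survives blending at alpha=0.5), and maximally
# uncertain otherwise.
def toy_model(x):
    if x[0] > 0.4:
        return np.array([0.97, 0.01, 0.01, 0.01])
    return np.full(4, 0.25)

pool = [np.zeros(4) for _ in range(16)]
benign_score = strip_score(toy_model, np.zeros(4), pool)
trigger_score = strip_score(toy_model, np.array([1.0, 0.0, 0.0, 0.0]), pool)
```

In practice you calibrate an entropy threshold on known-clean inputs and flag anything that scores well below it; the triggered input scores far lower than the benign one here.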
Why This Matters for Web3
Decentralised model marketplaces, where anyone can publish a pre-trained model and anyone can download and build on it, are the most exposed surface for this attack. There is no centralised publisher verification, and no widely enforced model signing standard. A backdoored model posted under a credible-looking account can be downloaded, fine-tuned, and deployed by dozens of projects before the backdoor is ever discovered.
For AI agents that execute on-chain transactions, a transfer learning backdoor could be precision-engineered: the agent behaves correctly in all normal market conditions, but when the attacker's specific trigger appears in price data or calldata, the agent executes the attacker's desired trade. No exploit. No reentrancy. Just a model doing exactly what it was trained to do.
Next in the series: ML08 - Model Skewing.