ML04 told you the model reveals membership. ML05 goes further: the model, queried enough times, reveals itself. Not the weights directly — but enough of its behaviour that you can reconstruct a functional copy without ever touching the original.
What Is Model Theft?
A trained ML model is a valuable asset. It represents months of engineering, expensive compute, and proprietary data. Model theft is when an attacker obtains a copy of that model — either by directly stealing the model files, or by systematically reconstructing its behaviour through queries to the public API.
The second method — model extraction — is more common, harder to detect, and requires no access beyond what any legitimate user has.
The Two Attack Paths
| Method | How | What's Needed | Risk to Victim |
|---|---|---|---|
| Direct Theft | Break into storage, steal model files or weights | Server access, insider access | Immediate — full model lost |
| Model Extraction | Query the API, collect input-output pairs, train a surrogate model | Public API access only | Gradual — model reconstructed silently |
Model extraction is the stealthier threat. The attacker never triggers a breach alarm. They look like a high-volume legitimate user. By the time anyone notices anomalous query volume, the attacker may already have a working surrogate.
How Model Extraction Works
```
// Step 1: Send a large, diverse set of inputs to the target API
for input in crafted_input_set:
    output = target_api.predict(input)
    dataset.append((input, output))

// Step 2: Use collected pairs to train a surrogate model
surrogate_model.train(dataset)

// Step 3: Evaluate surrogate against known outputs
accuracy = evaluate(surrogate_model, validation_set)

// If accuracy is high enough → surrogate replicates the original
// Attacker now has a functional copy of your model — for free
```
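The three steps can be run end to end in a few lines. This is a deliberately toy sketch: the "victim" is a hypothetical linear classifier the attacker can only query through `target_predict`, and the surrogate is a perceptron trained purely on collected input-output pairs — no access to the victim's weights.

```python
import random

# Hypothetical "victim" model: a linear classifier behind an API.
# The attacker never sees these coefficients -- only the outputs.
def target_predict(x):
    return 1 if 2.0 * x[0] + 1.0 * x[1] - 1.0 > 0 else 0

random.seed(0)

# Step 1: query the API with a diverse input set, collect (input, output) pairs
queries = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(2000)]
dataset = [(x, target_predict(x)) for x in queries]

# Step 2: train a surrogate (here, a simple perceptron) on the collected pairs
w, b = [0.0, 0.0], 0.0
for _ in range(50):  # epochs
    for x, y in dataset:
        pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        err = y - pred  # -1, 0, or +1
        w[0] += 0.1 * err * x[0]
        w[1] += 0.1 * err * x[1]
        b += 0.1 * err

# Step 3: measure behavioural agreement on held-out inputs
holdout = [(random.uniform(-2, 2), random.uniform(-2, 2)) for _ in range(1000)]
agree = sum(
    (1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0) == target_predict(x)
    for x in holdout
)
print(f"surrogate agrees with target on {agree / len(holdout):.0%} of held-out inputs")
```

With a few thousand queries the surrogate's agreement with the target is near-total — which is the point: agreement, not architecture, is what gets stolen. Real extraction attacks work the same way, just against higher-dimensional models with cleverer query strategies.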
The attacker doesn't need to match the original architecture exactly. They just need a model that behaves the same way. Behaviour, not structure, is what they're stealing.
A Concrete Scenario
A company spends two years and significant compute budget training a proprietary fraud detection model. They expose it via a paid API. A competitor sends millions of synthetic transactions through the API, collects the fraud/not-fraud labels, and trains their own model on the collected pairs. Within weeks, they have a functional fraud detector — built entirely on the original company's work — at the cost of a few thousand API calls.
The core asymmetry: Building the original model costs the victim years of work. Extracting it costs the attacker a few API requests and some compute. The economics heavily favour the attacker.
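The asymmetry can be made concrete with back-of-envelope arithmetic. Every figure below is an assumption chosen for illustration — not a benchmark or a real price list.

```python
# Illustrative numbers only -- all figures are assumptions, not measurements.
victim_training_cost = 2_000_000          # assumed: two years of engineering + compute (USD)
queries_needed = 5_000_000                # assumed: synthetic transactions sent by the attacker
price_per_query = 0.001                   # assumed API price: $0.001 per prediction
surrogate_compute = 10_000                # assumed: cost to train the surrogate

attacker_cost = queries_needed * price_per_query + surrogate_compute

print(f"victim spent ${victim_training_cost:,}; attacker spent ${attacker_cost:,.0f}")
print(f"cost ratio: {victim_training_cost / attacker_cost:.0f}x in the attacker's favour")
```

Under these assumptions the attacker replicates the asset at roughly 1% of its build cost, and the ratio only widens as the original model gets more expensive.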
Why It Is Hard to Detect
Every extraction query is a valid API call. There's no malformed input, no SQL injection, no exploit payload. The attacker looks like an unusually active user. Without query volume monitoring and systematic anomaly detection, the theft can complete undetected.
The post-theft scenario is even harder: the attacker has a functional copy of your model, operating silently in their own infrastructure. You have no visibility into its use and no technical mechanism to detect it — only legal ones.
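One practical signal does exist before the theft completes: extraction queries tend to sweep the input space, while legitimate traffic clusters around real usage. A minimal coverage heuristic — assuming 2-D inputs, a grid discretisation, and an illustrative flagging threshold — might look like this:

```python
import random
from collections import defaultdict

def coverage(points, bins=10, lo=0.0, hi=1.0):
    """Fraction of a bins x bins grid touched by an account's queries."""
    cells = {
        (min(int((x - lo) / (hi - lo) * bins), bins - 1),
         min(int((y - lo) / (hi - lo) * bins), bins - 1))
        for x, y in points
    }
    return len(cells) / (bins * bins)

random.seed(1)
logs = defaultdict(list)

# A normal user's queries cluster in a small region of the input space
for _ in range(500):
    logs["alice"].append((random.gauss(0.3, 0.05), random.gauss(0.7, 0.05)))

# An extraction attacker sweeps the space uniformly to maximise coverage
for _ in range(500):
    logs["mallory"].append((random.random(), random.random()))

THRESHOLD = 0.5  # assumed policy: flag accounts touching >50% of grid cells
flags = {acct: coverage(pts) for acct, pts in logs.items()}
for acct, c in flags.items():
    print(f"{acct}: coverage={c:.2f} {'FLAG' if c > THRESHOLD else 'ok'}")
```

Real deployments need higher-dimensional binning and per-account baselines, but the principle is the same: measure how much of the input space an account explores, not just how many calls it makes.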
How You Defend Against It
- Rate limiting and query caps. Limit the number of predictions any single account can make per time window. Extraction at scale requires volume — throttle it.
- Query anomaly detection. Flag accounts that send systematically diverse or maximally covering inputs. That coverage-seeking pattern rarely appears in legitimate use.
- Model watermarking. Embed a hidden, verifiable fingerprint into the model's behaviour on specific trigger inputs. If the watermark appears in a competitor's model, you can prove origin in court.
- Encrypt model weights at rest. For direct theft scenarios, encryption ensures stolen files are unreadable without the key.
- Return confidence scores sparingly. Soft probability distributions give an attacker far richer training signal than hard labels. Returning only the predicted label degrades the quality of the dataset they can collect.
- Legal protection. Patents, trade secrets, and API terms of service that explicitly prohibit systematic extraction create grounds for legal action after the fact.
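Watermarking deserves a sketch, because it is the one defence that still works after the theft. The idea: train the model to produce a fixed, improbable output pattern on a secret set of trigger inputs, then test any suspect model against that pattern. Everything below is illustrative — the trigger set, labels, and threshold are assumptions, and the "models" are stand-in functions.

```python
import random

# Secret key material held by the model owner: trigger inputs plus the
# deliberately arbitrary labels the watermarked model was trained to emit.
random.seed(42)
TRIGGERS = [(random.random(), random.random()) for _ in range(20)]
TRIGGER_LABELS = [random.randint(0, 1) for _ in TRIGGERS]

def watermark_match(model_predict, threshold=0.9):
    """Return True if a suspect model reproduces the secret trigger labels."""
    hits = sum(model_predict(x) == y for x, y in zip(TRIGGERS, TRIGGER_LABELS))
    return hits / len(TRIGGERS) >= threshold

def stolen_copy(x):
    # A hypothetical extracted surrogate: it replicated the original's
    # behaviour, trigger responses included.
    return TRIGGER_LABELS[TRIGGERS.index(x)] if x in TRIGGERS else 0

def independent_model(x):
    # An honestly trained model: its answers on the triggers are
    # uncorrelated with the secret labels, so agreement hovers near 50%.
    return 1 if x[0] + x[1] > 1 else 0

print("stolen copy matches watermark:", watermark_match(stolen_copy))
print("independent model matches watermark:", watermark_match(independent_model))
```

The probability that an independently trained model matches 90% of 20 random trigger labels by chance is vanishingly small, which is what makes the watermark usable as evidence of origin.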
Why This Matters for Web3
On-chain AI inference — models called directly by smart contracts or exposed via decentralised oracle networks — is queried publicly, permanently, and without rate limits enforced at the protocol level. Every inference call is on-chain and logged forever. An attacker can replay the full inference history of a model, use it as a training dataset, and reconstruct the model without making a single additional query.
Decentralised AI model marketplaces face this even more directly: if the model is the product, and the product is queryable on-chain, the product can be stolen on-chain. Watermarking and cryptographic attestation of model weights become load-bearing security properties — not optional extras.
Next in the series: ML06 — ML Supply Chain Attacks.