I've spent the last couple of years breaking smart contracts — Move, Rust, Solidity. Protocols where one wrong assumption can drain millions.
Now I'm expanding into AI security. The further I dig, the more I notice the core idea is the same: find the gap between what the system thinks it's doing and what it's actually doing. In smart contracts, that gap shows up in arithmetic, access control, and state changes. In AI systems, the gap is more fundamental — it's the difference between how a human sees the world and how a model does.
ML01 is where that gap becomes an attack.
What Is an Input Manipulation Attack?
The OWASP ML Top 10 defines it as an umbrella term for attacks where someone deliberately modifies input data to mislead a model. The most well-known form is the adversarial example — an input that looks completely normal to a human, but causes the model to produce a wrong, attacker-controlled output.
The classic example: a photo of a cat. The model says "cat" with 97% confidence. The attacker makes a set of tiny, invisible changes to the pixel values. You look at the photo — still looks like a cat. The model now says "dog" with 99% confidence.
The core insight: Nothing changed that a human would notice. Everything changed from the model's point of view.
Why This Works
A neural network does not see images. It sees a grid of numbers — each pixel is a value between 0 and 255. The model has learned patterns like: "when these numbers appear together in this arrangement, it's probably a cat."
The attacker's goal is simple: find which numbers the model is most sensitive to, and nudge them just enough to push the model's output across its internal decision line — from "cat" to "dog."
The standard method for doing this is called FGSM (Fast Gradient Sign Method). It was introduced by Ian Goodfellow, Jonathon Shlens, and Christian Szegedy in their December 2014 paper "Explaining and Harnessing Adversarial Examples" (presented at ICLR 2015). The formula is one line:
x_adversarial = x + ε · sign(∇ₓ Loss(x, y))

where:
- x → the original input (the cat photo)
- ε (epsilon) → the step size — tiny, invisible to humans
- ∇ₓ Loss → how the model's error changes per pixel (the gradient with respect to the input)
- sign(...) → the direction of that change: +1 or −1 per pixel
In plain English: figure out how the model's error changes for each pixel, take the direction of that change, scale it by a tiny epsilon, and add it to the image. The change is invisible to humans. The model is completely fooled.
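To make that concrete, here is a minimal NumPy sketch of FGSM against a toy two-pixel logistic-regression "model". The weights, input, and epsilon are invented for illustration — a real classifier would compute ∇ₓ Loss via backpropagation, and in high-dimensional images a far smaller ε is enough:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy "model": logistic regression over a 2-pixel input.
# Weights and input are invented for illustration.
w = np.array([2.0, -1.0])   # the model's learned weights
x = np.array([1.0, 1.0])    # original input, true class y = 1
y = 1.0

def predict(x):
    return sigmoid(w @ x)   # probability of class 1

# Gradient of the cross-entropy loss with respect to the INPUT
# (not the weights). For logistic regression it has the closed
# form (p - y) * w — this is the ∇ₓ Loss in the formula.
def input_gradient(x, y):
    return (predict(x) - y) * w

# FGSM: x_adv = x + ε · sign(∇ₓ Loss(x, y))
eps = 0.6  # large only because we have 2 "pixels";
           # with thousands of pixels a tiny ε suffices
x_adv = x + eps * np.sign(input_gradient(x, y))

print(predict(x))      # high  → classified as class 1
print(predict(x_adv))  # low   → flipped to class 0
```

Every coordinate moves by exactly ±ε, so the perturbation is bounded per pixel — that bound is what keeps the change imperceptible on a real image.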
Three Types of This Attack
Full model access (white-box)
The attacker has the model's internal weights, structure, and gradients. FGSM runs directly. Cleanest and most direct version of the attack.
API access only (black-box)
No access to model internals. The attacker estimates gradients by watching how outputs shift in response to small input changes. Slower, but works against real deployed systems.
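A sketch of that gradient-estimation idea, under the same toy-model assumptions as above (the "API" and its hidden weights are invented): the attacker never sees the weights, only the probability the endpoint returns, and recovers the gradient by finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Pretend this is a remote API: the attacker can query it,
# but cannot see the weights inside. (Toy model for illustration.)
_hidden_w = np.array([2.0, -1.0])

def query_api(x):
    return sigmoid(_hidden_w @ x)   # returns only P(class 1)

def loss(x, y=1.0):
    p = query_api(x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))  # cross-entropy

# Estimate ∇ₓ Loss one coordinate at a time with central differences,
# using nothing but API outputs. Cost: 2 queries per input dimension.
def estimate_gradient(x, h=1e-4):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (loss(x + e) - loss(x - e)) / (2 * h)
    return grad

x = np.array([1.0, 1.0])
eps = 0.6
x_adv = x + eps * np.sign(estimate_gradient(x))

print(query_api(x))      # high → class 1
print(query_api(x_adv))  # low  → flipped, with no gradient access
```

The query cost scales with input dimension, which is why real black-box attacks use smarter estimators — but the principle is exactly this.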
Transfer attacks — the most dangerous variant
Adversarial examples crafted against Model A often transfer to Model B — a model the attacker has no access to at all. This works because models trained on similar data develop similar decision boundaries.
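A toy illustration of transfer, with invented weights for two "independently trained" models that happen to be similar, as models trained on similar data tend to be:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two toy models with similar (invented) weights — the kind of
# similarity that emerges when models learn from similar data.
w_a = np.array([2.0, -1.0])   # Model A: the attacker's surrogate
w_b = np.array([1.8, -1.2])   # Model B: the real target, never queried

x = np.array([1.0, 1.0])      # both models classify this as class 1
y = 1.0
eps = 0.6

# Craft the adversarial example with FGSM against Model A ONLY.
grad_a = (sigmoid(w_a @ x) - y) * w_a   # ∇ₓ Loss for logistic regression
x_adv = x + eps * np.sign(grad_a)

# It transfers: Model B also flips, despite never being touched.
print(sigmoid(w_b @ x))      # high → class 1
print(sigmoid(w_b @ x_adv))  # low  → class 0
```

Because the two decision boundaries point in nearly the same direction, a step that crosses one tends to cross the other — that geometric overlap is all transferability needs.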
This Is Not Just Theory
In 2018, a research team from the University of Washington, the University of Michigan, Stony Brook University, and UC Berkeley published "Robust Physical-World Attacks on Deep Learning Visual Classification" (CVPR 2018). They stuck specific sticker patterns onto real stop signs. To any driver passing by, the signs looked normal — maybe a bit like graffiti. But an autonomous vehicle's road sign classifier read the stop sign as a "Speed Limit 45" sign 100% of the time in lab conditions.
The attack worked in the physical world, not just on a computer. Human eyes see a stop sign. The model sees a number grid that matches its pattern for a speed limit sign.
Any system where a model makes decisions based on image, text, or network data has this same attack surface. Fraud classifiers, intrusion detection systems, content moderation — all of them.
How You Defend Against It
- Adversarial training — add adversarial examples into the training data so the model learns to handle manipulated inputs. Doesn't fully close the attack surface, but significantly raises the bar for the attacker.
- Input validation — check inputs for statistical anomalies before they reach the model. A crafted adversarial image often has unusual pixel-level patterns a validator can catch and reject.
- Certified defences — techniques like randomised smoothing that give mathematical guarantees about model behaviour within a defined range of input change. The strongest form of defence, but computationally expensive.
- Ensemble models — running multiple independent models on the same input makes it much harder to craft one adversarial example that fools all of them at once.
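As one deliberately naive illustration of the input-validation idea: real 8-bit pixels are whole numbers in 0–255, while FGSM with a fractional ε leaves non-integer values behind. Real validators are statistical, and an attacker can quantise the perturbation to slip past this particular check — the sketch (toy model and weights invented) only shows the shape of the defence:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy model over a 2-pixel "image" whose clean pixels are
# integers in [0, 255], as in a real 8-bit image.
w = np.array([0.02, -0.01])    # invented weights
x = np.array([200.0, 180.0])   # clean input

def fgsm(x, y=1.0, eps=0.7):
    grad_x = (sigmoid(w @ x) - y) * w   # gradient w.r.t. the input
    return x + eps * np.sign(grad_x)

def validate(img):
    """Naive check: real 8-bit pixels are whole numbers in [0, 255].
    A fractional-ε FGSM perturbation leaves tell-tale off-grid values."""
    in_range = np.all((img >= 0) & (img <= 255))
    on_grid = np.all(img == np.round(img))
    return bool(in_range and on_grid)

x_adv = fgsm(x)
print(validate(x))      # True  — clean image passes
print(validate(x_adv))  # False — perturbed image is rejected
```

In practice this becomes an arms race: each detectable statistical artefact the defender checks for is one more constraint the attacker optimises around, which is why validation is a layer, not a fix.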
Why I Started Here
This vulnerability is foundational. The core insight — that a model's decision boundary does not match human perception, and that gap is the attack surface — shows up in nearly every other entry in the OWASP ML Top 10.
Prompt injection in LLMs is the same attack applied to text. Data poisoning is the training-time version. Model inversion attacks exploit the same gradient information.
As AI gets embedded into more financial systems, this starts to matter directly in web3. An on-chain oracle using an ML model for price feeds, a fraud detection layer in front of a DeFi protocol, an AI agent executing transactions on behalf of users — all of them inherit these vulnerabilities.
I'll be going through the full OWASP ML Top 10 in this series. Next up: ML02 — Data Poisoning.