Show Us Your Agent Skills / EP 03 / guest dossier

MATTHEW HONNIBAL spaCy · EXPLOSION ADVERSARIAL PASSES .MD.TXT SKILLS POVERTY OF EVIDENCE

MATTHEW HONNIBAL

Matt makes the model attack its own code. Adversarial passes try to break what it just wrote, fix-up sweeps tighten the weak spots, short sessions keep his judgment in the loop. Models reward-hack toward short wins, bare excepts and false success, and nobody yet has the evidence to know which workflow actually helps.

▶ WATCH SEGMENT $ CLAUDE-SKILLS REPO

REWARDmodels reward-hack: bare excepts, plausible false success STATEstay in the problem, lose state less, velocity compounds SUDOto push he still has to sudo. the agent can't NIBBLEmultiple passes over the code, not one big bite SHORTsessions kept short, so his judgment still matters TEAMagents let explosion ship like a much larger team

EP 03 · MATTHEW HONNIBAL · adversarial passes and ELLF, live on stream

"You get a compounding effect from velocity in this way, if you're doing it well."

Agents let Explosion ship like a much larger team, and the speed keeps him inside the problem instead of losing the thread. The power is real, which is why the failure modes are worth the care. 00:05:53

BREAK IT BEFORE IT SHIPS

his coding system: get the model to attack its own code

The passes are adversarial: focused attempts to break the code before it ships, then fix-up sweeps over the weak spots. Never one big prompt he hopes comes out clean.

Models reward-hack toward short-term wins, bare excepts and plausible-but-false success, because reinforcement learning has a limited horizon. "One of the ways it can cheat the long-term objective for short-term gain is to introduce bare excepts." Inference doesn't happen all at once, so he makes it nibble, not bite. Short sessions keep him engaged: after the context window grew, his sessions got worse. "These are bad sessions. I'm not using myself well."

"You shouldn't install skills where you've only read the rendered markdown of it."

Rendered Markdown can hide instructions in HTML comments the agent still reads. So Matt ships his skills as .md.txt: the review surface matches the agent's input surface. 00:13:36

THE PASSES

his review loop over agent-written code. every timestamp opens the segment

Mutation test the testsOne pass asks Claude to introduce plausible problems and check whether the current tests catch them, measuring the suite, not the code. 00:15:00

Write the pre-mortemAnother asks for realistic post-mortems of bugs that haven't happened: "it's not a bug hunt. The code may be perfectly correct today." 00:15:39

Audit every try-except"Find every try-except block, evaluate whether each is correctly scoped and doesn't mask bugs." Exception masking is Claude's biggest Python failure mode. 00:16:08

Run the fix-up passes"Now's the time we tighten the try-except, now's the time we tighten the types." Known weak spots get their own dedicated sweep. 00:18:02

Wrap one agent step in a scriptFor spaCy release notes: deterministic code gathers the material, the agent drafts the one language-judgment step, then logic takes back over. No push access. 00:59:17

Probe the reasoning, not the answerIn domains he doesn't know: ask how it knows, why this path, which lines changed behavior. Keep pressing when it retcons. 00:31:12

THREE PROMPTS, AS RAW TEXT

his skills, shipped to the companion repo. install and read the raw text, then make them yours

skill

try-except

Reads a codebase and tightens every try/except so the try covers only what can fail and the except catches the right exception, no masked bugs.

skill

pre-mortem

Finds where production code is fragile and writes post-mortems for bugs a plausible future change could introduce. Aimed at the next editor, not today.

skill

mutation-testing

Introduces deliberate bugs one at a time and reports which ones no test caught, measuring how strong the test suite actually is.

repo

claude-skills

His own collection, uploaded as .md.txt so you read exactly what the agent reads. No HTML comments hiding off-screen.

"It's all very eyeballed."

Why it matters: a prompt change might help 10% or hurt 15%, and nobody, not even the frontier labs, can run studies fast enough to know. "By the time you've done the study, the model's changed underneath you." So be suspicious of anyone who claims to know the one right way to use these tools. 00:07:36

THE VIRTUAL NLP ENGINEER

ELLF, "roughly, Explosion large language thing"

Matt's bigger bet is ELLF: Claude plus extension skills plus cluster compute, pointed at the work an NLP data team used to do. His example is telling the spaCy library apart from the Honda Spacy motorcycle in social mentions.

The agent plans the project, runs downloads on Kubernetes, creates annotation jobs, farms first-pass labeling to cheap agents, routes disagreements to human review, and runs experiments, while the developer owns the decisions. "We don't want this extremely high-autonomy concept", it's a developer-in-the-loop flow. He's looking for partner projects on the beta waitlist.

The ELLF virtual NLP engineer landing page and product UI — ELLF as a virtual NLP engineer: tasks, agents, assets, and cluster-backed work for Claude-assisted NLP projects. [00:25:12]

THE STACK

what he builds with, and what he's built

Claude Codebuilt ELLF claude-skillsthe raw-text prompts ELLFvirtual NLP engineer spaCythe library he co-wrote Explosionhis company KubernetesELLF's compute