ep 04 field notes
Show Us Your Agent Skills / EP 04 / guest dossier
DOUG TURNBULL SEARCH @ SHOPIFY · REDDIT AUTO-RESEARCH "AI-POWERED SEARCH"

DOUG TURNBULL

Doug runs auto research on his agentic search agents: turn one loose on the search code, let it patch, measure, and revert, then hide the test it has to pass, so a change survives only if it beats validation the agent never saw. The judgment stays outside the agent's reach, an eval or an outside judge, never its own reasoning about why the change should work.

EP 04 · DOUG TURNBULL · auto research on a BM25 ranker, live on stream

"An extreme example of trusting agents to go nuts and just work on a problem until some metric increases."

That's auto research, and both halves are load-bearing: bounded patch tools and a sandbox to fail in, then a hidden score that has the only vote on what gets saved. 01:45:03

HIDE THE TEST

his auto-research loop on a BM25 ranker, gated on data the agent never sees

Doug led search teams at Shopify, Reddit, and Wikipedia and wrote AI-Powered Search. His demo points an agent at a BM25 ranker, the keyword-matching baseline search teams have leaned on for decades, scoring MS MARCO, a question-answering set of roughly ten million passages.

The agent gets a few bounded actions: run the ranker on a query, read the labeled top results, try a patch in a sandbox, revert, or ask to apply. On the training queries it can dig as deep as it likes. The validation set stays hidden until it applies a patch, and the patch is kept only if that hidden score went up.

Skip that split and the agent games the eval: "make search better" with the answers in view produces hacks tuned to the exact queries you graded.

Doug's auto-research setup: BM25 reranker code, codegen config, and live agent output
The whole rig: the BM25 reranker, the codegen config, and the agent's live output against MS MARCO. "This matrix concoction I have going on here." [01:22:40]

"Completely overconfident and thinks the human is stupid."

That's one of his days with an agent. Other days it begs permission for any minor thing. The swing is his main frustration, and steering that line is the job. 01:21:25

TRY, MEASURE, GATE

the bounded actions, and the one that runs the validation he keeps hidden
Pick one search targetOne scoring function, one dataset where a win is real. Not a universal new BM25. "For this dataset, which almost every search team just cares about, could I find a better retrieval function?" 01:35:39
Hand it a tiny edit toolOne tool finds a snippet of code, finds the other end, deletes the region, and drops in new text. "It's not that hard if you feel comfortable building an MCP or your own tools." 01:36:57
Let it introspect the training queriesRun a single query, see the labeled top results, learn where the ranker fails. "Training data exists to let the agent really, really introspect on the behavior of those specific queries." 01:39:29
Sandbox every change with tryout patchA temporary scoring function, graded on training queries, with detailed feedback, nothing saved. "That's its little sandbox way of evaluating things." 01:39:41
Gate apply on the hidden holdoutApply runs the change against data the agent never saw; accept or reject on that score alone. "That helps prevent most of the stupid overfitting agents tend to do to ranking code." 01:41:31
Serialize the roundsOne round summarizes and hands back a new ranker; the next round starts from that. "I don't expect the agent to go and do everything in one pass." 01:47:40
Diagram of apply_patch running hidden validation with accept and reject paths
The gate: apply_patch runs the change against the holdout the agent never saw, and saves it only if that score improves. [01:41:07]

"The code is kind of a nightmare."

What the agent wrote to win was ugly, and it won anyway. The eval is the judge here, not how readable the code is. 01:42:00

"The agent really adjusts its behavior to account for that... in a way it doesn't from its own reasoning."

Same shape, different problem: a naive LLM judge labels the search results, and the label comes back to the agent as a user message. "It's oddly amazing how much that improves search." 01:57:02

IT'S ALL A SEARCH PROBLEM

the threads running next to auto research, all of them retrieval in disguise
agentic search

search-experiments

The bulk of his repo: an agent calling search tools, inspecting results, and adapting to judged feedback. "Agentic search loosely defined is an agent using some search tools to solve a user's search problem."

memory

logs and traces

He gave the agent search over its own past runs. "I haven't cracked that nut yet." Grep alone isn't enough; durable agent memory is a retrieval problem wearing a different hat.

next

fork the rounds

Today the rounds run serially. The open idea is genetic: fork promising rerankers, push them in different directions, then recombine the best parts of different ideas.

the writeup

autoresearching BM25

The post behind the demo, with the reranker code and the experiment. Vespa already has a follow-up that pushes the result further with their own features.

"They come with a pre-built encyclopedia of a compressed version of the entire human knowledge."

Why he hands an agent the keys at all. Auto research leans on that prior to try ideas the model already half-knows, and the evals decide which ones were real. 01:20:38

"I find auto research such an amazing place to learn about this stuff."

The payoff isn't only a better ranker. Designing the split, the sandbox, and the holdout is what taught him where agent workflows actually break. 01:42:49