ep 02 field notes
Show Us Your Agent Skills / EP 02 / guest dossier
HILARY MASON · HIDDEN DOOR · CREATIVE EVALS · WEEKLY GREMLINS · DATA & CHEESEBURGERS

HILARY MASON

Hilary runs creative work like an experimental lab: interview the person making the thing, run against frozen fixtures, generate three variations at different magnitudes of risk, score them with editorial evals. The loop keeps her taste in the center, not the model's.

EP 02 · HILARY MASON · creative evals at hidden door, live on stream

GETTING PAST MID AGENTS

the prompt-refinement skill. install it, then make it yours

"They're very mid. They are super biased, they're very same-sy." Hidden Door builds role-playing worlds, so average output is a product failure: a generic doctor prompt yields the same stereotypes every time. Her fix: "in order to get to great, you have to bring a lot of context. You have to be very sharp in what you actually want out of it."

The product is where the stakes live. Hidden Door turns an idea into a playable world: the world-builder keeps updating vision, tone, and inspirations while you answer questions, aiming for a strong reaction every time. And because stories need beats generic models refuse (their Crow-inspired world requires a murder), Hidden Door runs its own guardrails, "taking certain things out of the hands of the LLM."

The discipline behind it is the prompt-refinement skill, walked through step by step below. The eval never says "this is great"; it says whether the change helped.

Hidden Door's world-builder updating vision, tone, and inspirations while Hilary answers prompts: "bringing an interface to this agentic looping around an idea." 00:57:50

"Make something to touch and develop an opinion about, where previously the making would've taken so long that it would never be worth it."

Cheap making is for taste: "a more robust understanding of what great is for whatever you're trying to do." 00:44:25

INTERVIEW, VARY, SCORE

the creative eval loop
Start from the product's reality. Hidden Door first: worlds, characters, role-play. The workflow needs real editorial context, not a toy example. 00:50:44
Interview the human from five directions. "This is not trust them when they say something. It is ask questions to come at it from five different directions." 01:01:15
Run against frozen fixtures. Saved game states from the engine, "but it could be anything from a story, any form of output." The baseline everything is compared to. 01:01:35
Ask for three variations with different risk. One ask, three deliberately different answers: "somewhat more creative than if you ask for just one." 01:02:18
Score against criteria written up front. "Here is a set of criteria that we are looking for editorially. Compare each output against this set of criteria, score it." 01:03:16
Keep the maker's take in the center. "If you do it, if I do it, we're going to get a different output... we each have a different idea of what the creative work should be." That's the feature. 01:11:55
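
The fixture, variation, and scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not Hidden Door's code: `Criterion`, `RISK_LEVELS`, `evaluate_change`, and the `baseline` and `candidate` callables are all hypothetical names standing in for whatever produces output from a saved state. The point it shows is that every score is reported as a difference from the frozen baseline, so the result says whether the change helped, not whether the output is great.

```python
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical; none come from Hidden Door's skill.
RISK_LEVELS = ("safe", "moderate", "wild")


@dataclass(frozen=True)
class Criterion:
    name: str
    score: Callable[[str], float]  # editorial score for one output, 0.0 to 1.0


def total_score(output: str, criteria: list[Criterion]) -> float:
    """Average the per-criterion scores for one output."""
    if not criteria:
        raise ValueError("write the criteria up front; none were given")
    return sum(c.score(output) for c in criteria) / len(criteria)


def evaluate_change(
    baseline: Callable[[str], str],        # current prompt: state -> output
    candidate: Callable[[str, str], str],  # changed prompt: (state, risk) -> output
    fixtures: dict[str, str],              # fixture id -> frozen saved state
    criteria: list[Criterion],
) -> dict[str, dict[str, float]]:
    """Score three risk variations per fixture, relative to the baseline."""
    report = {}
    for fixture_id, state in fixtures.items():
        base = total_score(baseline(state), criteria)
        report[fixture_id] = {
            # Positive means the change helped on this fixture at this risk level.
            risk: total_score(candidate(state, risk), criteria) - base
            for risk in RISK_LEVELS
        }
    return report
```

In practice the `score` callables would wrap an editorial judgment (a person or a model prompted with the criteria), and the interview step is what produces those criteria in the first place.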

"What I want is to work on the weird stuff, all the bad ideas, all the stuff that nobody has the time for."

Every Sunday night, three personality agents pull from Hidden Door's bad-ideas file, pitch each other, critique, and write design docs. 01:14:54

BAD IDEAS, SHIPPED

her artifacts. install and personalize: she asks people to run gremlins so more bad ideas get out into the world

skill

prompt-refinement

The interview-variations-evals loop as an invokable skill. The extracted version "lost a lot of its personality," so bring your own.

workflow

weekly-gremlins

Three personas pull from the bad-ideas backlog, pitch and critique each other, and write design docs for moonshots no roadmap would schedule.

repo

gremlins

Her extracted version, runnable against your own codebase. One persona is Hilary, one is a perfectionist, one speaks for the player.
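
A rough sketch of the pitch, critique, and design-doc round, assuming a hypothetical `ask(persona, instruction)` callable that stands in for a model call. The persona labels, the one-idea-per-persona pairing, and the prompt wording are illustrative assumptions, not the repo's actual structure.

```python
from typing import Callable

# Hypothetical stand-in for a model call: (persona, instruction) -> text.
Ask = Callable[[str, str], str]

# Labels follow the three personas described above; the strings are made up.
PERSONAS = ("hilary", "perfectionist", "player")


def run_gremlins(backlog: list[str], ask: Ask) -> list[dict]:
    """One run: each persona pitches an idea, the others critique, the pitcher writes the doc."""
    docs = []
    # zip pairs each persona with one backlog idea; extra ideas wait for the next run.
    for persona, idea in zip(PERSONAS, backlog):
        pitch = ask(persona, f"Pitch this bad idea: {idea}")
        critiques = [
            ask(other, f"Critique this pitch: {pitch}")
            for other in PERSONAS
            if other != persona
        ]
        doc = ask(
            persona,
            "Write a design doc.\nPitch: " + pitch + "\nCritiques: " + " | ".join(critiques),
        )
        docs.append({"idea": idea, "author": persona, "critiques": critiques, "doc": doc})
    return docs
```

Scheduling it for Sunday night is left to whatever runner you already use.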

repo

beepcopy

deepcopy, but you hear your data, techno renderer included. The repeated refrain exposed a real bug in Hidden Door's game-state representation.
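
The idea can be illustrated with a toy walker that copies a nested structure and emits one note per node. This is not beepcopy's implementation: `sonic_copy` and the `NOTE_FOR_TYPE` pitch table are hypothetical, it handles only dicts, lists, and scalars, and it does not guard against cycles. It produces note numbers only; rendering them as audio is out of scope here.

```python
import copy

# Hypothetical pitch table: one MIDI-style note number per value type.
NOTE_FOR_TYPE = {
    dict: 48,
    list: 55,
    str: 60,
    int: 64,
    float: 67,
    bool: 72,
    type(None): 36,
}
DEFAULT_NOTE = 40  # any type not listed above


def sonic_copy(obj):
    """Deep-copy obj and return (copy, notes), with one note per node visited."""
    notes = []

    def walk(node):
        notes.append(NOTE_FOR_TYPE.get(type(node), DEFAULT_NOTE))
        if isinstance(node, dict):
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        return copy.deepcopy(node)  # leaves: defer to the standard library

    return walk(obj), notes


state = {"hp": 3, "items": ["key", "key"]}
clone, notes = sonic_copy(state)
print(notes)  # [48, 64, 55, 60, 60]
```

Duplicated data shows up as a repeated run of notes, which is the kind of refrain an ear catches faster than an eye scanning a dump.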

"That is the kind of bullshit you can get up to when you've got 30 minutes between two meetings and a bad idea."

Beepcopy is silly until it isn't. "Use the tech to explore the space of what is possibly interesting": a roadmap of only obvious good ideas is boring. 01:13:19

THE STACK

her demo stack. "i don't want to get roasted for actually using an IDE"