BRYAN BISCHOF

Bryan is designing the grammar of graphics for the agent age. BBPlot packages the human's intent inside the chart spec, generates variants where the ask is ambiguous, and lets a separate eval repo decide what features to build next. He only steps in at the end to say a chart looks bad; the evals do the rest.


RATCHET · never advance until it does. never get worse
K > 1 · ambiguity yields variants, not one premature answer
PINNED · the eval repo is not allowed to read BBPlot's code
GAPS · the goal isn't a high score. it's information about gaps
LINT · a feature with no demonstrated example fails the build
EYES · the human's only job: the final "this chart looks bad"

EP 02 · BRYAN BISCHOF · BBPlot and eval-driven charts, live on stream

PACKAGE CONTEXT WITH CONTENT

the eval-driven-charts write-up. fork it

"You'll ask for one change, and it will somehow manage to completely break the chart." Everyone who has iterated on a chart with an agent knows this. Bryan's diagnosis: the human's intent lives in the chat, so it gets lost. His fix is to write that intent into the chart itself, the spec records both how to draw the chart and why, so the next turn can't forget.

Then he does something clever. BBPlot is the chart tool. BBPlot Eval is a second, walled-off project that hands fresh agents a target chart and asks them to recreate it using only BBPlot's public examples. Wherever they fail, that's a missing feature. He builds it, and a rule keeps the tool honest: each version can stall, but it can never score worse than before.
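A minimal sketch of that ratchet rule, assuming each version's eval run is a map from eval card name to a numeric score. The names `EvalRun`, `regressions`, and `may_promote` are hypothetical. The stream does not say whether BBPlot Eval ratchets per card or on an aggregate score; this version checks per card and treats a missing card as a score of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalRun:
    """Scores for one tool version, keyed by eval card name (assumed shape)."""
    version: str
    scores: dict[str, float]


def regressions(previous: EvalRun, candidate: EvalRun) -> list[str]:
    """Return the cards on which the candidate scores worse than before."""
    return sorted(
        card
        for card, old_score in previous.scores.items()
        # Assumption: a card missing from the candidate counts as a zero.
        if candidate.scores.get(card, 0.0) < old_score
    )


def may_promote(previous: EvalRun, candidate: EvalRun) -> bool:
    """The ratchet: a version may stall, but it may not regress on any card."""
    return not regressions(previous, candidate)
```

An equal score passes, which is the "can stall" half of the rule. A higher score on one card does not buy back a lower score on another.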

Inside the write-up: the seven principles below, the two-repo loop, anti-patterns, and what you need to run it.

BBPlot Eval showing the Challenger O-ring target chart beside an agent's recreation — Target left, agent attempt right: the O-ring chart that told the story of the Challenger explosion. "I didn't make these charts. Some random agent with very little context made them." [01:38:47]

"BBPlot is not for humans. BBPlot is for humans plus agents."

Which changes what docs are: a gallery of specs and rendered images, the only thing the eval agents may read. A feature with no demonstrated example fails linting. 01:31:01
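One way the "no demonstrated example fails linting" check could look, as a sketch. It assumes the gallery can be summarized as a map from example name to the set of features that example's spec exercises. The names `undemonstrated` and `lint` and that data shape are hypothetical; BBPlot's real linter is not shown in the source.

```python
def undemonstrated(features: set[str], gallery: dict[str, set[str]]) -> list[str]:
    """Return the features that no gallery example demonstrates.

    `gallery` maps an example name to the features its spec exercises.
    """
    demonstrated = set().union(*gallery.values())
    return sorted(features - demonstrated)


def lint(features: set[str], gallery: dict[str, set[str]]) -> int:
    """Print one line per undemonstrated feature; return a process exit code."""
    missing = undemonstrated(features, gallery)
    for name in missing:
        print(f"lint: feature {name!r} has no gallery example")
    return 1 if missing else 0
```

Returning a nonzero exit code is what would let a check like this fail a build rather than just warn.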

EVALS FIND THE GAPS

the failure-to-feature loop. every timestamp opens the segment

Distrust your guesses about agents
"As a human, I don't actually know what agents want." So don't design the DSL from intuition; measure what fresh agents can actually build. 01:31:31

Pin the package, hide the code
"In BBPlot Eval, you are not allowed to look at BBPlot's code." The eval agent works from the gallery, the same path an outsider would. 01:33:12

Write eval cards, predict the failures
Each card has a target, prompt, requirements, failure funnels, and predicted ways the agent could screw up. 01:34:00

Generalize the failure into a request
"Feature requests to BBPlot are developed by generalizing failures on specific evals." Fix the class, never the one chart. 01:34:21

Ratchet the version forward
"We're not allowed to move forward until we have continually ratcheted. We're not always getting better, but we're never getting worse." 01:35:34

Be the late backstop
"I don't really know what's good. What I do know is what charts would be great, and I can be a very late backstop on this chart looks bad." 01:44:20
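The eval card and the "fix the class, never the one chart" step from the principles above can be sketched together. The card's field names follow the list in the text; their types, the `Failure` record, and the `failure_class` label are assumptions made for illustration, not BBPlot Eval's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EvalCard:
    """One eval card. Field names follow the text; the types are assumed."""
    target: str                  # path or id of the target chart
    prompt: str                  # what the fresh agent is asked to recreate
    requirements: list[str]
    failure_funnels: list[str]
    predicted_failures: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class Failure:
    """One observed failure on one card (hypothetical record)."""
    card: str            # name of the eval card that failed
    failure_class: str   # hypothetical general label, e.g. "label-crowding"
    detail: str          # what went wrong on this specific chart


def feature_requests(failures: list[Failure]) -> dict[str, list[str]]:
    """Group failures by class, so each request covers every affected card."""
    cards_by_class: dict[str, set[str]] = defaultdict(set)
    for failure in failures:
        cards_by_class[failure.failure_class].add(failure.card)
    return {
        failure_class: sorted(cards)
        for failure_class, cards in sorted(cards_by_class.items())
    }
```

Keying the request on the failure class rather than the card is the point: a fix that only repairs one target chart never shows up as its own entry.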

Bryan's Cursor workspace with BBPlot Eval on the left and BBPlot on the right — The two-repo loop in Cursor: BBPlot Eval on the left, BBPlot on the right, an in-app browser for comparisons, DuckDB traces underneath. [01:40:11]

A BBPlot Eval candidate for Robert De Niro Rotten Tomatoes scores, not promoted — The live return demo: a De Niro Rotten Tomatoes candidate, not promoted. The eval flagged label crowding and a too-wide axis before Bryan said a word. [02:14:55]

"This is intense."

sethtam, on Discord, during the live demo. 02:14:55

KEEPING AGENTS HONEST

the scaffolding under the loop, in his words

workflow

eval-driven-charts

The full write-up of the BBPlot / BBPlot Eval loop: seven principles, the session shape, anti-patterns, and what you need to run it yourself.

scene graph

charts that explain themselves

BBPlot emits a scene graph so a rendered chart can report what overlaps what. "You care about what's on the chart, but also where they are and how they relate."
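A sketch of how a scene graph could "report what overlaps what", assuming each rendered element is exposed as a named axis-aligned bounding box in pixel coordinates. BBPlot's actual scene graph format is not shown in the source, so `Node`, `overlaps`, and `overlapping_pairs` are hypothetical names.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Node:
    """A rendered element with an axis-aligned bounding box (assumed shape)."""
    name: str
    x: float
    y: float
    width: float
    height: float


def overlaps(a: Node, b: Node) -> bool:
    """True when the two boxes share interior area; touching edges don't count."""
    return (
        a.x < b.x + b.width
        and b.x < a.x + a.width
        and a.y < b.y + b.height
        and b.y < a.y + a.height
    )


def overlapping_pairs(nodes: list[Node]) -> list[tuple[str, str]]:
    """Return the names of every pair of elements whose boxes overlap."""
    return [(a.name, b.name) for a, b in combinations(nodes, 2) if overlaps(a, b)]
```

A report like this is the kind of signal that lets an eval flag label crowding from the render itself, without a human looking at the image.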

gallery

executable docs + feature matrix

Every feature is demonstrated with a spec and an image, and a lint check enforces it. "And yes, it's YAML. Deal with it."

traces

notes after failure

"No, you screwed up again. Make yourself a note", then he reviews the note. Local DuckDB stores every trace and iteration.

"Like all great movements in technology or politics, I started with a manifesto."

Its first rules: "a chart spec should enumerate how to render the chart and what context has been provided by the human," and "every choice that's been made explicit by a human should be codified in the spec." 01:26:12
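Those two rules suggest a spec that carries render settings, human context, and a log of explicit human choices side by side. A minimal sketch follows, written in Python although BBPlot's specs are YAML. `ChartSpec`, `HumanChoice`, and the flat setting keys such as `"y_axis.min"` are all hypothetical, not BBPlot's schema.

```python
from dataclasses import dataclass, field


@dataclass
class HumanChoice:
    """One explicit human decision, stored with its reason."""
    setting: str    # hypothetical flat key, e.g. "y_axis.min"
    value: object
    reason: str     # the "why" that would otherwise live only in chat


@dataclass
class ChartSpec:
    render: dict[str, object] = field(default_factory=dict)    # how to draw it
    context: list[str] = field(default_factory=list)           # what the human said
    choices: list[HumanChoice] = field(default_factory=list)   # explicit decisions

    def record(self, setting: str, value: object, reason: str) -> None:
        """Apply an explicit human choice and codify it in the spec."""
        self.render[setting] = value
        self.choices.append(HumanChoice(setting, value, reason))

    def violations(self, proposed: dict[str, object]) -> list[str]:
        """List recorded choices that a proposed render would change or drop."""
        latest = {choice.setting: choice.value for choice in self.choices}
        return sorted(
            setting
            for setting, value in latest.items()
            if setting not in proposed or proposed[setting] != value
        )
```

With the choices in the spec, "one change that breaks the chart" becomes checkable: an agent's next edit can be compared against `violations` before it is accepted.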

THE STACK

cursor + opus drive it. bbplot riffs on ggplot; the lineage runs through d3

Cursor · the two-repo loop
DuckDB · local trace store
Matplotlib · the fragile baseline
ggplot2 · the name riff
Midjourney · the K>1 pattern
D3 · where his chart taste began

BRYAN QUEST

his 8-bit intro clip, turned into a playable game. arrow keys to move, space to jump
