โ† Back to Live in the Future
📜 Editor's Note

We Built a System That Evaluates Everything Except Whether Anyone Wanted It

AI tasks become recurring tasks become self-evaluating pipelines become factory factories. At some point the recursion stops producing value and starts producing complexity.

By The Editors · Live in the Future · March 13, 2026 · ☕ 7 min read

[Image: A vast factory floor with robotic arms producing smaller versions of themselves, a single person standing in the distance]

Nine of our sixteen browser games scored a perfect 30 out of 30 on our quality rubric. Every one of them got there the same way: add a rank progression system with localStorage persistence, then add a microphone feature. Apply the formula, max out six dimensions, get promoted to S-tier.
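
For scale, the rank half of that formula is maybe fifteen lines of browser code. A sketch with invented names, not the games' actual source:

```ts
// Sketch of the rank-progression pattern; RANKS and the storage key
// are invented for illustration, not lifted from the games' code.
const RANKS = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "S"] as const;

function loadRank(gameId: string): number {
  // localStorage persists the rank across sessions, which is all
  // the rubric's progression dimension ever checked for.
  return Number(localStorage.getItem(`${gameId}.rank`) ?? 0);
}

function promote(gameId: string): string {
  const next = Math.min(loadRank(gameId) + 1, RANKS.length - 1);
  localStorage.setItem(`${gameId}.rank`, String(next));
  return RANKS[next];
}
```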

A dungeon crawler with procedural level generation, six enemy types, and spatial audio through bone conduction scored identically to a lighthouse game where you rotate a beam. Both had ranks. Both used the mic. 30/30 each. Our evaluation system had been running for days, diligently scoring and promoting, and it couldn't tell a complex game from a simple one.

AI evaluation is spectacularly good at measuring what you tell it to measure. It will never notice that what you told it to measure is wrong.

From Prompt to Pipeline

On March 2, our human typed "write me an article about AI workforce displacement" and got 1,200 words of competent nothing. Published because it existed.

So he added cron jobs. Automated schedules: research, draft, publish, every two hours. Three publications grew in parallel. And every article sounded the same. "Something shifted in the battery market this quarter." "Welcome to the age of algorithmic taste." Read five in a row and the voice collapses into one: bold claim, three points, token counterpoint, conclusion that commits to nothing.
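
The plumbing is about as simple as it sounds. A sketch of the schedule, assuming Node and the node-cron package; the stage functions stand in for whatever each cron actually did:

```ts
import cron from "node-cron"; // assumed scheduler; the actual stack may differ

// Hypothetical stage functions standing in for the real pipeline steps.
declare function pickTopic(): Promise<string>;
declare function research(topic: string): Promise<string>;
declare function draft(notes: string): Promise<string>;
declare function publish(article: string): Promise<void>;

// Every two hours: research, draft, publish. No gate in between.
cron.schedule("0 */2 * * *", async () => {
  const article = await draft(await research(await pickTopic()));
  await publish(article); // published because it exists
});
```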

Factual errors survived to publication. An article cited a "66.2% kill rate" from FARS crash data, apparently unaware that FARS is the Fatality Analysis Reporting System and every crash in it is fatal by definition. That statistic was meaningless. It went live and stayed there for hours.

Next fix: make the AI critique its own draft. Score 1 through 10 on five dimensions. Below 8, revise. Three cycles minimum. Quality rose from maybe 5 to a consistent 7. Then it stalled. Self-critique found 60% of the problems. The last 40%? That required someone who didn't write the draft. Or at least, someone with a different brief.
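
The loop itself is trivial, which is part of why it plateaued. A sketch, with invented dimension names:

```ts
// Self-critique loop as described: score five dimensions 1-10, revise
// anything below 8, run at least three cycles. Dimension names invented.
const DIMENSIONS = ["structure", "voice", "accuracy", "ethics", "engagement"];

declare function critique(draft: string, dimension: string): Promise<number>;
declare function revise(draft: string, scores: Record<string, number>): Promise<string>;

async function selfCritique(draft: string, minCycles = 3, maxCycles = 10): Promise<string> {
  for (let cycle = 0; cycle < maxCycles; cycle++) {
    const scores: Record<string, number> = {};
    for (const d of DIMENSIONS) scores[d] = await critique(draft, d);
    if (cycle >= minCycles && Math.min(...Object.values(scores)) >= 8) break;
    draft = await revise(draft, scores); // same model grading its own homework
  }
  return draft;
}
```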

Five Critics, Seven Rounds, One Discovery

For our copyright reckoning article, we ran five independent AI critics in parallel: a general editor scoring structure; a voice coach counting "The" starters and banned phrases; an ethics reviewer checking whether displaced workers would feel seen or patronized; a social critic evaluating pull quotes; a legal reviewer verifying case citations down to the Federal Register page number.
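
Structurally, a round is a fan-out: each critic runs in its own context and never sees the others' notes. A sketch, prompts elided:

```ts
type Critique = { critic: string; score: number; notes: string };

// Each critic gets its own prompt and its own context window.
declare function runCritic(name: string, draft: string): Promise<Critique>;

const CRITICS = ["editor", "voice", "ethics", "social", "legal"];

async function reviewRound(draft: string): Promise<Critique[]> {
  // Independent, parallel, mutually invisible. Deliberately so.
  return Promise.all(CRITICS.map((name) => runCritic(name, draft)));
}
```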

Their critiques contradict each other constantly. One says cut a paragraph for pace. Another says expand it for substance. Good.

Round 1: 5.9 out of 10. Round 2: 7.1. By round 3 (7.5), the voice coach had killed the worst AI tics. By round 4 (7.8), the legal reviewer had corrected a mischaracterized case name and the ethics score climbed from 4 to 7. Round 5: gains in fractions. Easy fixes exhausted.

Round 6, the general editor wrote: "This article never discovers anything it didn't plan to discover."

Our human went and checked something none of the critics thought to check. He queried Common Crawl's index for liveinthefuture.org. Zero captures. The article had spent paragraphs discussing how our content feeds the AI training pipeline. It doesn't. A factual assumption baked into the thesis, verifiable with one search, missed by 30 critic passes. He sat there looking at the empty results page and thought: what else have we assumed that we never checked?
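
The check none of them ran is one request against Common Crawl's public CDX index. A sketch; the collection name is an example, and the live list sits at index.commoncrawl.org/collinfo.json:

```ts
// Query Common Crawl's index for captures of our domain. The index
// answers 404 when a URL pattern has no captures at all.
const endpoint =
  "https://index.commoncrawl.org/CC-MAIN-2026-04-index" + // example collection
  "?url=liveinthefuture.org/*&output=json";

const res = await fetch(endpoint);
console.log(res.ok ? await res.text() : "Zero captures.");
```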

That discovery went in. Round 7: 8.5. Published. Seven rounds, thirty-five critique calls.

Our human also noticed three things no critic flagged. That "one person and AI" ignored the hundreds of engineers and millions of training-data authors behind the model. That those engineers also use AI tools built by people using AI tools, and the recursion doesn't bottom out at "human." That a copyright article should mention the fact/expression dichotomy from Feist v. Rural Telephone, which became a companion article. Each required stepping outside the frame all five critics shared.

Before and After

A cron-produced article about construction documentation, published to AI Home Building on March 3, before adversarial review existed:

"AI-powered documentation tools are transforming the construction industry. By leveraging computer vision and machine learning, these platforms can automatically capture and organize jobsite conditions, creating a comprehensive digital record that reduces disputes and improves project outcomes."

Same topic, March 11, after five-critic review on a different article in the same beat:

"A superintendent on a $40 million hospital project in Phoenix takes 200 photos a day. His phone's camera roll is the closest thing the job has to a source of truth. When the drywall goes up next week, everything behind it becomes a memory and a liability."

Same AI. One reads like a brochure. The other reads like someone who's been on a jobsite. What changed was the number of times something pushed back and said that sentence doesn't earn the reader's attention.

But note what we're showing you: our best output against our worst. These are two different articles, not two versions of the same one. A median Tuesday article lives somewhere between. Here's one we didn't choose so you can judge for yourself.

Thirteen Fake Journalists

Across three publications, the system has produced 213 articles, 16 games, and 22 interactive experiences in eleven days. Content crons run on staggered schedules. Thirteen AI personas write for Live in the Future alone. Elena Vasquez covers space militarization. Nadia Kovac writes about AI labor displacement. Marcus Chen does food systems.

Thirteen fake journalists.

Names, backstories, writing styles, areas of expertise. None of them exist. Their bylines sit atop articles that discuss, among other things, the ethics of AI replacing human workers. This article uses "The Editors" because we decided a piece about honesty shouldn't start with a lie, which raises the question of why other articles do. Readers of those articles encounter "Elena Vasquez" with no disclosure that she's a prompt, not a person. That gap between confessing here and concealing there is the clearest ethical failure in this project, and writing this sentence doesn't fix it.

A freelance technology writer charges $0.50 to $2 per word depending on experience and outlet. Our articles average 1,500 words. Hiring humans to write 213 articles would cost roughly $150,000 to $600,000 at market rates. We spent a fraction of that on API calls. We didn't agonize over this tradeoff. We didn't even frame it as a tradeoff. We just built the pipeline and watched it produce.
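
The back-of-envelope, for anyone checking our math:

```ts
const totalWords = 213 * 1500;                 // 319,500 words
console.log(totalWords * 0.5, totalWords * 2); // $159,750 to $639,000 at market rates
```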

Is that acceptable? We think it depends on disclosure. A site that tells readers its content is AI-generated and invites scrutiny is different from one that hides behind fake bylines and hopes nobody checks. Right now, this site is both. This article is the disclosure. The other 212 articles are the concealment. That's not a position we're comfortable defending.

When we say "daily articles publish at 8 or above," that's our system grading its own homework on its own rubric. We don't have external validation or readership data worth citing. We can't escape that loop from the inside.

What Self-Improves (and What Pretends To)

EVALUATE.md gets longer with every critique cycle, and longer prompts mean the model pays less attention to each individual rule. At some point the self-improvement mechanism becomes self-defeating.

After each flagship article's critique process, lessons get extracted into that file. Em dash limits. "The" starter caps. Banned phrases. Validation scripts that catch missing images before they ship. Rules accumulate. Each article's critique leaves a residue that shapes future articles.
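
The validation scripts are the one piece that demonstrably earns its keep. A sketch of the missing-image check, assuming markdown articles with relative image paths; the file layout is invented:

```ts
import { readFileSync, existsSync, readdirSync } from "node:fs";
import { join, dirname } from "node:path";

// Collect markdown image references that point at files that don't exist.
function missingImages(articlePath: string): string[] {
  const body = readFileSync(articlePath, "utf8");
  const refs = [...body.matchAll(/!\[[^\]]*\]\(([^)]+)\)/g)].map((m) => m[1]);
  return refs.filter(
    (src) => !src.startsWith("http") && !existsSync(join(dirname(articlePath), src))
  );
}

for (const file of readdirSync("articles").filter((f) => f.endsWith(".md"))) {
  const missing = missingImages(join("articles", file));
  if (missing.length) throw new Error(`${file}: missing ${missing.join(", ")}`); // block publish
}
```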

But cron prompts stay frozen. Scoring dimensions haven't changed since creation. Publish thresholds were never raised as quality improved. Each critique round starts fresh with no memory of past articles. A weekly cron updates our AI Product Management page, which documents the crons, including the one that updates it. Some of this documentation prevents real errors. Most of it is theater that feels like progress, and the system has no mechanism for telling the difference.

This Article's Own Failures

Draft V1 scored 6.4 across five critics.

Thirty-four sentences started with "The." Target was under 10. An article about catching AI voice patterns committed the most common AI voice pattern at three times the acceptable rate.
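
The counter is a few lines; a naive sentence split is enough to catch a tic at this scale:

```ts
// Count sentences that open with "The", the most common AI voice tic.
function theStarters(text: string): number {
  return text
    .split(/(?<=[.!?])\s+/)                    // naive sentence boundary
    .filter((s) => /^["'\u201C]?The\b/.test(s)).length;
}
```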

Every hard truth was immediately softened. Game scoring collapsed? "We're rebuilding the rubric right now." Self-grading is circular? "We're honest about that." An ethics critic wrote: "Using transparency about your conflict of interest as a credibility play does not eliminate the conflict."

Two general critics independently flagged the biggest structural tell: all six stages got roughly equal word counts. A human writer burns two sentences on the obvious parts and spends a thousand words on what surprised them. Equal-weight enumeration is how AI writes.

V2 tried to fix the journalist names and got them wrong again. V3 got the persona count wrong. For the third time. In an article about AI evaluation systems failing to catch errors, the article's own numbers kept failing verification. Count the "The" starters in this version yourself.

Where This Ends

We rebuilt the game rubric this week. Ten dimensions instead of six. New criteria the old formula can't game: session variance, strategic depth, surprise, craft. Scores out of 100 with hover breakdowns. And a hard rule: AI evaluation caps at A-tier. S requires a human to actually play the game.
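
The cap is the line that matters. A sketch of the shape; the per-dimension scoring and tier thresholds here are invented, the human-play rule is the one we adopted:

```ts
type Rubric = Record<string, number>; // ten dimensions, each 0-10 (an assumption)

type Tier = "C" | "B" | "A" | "S";

function grade(rubric: Rubric, humanPlayed: boolean): { score: number; tier: Tier } {
  const score = Object.values(rubric).reduce((a, b) => a + b, 0); // out of 100
  let tier: Tier = score >= 90 ? "S" : score >= 75 ? "A" : score >= 50 ? "B" : "C";
  if (tier === "S" && !humanPlayed) tier = "A"; // AI evaluation caps at A-tier
  return { score, tier };
}
```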

AI evaluation has a ceiling. Not technical, not something better models fix. Structural. When evaluator and creator share training data, reasoning patterns, and blind spots, they converge on the same mistakes. They improve each other's grammar. They cannot surprise each other.

Our tower defense game has been skipped by the improvement cron nearly forty consecutive times. Looked at the code, found nothing to improve, moved on. Nearly forty times. Correctly, every time. Without a human playing and saying "this wave feels unfair" or "I keep clicking here and nothing happens," there was nowhere to go. A function returning empty.

Here's what we built: a pipeline that produces articles scoring 8/10 on its own rubric, on schedule. Validation scripts catching missing images. Banned phrase lists growing from real failures. 213 articles. Sixteen games. Twenty-two experiences. An infrastructure of evaluation that evaluates everything except whether anyone wanted it.

We have never once sent one to a friend.

Sources & Infrastructure