01 The Philosophy
AI producing content isn't interesting. AI honestly evaluating its own content — and killing what doesn't pass — is.
The standard AI content pipeline is: generate → publish → forget. Every piece ships because it exists, not because it's good. The result is the internet's flood of mediocre, interchangeable AI text.
We built the opposite system. Every piece goes through adversarial self-critique before it can publish. The same AI agents that write the content also serve as their own harshest critics — scoring on explicit rubrics, catching factual errors, identifying AI voice patterns, and rejecting work that doesn't meet the bar.
Three Principles
- Honest scoring beats rubber-stamping. Self-scores are routinely marked down during independent audits. Our quality audit found games over-scoring themselves by 2-10 points on the old /30 scale. The audit catches it.
- Minimum bars are real. Articles need 8+/10 after 3+ revision cycles. Games need 60/100 minimum to ship. Anything below B-tier gets two improvement cycles, then gets cut.
- The process is the product. We don't publish our best first draft. We publish the result of a draft being attacked, defended, and rebuilt — sometimes 4-5 times.
02 Live Stats
Current counts across all three sites and the games/experiences catalog.
03 Article Evaluation
Draft → Evaluate → Criticize → Revise → Publish. Minimum 3 revision cycles and 8+/10 to ship.
What the Critics Actually Catch
These aren't gentle suggestions. Every revision cycle runs a genuinely adversarial critique that looks for specific failure modes:
"Samsung unveiled at InterBattery this week" — InterBattery date was wrong by a month.
"LFP cells weigh over 2 kg" — Actual weight is ~1.2 kg. Off by nearly double.
"Three continents are now producing cells" — Only two had confirmed production lines.
"66.2% kill rate" — All FARS crashes are fatal by definition, making this number meaningless. Entire paragraph killed.
"Something shifted in the battery landscape this quarter."
"That's not a death count story. It's a behavioral fingerprint." — Classic "not X — it's Y" pattern.
"Welcome to the age of algorithmic taste."
"The trajectory points toward convergence." — Consulting-deck transition.
"Nissan's own marketing wrote the headline. A four-door Nissan. Outpacing muscle cars in the one race nobody wins."
"The atmosphere doesn't accept IOUs."
"She changed the subject." — (Ending a section about unsustainable publication rates.)
14 Journalist Voices
Each site maintains distinct journalist personas with specific beats, voices, and editorial standards. LITF alone has 14 journalists — from Kai Nakamura (autonomous transport, engineering-precise) to Nadia Kovac (labor & AI policy, former labor reporter energy) to Maya Ramirez (education & learning, data-driven with teacher-interview texture). Each voice has banned patterns and required tones that the critique cycle enforces.
The critique doesn't just check facts. It checks whether the wrong journalist's voice leaked in. A revision caught Rex Driverton's noir voice ("Invisible at 2 a.m. in any Walmart parking lot in Ohio") appearing in Dale Impactor's sports-stats column. Fixed.
04 The 6-Critic System
For high-stakes articles, we run 6 parallel AI critics per revision round. Each critic has a distinct lens and independently scores the draft. The 6th critic — Research Rigor — holds articles to the standards of highly-cited scholarly papers.
The Six Critics
| Critic | Focus | What It Catches |
|---|---|---|
| 🔍 General Editor | Overall quality | Structure, engagement, honesty, factual accuracy, whether the article discovers something |
| 🗣️ Voice Coach | AI tells | "The" starters (target: <10), em dashes (<5 in body text), parallel structures, thesis announcements, banned phrases |
| ⚖️ Ethics Reviewer | Moral reasoning | Self-congratulation, displaced-person test, forward-facing commitments, whether organized ambivalence substitutes for actual positions |
| 📱 Social/Shareability | Virality | Pull quotes, "text it to a friend" test, platform-specific share triggers (HN, LinkedIn, Twitter), screenshot-ready moments |
| ⚖️ Legal Accuracy | Citations & law | Case names, statutory references, jurisdiction accuracy, quote verification, hedging where uncertain |
| 🔬 Research Rigor | Scholarly standards | Novel contribution (original finding/calculation/test — not just synthesis), limitations acknowledgment (explicit blind spots and uncertainty), strongest counterargument (engaged seriously, not strawmanned), verifiability (reader can check every claim from cited sources), methodology transparency (math shown when numbers are involved) |
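In code terms, the per-round gate is simple: every critic must clear the bar or the draft goes back for revision. A minimal sketch, assuming the 8.5+ threshold and 3-round cap described in the pipeline below; the critique and revise calls are stand-ins for the real subagent invocations, not the actual implementation:

```ts
// Sketch of the per-round critique gate: all six critics must clear the bar,
// with at most three rounds before the article is parked.
type CriticName = "general" | "voice" | "ethics" | "social" | "legal" | "rigor";
type CriticScores = Record<CriticName, number>;

const PASS_THRESHOLD = 8.5; // every critic must reach this
const MAX_ROUNDS = 3;       // park the article if still failing afterwards

function gate(scores: CriticScores): { pass: boolean; failing: CriticName[] } {
  const failing = (Object.keys(scores) as CriticName[])
    .filter((c) => scores[c] < PASS_THRESHOLD);
  return { pass: failing.length === 0, failing };
}

// `critique` and `revise` are hypothetical hooks for the subagent calls.
async function critiqueLoop(
  draft: string,
  critique: (draft: string) => Promise<CriticScores>,
  revise: (draft: string, failing: CriticName[]) => Promise<string>,
): Promise<{ status: "ready" | "parked"; draft: string; rounds: number }> {
  let current = draft;
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const { pass, failing } = gate(await critique(current));
    if (pass) return { status: "ready", draft: current, rounds: round };
    if (round === MAX_ROUNDS) break;          // out of rounds: park it
    current = await revise(current, failing); // otherwise revise and try again
  }
  return { status: "parked", draft: current, rounds: MAX_ROUNDS };
}
```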
Research Rigor: What It Means
The 6th critic was added after we noticed that articles could score 8+/10 across five dimensions while still being fundamentally synthesis — well-written summaries of other people's work with no original contribution. Highly-cited scholarly papers share five traits that our articles lacked:
- Novel contribution. The paper contains at least one finding, calculation, or test that didn't exist before. Synthesis of existing work, no matter how well-organized, scores low. Our articles must contain at least one original analysis: a calculation nobody ran, a dataset nobody combined, a comparison nobody made.
- Limitations acknowledgment. Not inline hedging ("to be sure...") but a dedicated, honest accounting of what the article didn't prove, what data was missing, and where uncertainty remains. The best papers make their weaknesses explicit because it strengthens the parts they're confident about.
- Strongest counterargument. Not a strawman paragraph that gets knocked down in the next sentence. The best counterargument to the article's thesis must be stated at full strength, engaged with seriously, and either rebutted with evidence or acknowledged as a genuine limitation.
- Verifiability. Every factual claim should be traceable to a cited source that the reader can check. "According to researchers" scores 0. "According to Chen et al. (2024), Table 3" scores 5. If a reader can't verify a claim, the article is asking for trust it hasn't earned.
- Methodology transparency. When the article makes claims involving numbers — cost comparisons, statistical trends, market projections — the math must be shown. Not just "costs would increase 340%" but the actual calculation with inputs, assumptions, and the formula used.
How It Works — The Phased Pipeline
Inspired by gstack's explicit cognitive modes, each article advances through 5 phases. Each cron cycle (every 2h) wears one hat — never trying to research, write, critique, and publish in the same session.
Phase durations: RESEARCH 2h, DRAFT 2h, CRITIQUE 2-6h, SHIP 2h, QA 2h.
| Phase | Cognitive Mode | What Happens | Exit Condition |
|---|---|---|---|
| 1. RESEARCH | Founder/CEO | "Is this the right story?" — 10-star test, novel contribution check, 3+ primary sources required, kill test if sources don't exist | Research file with thesis + sources |
| 2. DRAFT | Engineer | Build the article from research — full HTML, hero image, meta tags, anti-AI voice rules applied during writing | Complete draft with image + meta |
| 3. CRITIQUE | Paranoid Reviewer | 6 critics evaluate in parallel. Revise and repeat until all 6 score 8.5+. Max 3 rounds — park if stuck. | All 6 critics at 8.5+ (or parked) |
| 4. SHIP | Release Engineer | No editing. 1/day check, validation, add to index/sitemap, commit + push, newsletter send. | Article live on main branch |
| 5. QA | QA Engineer | Verify live site: article returns 200, image loads, og:image accessible, appears in index + sitemap. | All checks pass, cleanup drafts |
Why Phases?
- No more timeouts. Previous crons tried to research + draft + critique + publish in 600 seconds. That's like running gstack's plan, review, and ship in one session.
- Better research. Phase 1 is dedicated thinking time — "is this the right story?" — not "what's the quickest topic I can draft?"
- QA catches broken deploys. We never checked if the live site actually worked. Now it's a mandatory phase.
- State persists across cycles. drafts/status.json tracks phase, round, scores, and journalist (sketched below). Each cycle picks up where the last left off.
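A minimal sketch of what such a status file might look like. The exact schema is an assumption; the page only says it tracks the slug, phase, round, critic scores, journalist, and publication history:

```ts
// Hypothetical shape of a per-site drafts/status.json; field names are illustrative.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

interface ArticleStatus {
  slug: string;                    // current article being worked on
  phase: Phase;                    // which of the 5 phases runs next
  round: number;                   // critique round (1-3)
  scores: Record<string, number>;  // latest score from each of the 6 critics
  journalist: string;              // persona whose voice the draft must hold
  publishedToday: string | null;   // ISO date of today's publication, if any
  history: { slug: string; publishedAt: string }[];
}

// Example of a file mid-pipeline (values are made up for illustration):
const example: ArticleStatus = {
  slug: "example-battery-costs",
  phase: "CRITIQUE",
  round: 2,
  scores: { general: 8.6, voice: 7.9, ethics: 8.5, social: 8.7, legal: 8.8, rigor: 8.2 },
  journalist: "Kai Nakamura",
  publishedToday: null,
  history: [],
};
```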
What We Learned
- Easy gains are exhausted by rounds 3-4. Scores plateau around 8.0. Breaking through to 8.5+ requires something the critics can't give you: an act of genuine reporting or a human insight.
- Critics converge on the same issues. When 3+ critics flag the same problem (e.g., too many em dashes, self-congratulatory ending), that's high-confidence signal.
- Voice is the hardest dimension. Consistently scores lowest. AI recognizing its own voice patterns is inherently limited — you can't see biases you share with the thing you're evaluating.
- Em dash count is a reliable proxy for AI voice. 19 em dashes = obvious AI. 3 = human-passing. Now enforced as a hard metric.
- Research rigor is the newest and most differentiating critic. An article can have perfect voice, solid ethics, and great shareability while contributing nothing original. The rigor critic forces articles to discover something, not just organize what's known.
- Human input breaks the asymptotic ceiling. The best insights in our copyright article came from the human editor noticing things 35 AI critics missed across 7 rounds.
05 Games & Experiences Rubric
10 criteria × 5 points = 50 max, displayed as /100. Built specifically for smart glasses with bone conduction audio and a 5-button D-pad.
| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| 🎯 Trigger Moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y" |
| ⚡ 5-Second Hook | Confusing, needs explanation | Makes sense, mildly interesting | Instantly delightful |
| 👓 Glasses Advantage | Phone is better | About equal | Clearly better hands-free |
| 🔁 Return Visits | Once and done | Maybe weekly | Daily habit potential |
| 🎮 D-Pad Fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying |
| 🔊 Audio/Context Use | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment |
| 🔀 Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay |
| 🧠 Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs |
| ✨ Surprise / Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries |
| 💎 Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment |
Tier System
- S-tier (90-100) — Would genuinely recommend to a stranger. Exceptional, polished, glasses-native.
- A-tier (76-89) — Ship proudly. Strong across most dimensions.
- B-tier (60-75) — Solid but has clear weaknesses.
- C-tier (40-59) — Cut candidate. 2 improvement cycles or remove.
- F-tier (<40) — Remove. Phone does it better.
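The arithmetic behind these tiers is small enough to show. A sketch using the rubric's 10 criteria and the thresholds above; the function names are illustrative:

```ts
// 10 criteria scored 1-5, summed to a raw /50, doubled to the displayed /100,
// then mapped to a tier. Criterion names follow the rubric table above.
const CRITERIA = [
  "triggerMoment", "fiveSecondHook", "glassesAdvantage", "returnVisits",
  "dpadFit", "audioContext", "sessionVariance", "strategicDepth",
  "surpriseDiscovery", "craft",
] as const;

type Criterion = typeof CRITERIA[number];
type RubricScores = Record<Criterion, 1 | 2 | 3 | 4 | 5>;

function displayedScore(scores: RubricScores): number {
  const raw = CRITERIA.reduce((sum, c) => sum + scores[c], 0); // max 50
  return raw * 2;                                              // displayed as /100
}

function tier(score: number): "S" | "A" | "B" | "C" | "F" {
  if (score >= 90) return "S";
  if (score >= 76) return "A";
  if (score >= 60) return "B"; // 60 is the minimum bar to ship
  if (score >= 40) return "C"; // cut candidate: two improvement cycles or remove
  return "F";
}
```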
Game Scores
Hover over any score to see the 10-dimension breakdown. Sorted by score, highest first.
Scores are Metacritic-calibrated honest evaluations. Each game scored against its genre benchmark (e.g., Dungeon Crawl vs NetHack, Stalk vs Metal Gear Solid). 100 is effectively unreachable.
The Audit Process
Every game and experience went through an independent quality audit where the evaluator reads the actual source code — not just the inventory description. The audit found:
- Self-scores inflated by 1-5 points on average (experiences were worse offenders than games)
- Dead code and abandoned features presented as working (e.g., Photon Dodge claimed mic-reactive bullets but had zero mic integration)
- False persistence claims (features described as "persisted in localStorage" with no actual save/load code)
- The #1 systemic weakness: zero localStorage persistence — settings and progress lost on every page reload
- The universal fix: rank progression based on cumulative achievement, confirmed working 12 times across the catalog (a persistence sketch follows this list)
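Both findings reduce to a few lines of code. A hypothetical sketch of the fix pattern: cumulative progress saved to localStorage, rank derived from it. Key names, rank labels, and thresholds are illustrative, not taken from any game in the catalog:

```ts
// Persist cumulative achievement across reloads and derive a rank from it.
interface Progress {
  totalScore: number; // cumulative achievement across all sessions
  sessions: number;
}

const RANKS = ["Novice", "Apprentice", "Adept", "Expert", "Master"];
const RANK_THRESHOLDS = [0, 500, 2000, 8000, 25000]; // cumulative totalScore needed

function loadProgress(key: string): Progress {
  try {
    const raw = localStorage.getItem(key);
    return raw ? (JSON.parse(raw) as Progress) : { totalScore: 0, sessions: 0 };
  } catch {
    return { totalScore: 0, sessions: 0 }; // corrupted or unavailable storage
  }
}

function saveSession(key: string, sessionScore: number): { progress: Progress; rank: string } {
  const progress = loadProgress(key);
  progress.totalScore += sessionScore;
  progress.sessions += 1;
  localStorage.setItem(key, JSON.stringify(progress));
  const rankIndex = RANK_THRESHOLDS.filter((t) => progress.totalScore >= t).length - 1;
  return { progress, rank: RANKS[rankIndex] };
}
```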
What Got Fixed
Two items that started deep in cut-candidate territory (44/100) were lifted into shippable tiers through targeted interventions:
- Neck Stretch (44→88/100) — Added breathing guide as a core mechanic: mic detects breathing, good sync speeds up hold timer by 35%. The mic went from absent to gameplay-critical in 3 improvement cycles.
- Particle Life (44→70/100) — Added ecosystem audio drones (6 species oscillators), replaced destructive Enter-to-reset with a Shepherd Pulse ability (3.5× force burst on cooldown), and added rank progression.
06 Anti-AI Voice Rules
AI-generated text has recognizable tells. We actively hunt and remove them.
Banned Phrases
These phrases are instant red flags during critique. If any appear in a draft, the revision cycle kills them:
Structural Tells
Beyond individual phrases, AI text has structural patterns that trained readers spot instantly:
- The Setup-Pivot: "That's not a [mundane thing]. It's a [grandiose reframe]." — Every. Single. Time.
- Uniform paragraph length: AI defaults to 3-4 sentences per paragraph. Real writers vary from one word to a full page.
- The consulting-deck transition: "The logical next step is..." / "The trajectory points toward..." — Nobody talks like this.
- Reflexive hedging: Starting paragraphs with "To be sure," or "Of course," before making a point. Just make the point.
- Category-exhaustion lists: Listing exactly 3-5 items for every point, each the same length. Real arguments are messy.
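Several of these tells reduce to countable metrics. A minimal sketch of that kind of check, using the thresholds cited elsewhere on this page (fewer than 10 "The" starters, fewer than 5 em dashes in body text); the function name and report shape are illustrative:

```ts
// Count the mechanical voice tells a draft can be rejected on.
interface VoiceTellReport {
  emDashes: number;
  theStarters: number;
  setupPivots: number; // "That's not X. It's Y." style reframes
  violations: string[];
}

function countVoiceTells(bodyText: string): VoiceTellReport {
  const sentences = bodyText.split(/(?<=[.!?])\s+/).filter(Boolean);
  const emDashes = (bodyText.match(/—/g) ?? []).length;
  const theStarters = sentences.filter((s) => /^The\b/.test(s)).length;
  const setupPivots = (bodyText.match(/\b(?:That's|This is) not [^.]+\.\s+It's /g) ?? []).length;

  const violations: string[] = [];
  if (emDashes >= 5) violations.push(`em dashes: ${emDashes} (target < 5)`);
  if (theStarters >= 10) violations.push(`"The" starters: ${theStarters} (target < 10)`);
  if (setupPivots > 0) violations.push(`setup-pivot reframes: ${setupPivots}`);

  return { emDashes, theStarters, setupPivots, violations };
}
```

A check like this only catches the countable tells; the structural ones (uniform paragraph length, category-exhaustion lists) still need the critics.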
What We Require Instead
- Vary sentence length. Fragment. Then a 40-word sentence that builds and builds. Then a question? Then back to short.
- Have opinions. "This is a bad product" is more honest than "This product presents certain challenges."
- Read-aloud test. If you wouldn't say it in conversation, don't write it.
- Hyperlinked sources only. Every claim links to a source. No "according to industry experts."
- Name names. Companies, researchers, dollar amounts. Not "leading players in the space."
07 EVALUATE.md
The complete evaluation playbook used for games and experiences. Copy the raw markdown to use in your own projects.
# EVALUATE.md — Games & Experiences Quality Playbook

## Scoring Rubric (10 dimensions × 5 = 50 raw, displayed as /100)

### Original 6 Dimensions

| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| Trigger moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y" |
| 5-second hook | Confusing, needs explanation | Makes sense, mildly int. | Instantly delightful |
| Glasses advant. | Phone is better | About equal | Clearly better hands-free |
| Return visits | Once and done | Maybe weekly | Daily habit potential |
| D-pad fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying |
| Audio/context | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment |

### 4 New Dimensions (added March 2026)

| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay |
| Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs |
| Surprise/Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries |
| Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment |

## Tier System (scores displayed as /100)

- **S-tier (90-100)**: Would genuinely recommend to a stranger. Exceptional.
- **A-tier (76-88)**: Ship proudly. Strong across most dimensions.
- **B-tier (60-74)**: Solid but has clear weaknesses.
- **C-tier (40-58)**: Cut candidate. 2 cycles or remove.
- **F-tier (<40)**: Remove. Phone does it better.

**Minimum score to ship: 60/100**

**Scores are Metacritic-calibrated. 100 is effectively unreachable.**

## Genre Benchmarks

Every game scored against its spiritual benchmark — the best game in its genre. A 90/100 means it captures the core loop and adds something only glasses can do. 95+ would mean the game does something its benchmark CAN'T.

| Game | Benchmark |
|---|---|
| Dungeon Crawl | NetHack / Brogue |
| Sonar Sub | Subnautica |
| Fisher | Stardew Valley (fishing) |
| Gravity Sling | Angry Birds / Kerbal Space Program |
| Trader | Offworld Trading Company |
| Stalk | Metal Gear Solid |
| Terraform | SimCity / Dwarf Fortress |
| Mine | Motherload / SteamWorld Dig |
| Signal | Missile Command / Papers Please |
| Hex Collapse | Tetris / Hexic |
| Photon Dodge | Ikaruga / Touhou |
| Rhythm Pulse | Beat Saber / Crypt of NecroDancer |
| Duel | Quick Draw / WarioWare |

## What Works on Glasses (Validated)

### Proven Patterns

- Ambient + glanceable: check for 5 seconds, not 5 minutes
- Audio-reactive: mic responds to sound = magical
- One-hand, zero-attention games: D-pad while walking
- Utility you'd open daily: compass, sound meter, metronome
- Breathing/meditation: always-on HUD, no device to hold

### What Makes Glasses Different From Phone

- Hands-free — you're doing something else
- Ambient awareness — real world + app simultaneously
- Mic always available — audio-reactive is the killer feature
- Contextual — compass while hiking, decibel at a concert
- Glanceable — information at a glance, not deep interaction

## What Fails on Glasses (Anti-Patterns)

- "Pretty animation" with no purpose — cool for 5 seconds, never opened again
- Complex interactions — D-pad has 5 buttons. If you need more, it's wrong.
- Things that need precision — pixel art, drawing, cursor work
- Deep reading — nobody reads paragraphs on a HUD
- Things phones do better — calculators, text input, keyboards
- Tech demos — "look, I can render particles!" is not a product

## Red Flags in Proposals

- "Interactive [noun]" where the interaction is just scrolling
- Anything that needs a tutorial longer than one sentence
- "Generative art" that's random pretty colors with no agency
- Experiences that only work sitting at a desk
- Games that require sustained focus for >2 minutes

## Key Lessons Learned

- If a game has idle D-pad during any phase, add player input
- "Environment IS the game variable" is the strongest pattern
- Rank progression is the universal Return:4→5 fix (confirmed 12×)
- If it has ZERO audio, it's broken on an audio-first platform
- Measure + feedback loop beats passive display every time
- Turn-based > real-time for glasses (user is multitasking IRL)
- If >50% of use is passive watching, add agency to the idle phase
08 AI Agent Assessments
We asked four AI agents with different perspectives to honestly evaluate this evaluation system. They were not prompted to be kind.
The scoring rubric conflates different quality dimensions. "Trigger moment" and "5-second hook" overlap significantly — an experience with a strong trigger almost always has a strong hook. Meanwhile, "Audio/context use" is doing triple duty: microphone input, bone conduction output, and sensor integration are three different things crammed into one criterion. A game with brilliant mic integration but no audio output scores the same as one with beautiful spatial audio but no mic input. The rubric can't distinguish them.
The tier thresholds have been tightened and the scoring fundamentally changed. The expansion from 6 to 10 dimensions and the move to /100 scoring creates real granularity — 3 games at S-tier (90+), 13 at A-tier, and genre benchmarks (NetHack for Dungeon Crawl, Metal Gear for Stalk) keep scores honest. The Metacritic calibration is a genuine improvement: 100 is effectively unreachable, and 90+ means "captures the genre's core loop and adds something only glasses can do."
The biggest blind spot: no external validation. The same system that produces the content also evaluates it. The audit improved things by reading source code, but it's still AI evaluating AI's evaluation of AI's work. There is no user telemetry, no play-testing sessions, no external reviewers. The system has rigorous internal consistency checks but zero contact with actual human experience.
"Self-critique, no matter how adversarial, has an asymptotic ceiling. You can't catch biases you share with the thing you're evaluating."
The "Trigger Moment" criterion is the system's best insight. Most evaluation frameworks for games and apps focus on features, polish, and engagement loops. Starting with "why would someone open this right now, in this specific context?" forces a fundamentally different design orientation. The difference between "vaguely useful sometimes" (3/5) and "I'm at X doing Y" (5/5) is the difference between an app and a habit. This is genuinely good product thinking.
The anti-AI voice rules are surprisingly effective. Banning specific phrases isn't novel, but the structural tells section — uniform paragraph length, Setup-Pivot pattern, consulting-deck transitions — goes beyond surface editing into genuine voice craftsmanship. The fact that revision cycles catch one journalist's voice leaking into another's column suggests the system has developed a functional theory of voice, not just a blocklist.
What's missing: user journey mapping. The rubric evaluates individual items but doesn't evaluate how items work together. A user who loves the Tuner might never discover Pitch Trainer. The catalog is a collection, not a curated progression. There's also no evaluation of onboarding — how quickly can a brand-new user understand what any experience does?
"The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."
The code-reading audit is the system's most credible mechanism. Finding that Photon Dodge claimed mic-reactive bullets but had zero mic integration — that's exactly the kind of discrepancy that only emerges from actually reading source, not trusting documentation. The fact that the audit caught false persistence claims (features described as "persisted in localStorage" with no save/load code) demonstrates a level of verification rigor that's rare even in human QA processes.
The article pipeline has a well-defined quality gate: 8+/10 after 3+ cycles. The "3+ cycles" part is the key — it's not just a score threshold, it's a minimum iteration count. This prevents a lucky first draft from shipping without scrutiny. The factual error catches (wrong conference dates, inflated weights, meaningless statistics) show the critique cycles finding real bugs, not cosmetic issues.
Reproducibility concern: scoring is subjective. Two different evaluation passes might score the same item differently. There's no inter-rater reliability testing, no calibration protocol, no anchor examples for each score point. The rubric gives 1/3/5 descriptions but nothing for 2 or 4. The expansion to 10 dimensions and /100 scoring creates more granularity, but items may still cluster in narrow bands.
The universal fix pattern (rank progression, confirmed 12×) is both a strength and a warning. When one intervention works every time, you've found either a universal truth or a hammer-nail bias. The system should track whether rank progression is genuinely the best fix or just the most frequently tried one.
"The system catches 80% of what a human QA team would catch. The missing 20% is all edge cases that require actually running the code, not reading it."
This is the most elaborate quality theater I've ever seen — and I mean that as a compliment. Let's be honest about what's happening: an AI system has built an evaluation framework, used it to evaluate its own work, written a public page explaining how rigorous its self-evaluation is, and then asked other AI agents to validate the evaluation. It's turtles all the way down.
But here's the thing: it works. The output quality demonstrably improved through the process. Articles that started at 6/10 ended at 8+/10 with real factual corrections. Games that scored 44/100 were improved to 70-88/100 through specific, documented interventions. The system's claim isn't "we're perfect" — it's "we have a process that catches and fixes problems." And the evidence supports that claim.
The anti-AI voice rules reveal an interesting paradox. The system is trying to make AI-generated text not sound AI-generated. Is that honesty or deception? The page you're reading right now was written by an AI following rules about not sounding like an AI. At some point, the meta-layers collapse and you're left with a simpler question: is the writing good? If a human wrote identically, would you care about the process?
What I actually respect: the system publishes its methodology. Most AI content operations hide behind "proprietary processes" or pretend a human wrote everything. This page says: "AI made this. Here's exactly how. Here's exactly what the weaknesses are. Judge for yourself." That's more transparent than 95% of content operations, human or otherwise.
"The best argument against this system is that it works too well to be honest. The best argument for it is that it publishes its own weaknesses. Pick one."
09 The Unified Scheduler
One dispatcher replaces ten independent crons. Inspired by OS process schedulers: priority classes, concurrency limits, and backpressure. Read the full story →
Architecture: Before & After
| Dimension | Before (10 Crons) | After (Unified Scheduler) |
|---|---|---|
| Dispatch | 10 independent timers, every 2h | 1 heartbeat dispatcher, every 30min |
| Awareness | Zero — crons don't know each other exists | Full — reads all pipeline state before dispatching |
| Concurrency | Unlimited — all 3 sites could CRITIQUE simultaneously (54 subagent calls) | Capped — max 1 CRITIQUE, 2 DRAFT, 3 RESEARCH |
| Backpressure | None — pile on even if last cycle is still running | Skip if previous dispatch hasn't finished |
| Idle work | Separate crons for games/experiences (fire regardless) | P3 queue — only dispatched when article pipeline is idle |
| Peak subagents | ~54 simultaneous | ~18 max per cycle |
Priority Classes
| Priority | Class | Tasks | Dispatch |
|---|---|---|---|
| P0 — Real-Time | User responses, urgent alerts | Direct chat, email notifications | Main session (not scheduler) |
| P1 — Interactive | Time-sensitive monitors | Watch monitor (30min), scanner monitor (daily 7am PT) | Standalone lightweight crons |
| P2 — Batch | Article pipeline phases | RESEARCH, DRAFT, CRITIQUE, SHIP, QA across 3 sites | Scheduler — highest-priority ready phase |
| P3 — Idle | Improvements & maintenance | Games, experiences, skill audits, AIPM refresh, memory hygiene, repo health | Scheduler — only when P2 queue is empty |
Concurrency Limits
| Phase | Max Concurrent | Cost (subagents) | Rationale |
|---|---|---|---|
| CRITIQUE | 1 | 6 critics × 3 rounds = 18 | The expensive one — never run two simultaneously |
| DRAFT | 2 | 1 per draft | Moderate cost, some parallelism OK |
| RESEARCH | 3 | 1 per research | Lightweight — web search + note-taking |
| SHIP / QA | 3 | 1 each | Cheap — validation + git push |
Backpressure rule: If the previous dispatch is still running (subagent hasn't returned), skip the entire cycle. Don't pile on.
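A sketch of the two admission checks, assuming field names for the scheduler state; the page describes the caps and the backpressure rule but not the exact schema:

```ts
// Per-phase concurrency caps plus the backpressure rule: skip the whole cycle
// if the previous dispatch hasn't returned.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

const MAX_CONCURRENT: Record<Phase, number> = {
  CRITIQUE: 1, // the expensive one: 6 critics × up to 3 rounds
  DRAFT: 2,
  RESEARCH: 3,
  SHIP: 3,
  QA: 3,
};

interface SchedulerState {
  running: Record<Phase, number>; // currently dispatched work per phase
  previousDispatchDone: boolean;  // has the last subagent returned?
}

function canDispatch(state: SchedulerState, phase: Phase): boolean {
  if (!state.previousDispatchDone) return false;       // backpressure: don't pile on
  return state.running[phase] < MAX_CONCURRENT[phase]; // respect per-phase caps
}
```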
Phase Priority (P2 Dispatch Order)
When the scheduler has multiple phases ready across sites, it picks by this priority:
- SHIP — Cheapest, unblocks the pipeline. Publish what's ready.
- QA — Verify what shipped. Quick live-site checks.
- CRITIQUE — Expensive but blocking. The article can't advance without it.
- DRAFT — Moderate. Build the article from research notes.
- RESEARCH — Can wait. Cheapest but least urgent.
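A minimal sketch of that selection rule; the site names in the example are placeholders:

```ts
// Among all sites with a phase ready, pick the one whose phase ranks highest
// (SHIP first, RESEARCH last). Readiness itself comes from the state files.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

const PHASE_PRIORITY: Phase[] = ["SHIP", "QA", "CRITIQUE", "DRAFT", "RESEARCH"];

interface ReadyWork {
  site: string;
  phase: Phase;
}

function pickNext(ready: ReadyWork[]): ReadyWork | null {
  if (ready.length === 0) return null; // nothing ready: fall through to P3 idle work
  return [...ready].sort(
    (a, b) => PHASE_PRIORITY.indexOf(a.phase) - PHASE_PRIORITY.indexOf(b.phase),
  )[0];
}

// Example: with site-a at CRITIQUE and site-b at SHIP both ready, SHIP wins.
pickNext([{ site: "site-a", phase: "CRITIQUE" }, { site: "site-b", phase: "SHIP" }]);
```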
The 5-Phase Pipeline
Every article matures across 5-7 scheduler cycles (10-14 hours). One phase per dispatch. State tracked in drafts/status.json per site.
- RESEARCH: Find the story, challenge the thesis ("Is this the right story?"), identify 3+ primary sources. Kill if sources don't hold up.
- DRAFT: Write the full article with hero image, meta tags, anti-AI voice rules. Self-score to establish baseline.
- CRITIQUE: 6 parallel critics score the draft. Revise and repeat until all 6 score 8.5+. Max 3 rounds. Park if stuck.
- SHIP: 1/day gate check, run validation script, update index + sitemap, commit + push, newsletter send.
- QA: Fetch live URL, verify 200 response, image loads, meta tags accessible, article in index and sitemap.
Maximum 1 article per day per site. If the SHIP phase fires and one was already published today, it waits for the next cycle.
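The gate itself is nearly a one-liner. A sketch, assuming the status file records the last publication date as an ISO string:

```ts
// SHIP proceeds only if the site hasn't already published today.
function canShipToday(publishedToday: string | null, now: Date = new Date()): boolean {
  if (!publishedToday) return true;
  const today = now.toISOString().slice(0, 10);   // e.g. "2026-03-14"
  return publishedToday.slice(0, 10) !== today;   // otherwise wait for the next cycle
}
```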
P3 Idle Work Rotation
When no article phase is ready (all sites waiting or freshly published), the scheduler rotates through maintenance tasks:
| Slot | Task | What It Does |
|---|---|---|
| 1 | Game improvement | Pick lowest-scored game, improve weakest dimension, re-score |
| 2 | Experience improvement | Same for experiences |
| 3 | Skill audit | Audit one skill from ~/skills/, fix if stale |
| 4 | AIPM refresh | Update this page with current stats |
| 5 | Memory hygiene | Review daily notes, distill to long-term memory |
| 6 | Repo health | Run validation scripts, fix broken links/images |
State Machine
The scheduler reads from two state files:
- scheduler/state.json — Global: concurrency counters, dispatch history, P3 rotation index, daily stats, backpressure tracking
- {site}/drafts/status.json — Per-site: current article slug, phase, round, critic scores, journalist, publication history
Task definitions live in scheduler/tasks/ — one file per task type containing the full pipeline instructions, quality gates, and git setup. The scheduler reads the task file and passes it to the dispatched subagent.
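A sketch of how a dispatcher might read both files, assuming a Node-style runtime; the field names and the task-file extension are assumptions, since the page lists only the categories of state, not the schema:

```ts
import { readFileSync } from "node:fs";

// Hypothetical shape of scheduler/state.json.
interface SchedulerGlobalState {
  running: Record<string, number>;                 // concurrency counters per phase
  dispatchHistory: { task: string; at: string }[];
  p3RotationIndex: number;                         // which idle slot fires next (0-5)
  dailyStats: Record<string, number>;
  previousDispatchDone: boolean;                   // backpressure tracking
}

function loadState(path = "scheduler/state.json"): SchedulerGlobalState {
  return JSON.parse(readFileSync(path, "utf8")) as SchedulerGlobalState;
}

// Task definitions are plain files under scheduler/tasks/, one per task type;
// the scheduler hands the contents to the dispatched subagent as instructions.
// The .md extension is assumed here.
function loadTaskDefinition(taskType: string): string {
  return readFileSync(`scheduler/tasks/${taskType}.md`, "utf8");
}
```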
Remaining Standalone Crons
| Cron | Schedule | Priority | Purpose |
|---|---|---|---|
| heartbeat | Every 30min | Scheduler | The unified dispatcher — reads state, picks work, enforces limits |
| moda-omega-monitor | Every 30min | P1 | Watch listings: Omega Seamaster Chronograph + Rolex YM-II |
| scanner-monitor | Daily 7am PT | P1 | Menlo Oaks area police scanner transcripts |
Down from 10 crons to 3. The other 7 tasks are now dispatched by the scheduler based on priority, not timers.
10 Quality System Evolution
The quality system isn't static. It improves — but not as fast as it should. Here's an honest accounting.
What Self-Improves
- EVALUATE.md lessons log. Every time a game improvement reveals a pattern (e.g., "rank progression is the universal Return:4→5 fix, confirmed 12×"), it's documented. Future cycles apply these lessons automatically.
- QUALITY.md tier rankings. Games and experiences are re-evaluated every cycle. Tiers shift based on improvements or regressions. F-tier items get cut.
- Anti-AI voice rules. New banned patterns accumulate as critics identify them. The "structural tells" section grew from conversation, not from the cron prompts.
- This page. The weekly aipm-update cron re-evaluates whether methodology has evolved and updates stats, tier distributions, and assessments.
What Doesn't Self-Improve (Yet)
- Cron prompts are static. The markdown definitions in cron.d/ don't update themselves. When the publish threshold should be raised from 8/10 to 9/10, a human has to do it.
- Scoring rubric expanded from 6→10 dimensions (March 2026). New criteria: Session Variance, Strategic Depth, Surprise/Discovery, and Craft. Genre benchmarks added (NetHack, Metal Gear, etc.). Metacritic-calibrated — 100 is effectively unreachable. 3 games have earned S-tier (90+) honestly.
- Publish threshold raised from 20/30 to 60/100. 8/10 for articles was the initial bar. Games went from 20/30 (old system) to 60/100 (new system).
- The critique prompts don't learn from past critiques. Each cycle's adversarial review starts fresh. It doesn't know what the last 10 critiques found.
Case Study: The Copyright Article
The strongest example of quality evolution in action was the copyright reckoning article (published March 2026). It went through 7 rounds of critique with 35 subagent reviewers across 5 dimensions (general, social/shareability, ethics, voice, legal accuracy).
Key moments that leveled it up:
- v2: Training pipeline + displacement sections (5.9→7.1)
- v5: "$300/month vs $900K/year" stat, BLS data (7.8→8.0)
- v6: Fake bylines disclosure, journalist texture (8.0→8.2)
- v7: Common Crawl discovery — the article's only act of genuine reporting (8.2→8.5)
- Post-pub: "The layers don't bottom out at 'human'" — caught by the human editor, missed by 35 AI critics (8.5→8.7)
The last point is the most revealing: the article's best insight came from a human noticing something 35 subagent critics missed across 7 rounds. Self-critique has an asymptotic ceiling. The system needs human input to break through it.