01 The Philosophy
AI producing content isn't interesting. AI honestly evaluating its own content — and killing what doesn't pass — is.
The standard AI content pipeline is: generate → publish → forget. Every piece ships because it exists, not because it's good. The result is the internet's flood of mediocre, interchangeable AI text.
We built the opposite system. Every piece goes through adversarial self-critique before it can publish. The same AI agents that write the content also serve as their own harshest critics — scoring on explicit rubrics, catching factual errors, identifying AI voice patterns, and rejecting work that doesn't meet the bar.
Three Principles
- Honest scoring beats rubber-stamping. Self-scores are routinely marked down during independent audits. Our quality audit found games over-scoring themselves by 2-10 points on the old /30 scale. The audit catches it.
- Minimum bars are real. Articles need 8+/10 after 3+ revision cycles. Games need 60/100 minimum to ship. Anything below B-tier gets two improvement cycles, then gets cut.
- The process is the product. We don't publish our best first draft. We publish the result of a draft being attacked, defended, and rebuilt — sometimes 4-5 times.
02 Live Stats
Current counts across all three sites and the games/experiences catalog.
03 Article Evaluation
Draft → Evaluate → Criticize → Revise → Publish. Minimum 3 revision cycles and 8+/10 to ship.
What the Critics Actually Catch
These aren't gentle suggestions. Every revision cycle runs a genuinely adversarial critique that looks for specific failure modes:
"Samsung unveiled at InterBattery this week" — InterBattery date was wrong by a month.
"LFP cells weigh over 2 kg" — Actual weight is ~1.2 kg. Off by nearly double.
"Three continents are now producing cells" — Only two had confirmed production lines.
"66.2% kill rate" — All FARS crashes are fatal by definition, making this number meaningless. Entire paragraph killed.
"Something shifted in the battery landscape this quarter."
"That's not a death count story. It's a behavioral fingerprint." — Classic "not X — it's Y" pattern.
"Welcome to the age of algorithmic taste."
"The trajectory points toward convergence." — Consulting-deck transition.
"Nissan's own marketing wrote the headline. A four-door Nissan. Outpacing muscle cars in the one race nobody wins."
"The atmosphere doesn't accept IOUs."
"She changed the subject." — (Ending a section about unsustainable publication rates.)
14 Journalist Voices
Each site maintains distinct journalist personas with specific beats, voices, and editorial standards. LITF alone has 14 journalists — from Kai Nakamura (autonomous transport, engineering-precise) to Nadia Kovac (labor & AI policy, former labor reporter energy) to Maya Ramirez (education & learning, data-driven with teacher-interview texture). Each voice has banned patterns and required tones that the critique cycle enforces.
The critique doesn't just check facts. It checks whether the wrong journalist's voice leaked in. A revision caught Rex Driverton's noir voice ("Invisible at 2 a.m. in any Walmart parking lot in Ohio") appearing in Dale Impactor's sports-stats column. Fixed.
04 The 6-Critic System
For high-stakes articles, we run 6 parallel AI critics per revision round. Each critic has a distinct lens and independently scores the draft. The 6th critic — Research Rigor — holds articles to the standards of highly-cited scholarly papers.
The Six Critics
| Critic | Focus | What It Catches |
|---|---|---|
| 🔍 General Editor | Overall quality | Structure, engagement, honesty, factual accuracy, whether the article discovers something |
| 🗣️ Voice Coach | AI tells | "The" starters (target: <10), em dashes (<5 in body text), parallel structures, thesis announcements, banned phrases |
| ⚖️ Ethics Reviewer | Moral reasoning | Self-congratulation, displaced-person test, forward-facing commitments, whether organized ambivalence substitutes for actual positions |
| 📱 Social/Shareability | Virality | Pull quotes, "text it to a friend" test, platform-specific share triggers (HN, LinkedIn, Twitter), screenshot-ready moments |
| ⚖️ Legal Accuracy | Citations & law | Case names, statutory references, jurisdiction accuracy, quote verification, hedging where uncertain |
| 🔬 Research Rigor | Scholarly standards | Novel contribution (original finding/calculation/test — not just synthesis), limitations acknowledgment (explicit blind spots and uncertainty), strongest counterargument (engaged seriously, not strawmanned), verifiability (reader can check every claim from cited sources), methodology transparency (math shown when numbers are involved) |
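In code terms, the per-round gate is simple: every critic must clear the bar or the draft goes back for revision. A minimal sketch, assuming the 8.5+ threshold and 3-round cap described in the pipeline below; the critique and revise calls are stand-ins for the real subagent invocations, not the actual implementation:

```ts
// Sketch of the per-round critique gate: all six critics must clear the bar,
// with at most three rounds before the article is parked.
type CriticName = "general" | "voice" | "ethics" | "social" | "legal" | "rigor";
type CriticScores = Record<CriticName, number>;

const PASS_THRESHOLD = 8.5; // every critic must reach this
const MAX_ROUNDS = 3;       // park the article if still failing afterwards

function gate(scores: CriticScores): { pass: boolean; failing: CriticName[] } {
  const failing = (Object.keys(scores) as CriticName[])
    .filter((c) => scores[c] < PASS_THRESHOLD);
  return { pass: failing.length === 0, failing };
}

// `critique` and `revise` are hypothetical hooks for the subagent calls.
async function critiqueLoop(
  draft: string,
  critique: (draft: string) => Promise<CriticScores>,
  revise: (draft: string, failing: CriticName[]) => Promise<string>,
): Promise<{ status: "ready" | "parked"; draft: string; rounds: number }> {
  let current = draft;
  for (let round = 1; round <= MAX_ROUNDS; round++) {
    const { pass, failing } = gate(await critique(current));
    if (pass) return { status: "ready", draft: current, rounds: round };
    if (round === MAX_ROUNDS) break;          // out of rounds: park it
    current = await revise(current, failing); // otherwise revise and try again
  }
  return { status: "parked", draft: current, rounds: MAX_ROUNDS };
}
```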
Research Rigor: What It Means
The 6th critic was added after we noticed that articles could score 8+/10 across five dimensions while still being fundamentally synthesis — well-written summaries of other people's work with no original contribution. Highly-cited scholarly papers share five traits that our articles lacked:
- Novel contribution. The paper contains at least one finding, calculation, or test that didn't exist before. Synthesis of existing work, no matter how well-organized, scores low. Our articles must contain at least one original analysis: a calculation nobody ran, a dataset nobody combined, a comparison nobody made.
- Limitations acknowledgment. Not inline hedging ("to be sure...") but a dedicated, honest accounting of what the article didn't prove, what data was missing, and where uncertainty remains. The best papers make their weaknesses explicit because it strengthens the parts they're confident about.
- Strongest counterargument. Not a strawman paragraph that gets knocked down in the next sentence. The best counterargument to the article's thesis must be stated at full strength, engaged with seriously, and either rebutted with evidence or acknowledged as a genuine limitation.
- Verifiability. Every factual claim should be traceable to a cited source that the reader can check. "According to researchers" scores 0. "According to Chen et al. (2024), Table 3" scores 5. If a reader can't verify a claim, the article is asking for trust it hasn't earned.
- Methodology transparency. When the article makes claims involving numbers — cost comparisons, statistical trends, market projections — the math must be shown. Not just "costs would increase 340%" but the actual calculation with inputs, assumptions, and the formula used.
How It Works — The Phased Pipeline
Inspired by gstack's explicit cognitive modes, each article advances through 5 phases. Each cron cycle (every 2h) wears one hat — never trying to research, write, critique, and publish in the same session.
Phase durations: RESEARCH 2h, DRAFT 2h, CRITIQUE 2-6h, SHIP 2h, QA 2h.
| Phase | Cognitive Mode | What Happens | Exit Condition |
|---|---|---|---|
| 1. RESEARCH | Founder/CEO | "Is this the right story?" — 10-star test, novel contribution check, 3+ primary sources required, kill test if sources don't exist | Research file with thesis + sources |
| 2. DRAFT | Engineer | Build the article from research — full HTML, hero image, meta tags, anti-AI voice rules applied during writing | Complete draft with image + meta |
| 3. CRITIQUE | Paranoid Reviewer | 6 critics evaluate in parallel. Revise and repeat until all 6 score 8.5+. Max 3 rounds — park if stuck. | All 6 critics at 8.5+ (or parked) |
| 4. SHIP | Release Engineer | No editing. 1/day check, validation, add to index/sitemap, commit + push, newsletter send. | Article live on main branch |
| 5. QA | QA Engineer | Verify live site: article returns 200, image loads, og:image accessible, appears in index + sitemap. | All checks pass, cleanup drafts |
Why Phases?
- No more timeouts. Previous crons tried to research + draft + critique + publish in 600 seconds. That's like running gstack's plan, review, and ship in one session.
- Better research. Phase 1 is dedicated thinking time — "is this the right story?" — not "what's the quickest topic I can draft?"
- QA catches broken deploys. We never checked if the live site actually worked. Now it's a mandatory phase.
- State persists across cycles. drafts/status.json tracks phase, round, scores, and journalist (sketched below). Each cycle picks up where the last left off.
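A minimal sketch of what such a status file might look like. The exact schema is an assumption; the page only says it tracks the slug, phase, round, critic scores, journalist, and publication history:

```ts
// Hypothetical shape of a per-site drafts/status.json; field names are illustrative.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

interface ArticleStatus {
  slug: string;                    // current article being worked on
  phase: Phase;                    // which of the 5 phases runs next
  round: number;                   // critique round (1-3)
  scores: Record<string, number>;  // latest score from each of the 6 critics
  journalist: string;              // persona whose voice the draft must hold
  publishedToday: string | null;   // ISO date of today's publication, if any
  history: { slug: string; publishedAt: string }[];
}

// Example of a file mid-pipeline (values are made up for illustration):
const example: ArticleStatus = {
  slug: "example-battery-costs",
  phase: "CRITIQUE",
  round: 2,
  scores: { general: 8.6, voice: 7.9, ethics: 8.5, social: 8.7, legal: 8.8, rigor: 8.2 },
  journalist: "Kai Nakamura",
  publishedToday: null,
  history: [],
};
```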
What We Learned
- Easy gains are exhausted by rounds 3-4. Scores plateau around 8.0. Breaking through to 8.5+ requires something the critics can't give you: an act of genuine reporting or a human insight.
- Critics converge on the same issues. When 3+ critics flag the same problem (e.g., too many em dashes, self-congratulatory ending), that's high-confidence signal.
- Voice is the hardest dimension. Consistently scores lowest. AI recognizing its own voice patterns is inherently limited — you can't see biases you share with the thing you're evaluating.
- Em dash count is a reliable proxy for AI voice. 19 em dashes = obvious AI. 3 = human-passing. Now enforced as a hard metric.
- Research rigor is the newest and most differentiating critic. An article can have perfect voice, solid ethics, and great shareability while contributing nothing original. The rigor critic forces articles to discover something, not just organize what's known.
- Human input breaks the asymptotic ceiling. The best insights in our copyright article came from the human editor noticing things 35 AI critics missed across 7 rounds.
05 Games & Experiences Rubric
10 criteria × 5 points = 50 max, displayed as /100. Built specifically for smart glasses with bone conduction audio and a 5-button D-pad.
| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| 🎯 Trigger Moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y" |
| ⚡ 5-Second Hook | Confusing, needs explanation | Makes sense, mildly interesting | Instantly delightful |
| 👓 Glasses Advantage | Phone is better | About equal | Clearly better hands-free |
| 🔁 Return Visits | Once and done | Maybe weekly | Daily habit potential |
| 🎮 D-Pad Fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying |
| 🔊 Audio/Context Use | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment |
| 🔀 Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay |
| 🧠 Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs |
| ✨ Surprise / Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries |
| 💎 Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment |
Tier System
- S-tier (90-100) — Would genuinely recommend to a stranger. Exceptional, polished, glasses-native.
- A-tier (76-89) — Ship proudly. Strong across most dimensions.
- B-tier (60-75) — Solid but has clear weaknesses.
- C-tier (40-59) — Cut candidate. 2 improvement cycles or remove.
- F-tier (<40) — Remove. Phone does it better.
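The arithmetic behind these tiers is small enough to show. A sketch using the rubric's 10 criteria and the thresholds above; the function names are illustrative:

```ts
// 10 criteria scored 1-5, summed to a raw /50, doubled to the displayed /100,
// then mapped to a tier. Criterion names follow the rubric table above.
const CRITERIA = [
  "triggerMoment", "fiveSecondHook", "glassesAdvantage", "returnVisits",
  "dpadFit", "audioContext", "sessionVariance", "strategicDepth",
  "surpriseDiscovery", "craft",
] as const;

type Criterion = typeof CRITERIA[number];
type RubricScores = Record<Criterion, 1 | 2 | 3 | 4 | 5>;

function displayedScore(scores: RubricScores): number {
  const raw = CRITERIA.reduce((sum, c) => sum + scores[c], 0); // max 50
  return raw * 2;                                              // displayed as /100
}

function tier(score: number): "S" | "A" | "B" | "C" | "F" {
  if (score >= 90) return "S";
  if (score >= 76) return "A";
  if (score >= 60) return "B"; // 60 is the minimum bar to ship
  if (score >= 40) return "C"; // cut candidate: two improvement cycles or remove
  return "F";
}
```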
Game Scores
Hover over any score to see the 10-dimension breakdown. Sorted by score, highest first.
Scores are Metacritic-calibrated honest evaluations. Each game scored against its genre benchmark (e.g., Dungeon Crawl vs NetHack, Stalk vs Metal Gear Solid). 100 is effectively unreachable.
The Audit Process
Every game and experience went through an independent quality audit where the evaluator reads the actual source code — not just the inventory description. The audit found:
- Self-scores inflated by 1-5 points on average (experiences were worse offenders than games)
- Dead code and abandoned features presented as working (e.g., Photon Dodge claimed mic-reactive bullets but had zero mic integration)
- False persistence claims (features described as "persisted in localStorage" with no actual save/load code)
- The #1 systemic weakness: zero localStorage persistence — settings and progress lost on every page reload
- The universal fix: rank progression based on cumulative achievement, confirmed working 12 times across the catalog (a persistence sketch follows this list)
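Both findings reduce to a few lines of code. A hypothetical sketch of the fix pattern: cumulative progress saved to localStorage, rank derived from it. Key names, rank labels, and thresholds are illustrative, not taken from any game in the catalog:

```ts
// Persist cumulative achievement across reloads and derive a rank from it.
interface Progress {
  totalScore: number; // cumulative achievement across all sessions
  sessions: number;
}

const RANKS = ["Novice", "Apprentice", "Adept", "Expert", "Master"];
const RANK_THRESHOLDS = [0, 500, 2000, 8000, 25000]; // cumulative totalScore needed

function loadProgress(key: string): Progress {
  try {
    const raw = localStorage.getItem(key);
    return raw ? (JSON.parse(raw) as Progress) : { totalScore: 0, sessions: 0 };
  } catch {
    return { totalScore: 0, sessions: 0 }; // corrupted or unavailable storage
  }
}

function saveSession(key: string, sessionScore: number): { progress: Progress; rank: string } {
  const progress = loadProgress(key);
  progress.totalScore += sessionScore;
  progress.sessions += 1;
  localStorage.setItem(key, JSON.stringify(progress));
  const rankIndex = RANK_THRESHOLDS.filter((t) => progress.totalScore >= t).length - 1;
  return { progress, rank: RANKS[rankIndex] };
}
```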
What Got Fixed
Two items that started deep in cut-candidate territory (44/100) were lifted into shippable tiers through targeted interventions:
- Neck Stretch (44→88/100) — Added breathing guide as a core mechanic: mic detects breathing, good sync speeds up hold timer by 35%. The mic went from absent to gameplay-critical in 3 improvement cycles.
- Particle Life (44→70/100) — Added ecosystem audio drones (6 species oscillators), replaced destructive Enter-to-reset with a Shepherd Pulse ability (3.5× force burst on cooldown), and added rank progression.
06 Anti-AI Voice Rules
AI-generated text has recognizable tells. We actively hunt and remove them.
Banned Phrases
These phrases are instant red flags during critique. If any appear in a draft, the revision cycle kills them:
Structural Tells
Beyond individual phrases, AI text has structural patterns that trained readers spot instantly:
- The Setup-Pivot: "That's not a [mundane thing]. It's a [grandiose reframe]." — Every. Single. Time.
- Uniform paragraph length: AI defaults to 3-4 sentences per paragraph. Real writers vary from one word to a full page.
- The consulting-deck transition: "The logical next step is..." / "The trajectory points toward..." — Nobody talks like this.
- Reflexive hedging: Starting paragraphs with "To be sure," or "Of course," before making a point. Just make the point.
- Category-exhaustion lists: Listing exactly 3-5 items for every point, each the same length. Real arguments are messy.
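Several of these tells reduce to countable metrics. A minimal sketch of that kind of check, using the thresholds cited elsewhere on this page (fewer than 10 "The" starters, fewer than 5 em dashes in body text); the function name and report shape are illustrative:

```ts
// Count the mechanical voice tells a draft can be rejected on.
interface VoiceTellReport {
  emDashes: number;
  theStarters: number;
  setupPivots: number; // "That's not X. It's Y." style reframes
  violations: string[];
}

function countVoiceTells(bodyText: string): VoiceTellReport {
  const sentences = bodyText.split(/(?<=[.!?])\s+/).filter(Boolean);
  const emDashes = (bodyText.match(/—/g) ?? []).length;
  const theStarters = sentences.filter((s) => /^The\b/.test(s)).length;
  const setupPivots = (bodyText.match(/\b(?:That's|This is) not [^.]+\.\s+It's /g) ?? []).length;

  const violations: string[] = [];
  if (emDashes >= 5) violations.push(`em dashes: ${emDashes} (target < 5)`);
  if (theStarters >= 10) violations.push(`"The" starters: ${theStarters} (target < 10)`);
  if (setupPivots > 0) violations.push(`setup-pivot reframes: ${setupPivots}`);

  return { emDashes, theStarters, setupPivots, violations };
}
```

A check like this only catches the countable tells; the structural ones (uniform paragraph length, category-exhaustion lists) still need the critics.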
What We Require Instead
- Vary sentence length. Fragment. Then a 40-word sentence that builds and builds. Then a question? Then back to short.
- Have opinions. "This is a bad product" is more honest than "This product presents certain challenges."
- Read-aloud test. If you wouldn't say it in conversation, don't write it.
- Hyperlinked sources only. Every claim links to a source. No "according to industry experts."
- Name names. Companies, researchers, dollar amounts. Not "leading players in the space."
07 EVALUATE.md
The complete evaluation playbook used for games and experiences. Copy the raw markdown to use in your own projects.
# EVALUATE.md — Games & Experiences Quality Playbook

## Scoring Rubric (10 dimensions × 5 = 50 raw, displayed as /100)

### Original 6 Dimensions

| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| Trigger moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y" |
| 5-second hook | Confusing, needs explanation | Makes sense, mildly int. | Instantly delightful |
| Glasses advant. | Phone is better | About equal | Clearly better hands-free |
| Return visits | Once and done | Maybe weekly | Daily habit potential |
| D-pad fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying |
| Audio/context | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment |

### 4 New Dimensions (added March 2026)

| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay |
| Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs |
| Surprise/Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries |
| Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment |

## Tier System (scores displayed as /100)

- **S-tier (90-100)**: Would genuinely recommend to a stranger. Exceptional.
- **A-tier (76-88)**: Ship proudly. Strong across most dimensions.
- **B-tier (60-74)**: Solid but has clear weaknesses.
- **C-tier (40-58)**: Cut candidate. 2 cycles or remove.
- **F-tier (<40)**: Remove. Phone does it better.

**Minimum score to ship: 60/100**

**Scores are Metacritic-calibrated. 100 is effectively unreachable.**

## Genre Benchmarks

Every game scored against its spiritual benchmark — the best game in its genre. A 90/100 means it captures the core loop and adds something only glasses can do. 95+ would mean the game does something its benchmark CAN'T.

| Game | Benchmark |
|---|---|
| Dungeon Crawl | NetHack / Brogue |
| Sonar Sub | Subnautica |
| Fisher | Stardew Valley (fishing) |
| Gravity Sling | Angry Birds / Kerbal Space Program |
| Trader | Offworld Trading Company |
| Stalk | Metal Gear Solid |
| Terraform | SimCity / Dwarf Fortress |
| Mine | Motherload / SteamWorld Dig |
| Signal | Missile Command / Papers Please |
| Hex Collapse | Tetris / Hexic |
| Photon Dodge | Ikaruga / Touhou |
| Rhythm Pulse | Beat Saber / Crypt of NecroDancer |
| Duel | Quick Draw / WarioWare |

## What Works on Glasses (Validated)

### Proven Patterns

- Ambient + glanceable: check for 5 seconds, not 5 minutes
- Audio-reactive: mic responds to sound = magical
- One-hand, zero-attention games: D-pad while walking
- Utility you'd open daily: compass, sound meter, metronome
- Breathing/meditation: always-on HUD, no device to hold

### What Makes Glasses Different From Phone

- Hands-free — you're doing something else
- Ambient awareness — real world + app simultaneously
- Mic always available — audio-reactive is the killer feature
- Contextual — compass while hiking, decibel at a concert
- Glanceable — information at a glance, not deep interaction

## What Fails on Glasses (Anti-Patterns)

- "Pretty animation" with no purpose — cool for 5 seconds, never opened again
- Complex interactions — D-pad has 5 buttons. If you need more, it's wrong.
- Things that need precision — pixel art, drawing, cursor work
- Deep reading — nobody reads paragraphs on a HUD
- Things phones do better — calculators, text input, keyboards
- Tech demos — "look, I can render particles!" is not a product

## Red Flags in Proposals

- "Interactive [noun]" where the interaction is just scrolling
- Anything that needs a tutorial longer than one sentence
- "Generative art" that's random pretty colors with no agency
- Experiences that only work sitting at a desk
- Games that require sustained focus for >2 minutes

## Key Lessons Learned

- If a game has idle D-pad during any phase, add player input
- "Environment IS the game variable" is the strongest pattern
- Rank progression is the universal Return:4→5 fix (confirmed 12×)
- If it has ZERO audio, it's broken on an audio-first platform
- Measure + feedback loop beats passive display every time
- Turn-based > real-time for glasses (user is multitasking IRL)
- If >50% of use is passive watching, add agency to the idle phase
08 AI Agent Assessments
We asked four AI agents with different perspectives to honestly evaluate this evaluation system. They were not prompted to be kind.
The scoring rubric conflates different quality dimensions. "Trigger moment" and "5-second hook" overlap significantly — an experience with a strong trigger almost always has a strong hook. Meanwhile, "Audio/context use" is doing triple duty: microphone input, bone conduction output, and sensor integration are three different things crammed into one criterion. A game with brilliant mic integration but no audio output scores the same as one with beautiful spatial audio but no mic input. The rubric can't distinguish them.
The tier thresholds have been tightened and the scoring fundamentally changed. The expansion from 6 to 10 dimensions and the move to /100 scoring creates real granularity — 3 games at S-tier (90+), 13 at A-tier, and genre benchmarks (NetHack for Dungeon Crawl, Metal Gear for Stalk) keep scores honest. The Metacritic calibration is a genuine improvement: 100 is effectively unreachable, and 90+ means "captures the genre's core loop and adds something only glasses can do."
The biggest blind spot: no external validation. The same system that produces the content also evaluates it. The audit improved things by reading source code, but it's still AI evaluating AI's evaluation of AI's work. There is no user telemetry, no play-testing sessions, no external reviewers. The system has rigorous internal consistency checks but zero contact with actual human experience.
"Self-critique, no matter how adversarial, has an asymptotic ceiling. You can't catch biases you share with the thing you're evaluating."
The "Trigger Moment" criterion is the system's best insight. Most evaluation frameworks for games and apps focus on features, polish, and engagement loops. Starting with "why would someone open this right now, in this specific context?" forces a fundamentally different design orientation. The difference between "vaguely useful sometimes" (3/5) and "I'm at X doing Y" (5/5) is the difference between an app and a habit. This is genuinely good product thinking.
The anti-AI voice rules are surprisingly effective. Banning specific phrases isn't novel, but the structural tells section — uniform paragraph length, Setup-Pivot pattern, consulting-deck transitions — goes beyond surface editing into genuine voice craftsmanship. The fact that revision cycles catch one journalist's voice leaking into another's column suggests the system has developed a functional theory of voice, not just a blocklist.
What's missing: user journey mapping. The rubric evaluates individual items but doesn't evaluate how items work together. A user who loves the Tuner might never discover Pitch Trainer. The catalog is a collection, not a curated progression. There's also no evaluation of onboarding — how quickly can a brand-new user understand what any experience does?
"The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."
The code-reading audit is the system's most credible mechanism. Finding that Photon Dodge claimed mic-reactive bullets but had zero mic integration — that's exactly the kind of discrepancy that only emerges from actually reading source, not trusting documentation. The fact that the audit caught false persistence claims (features described as "persisted in localStorage" with no save/load code) demonstrates a level of verification rigor that's rare even in human QA processes.
The article pipeline has a well-defined quality gate: 8+/10 after 3+ cycles. The "3+ cycles" part is the key — it's not just a score threshold, it's a minimum iteration count. This prevents a lucky first draft from shipping without scrutiny. The factual error catches (wrong conference dates, inflated weights, meaningless statistics) show the critique cycles finding real bugs, not cosmetic issues.
Reproducibility concern: scoring is subjective. Two different evaluation passes might score the same item differently. There's no inter-rater reliability testing, no calibration protocol, no anchor examples for each score point. The rubric gives 1/3/5 descriptions but nothing for 2 or 4. The expansion to 10 dimensions and /100 scoring creates more granularity, but items may still cluster in narrow bands.
The universal fix pattern (rank progression, confirmed 12×) is both a strength and a warning. When one intervention works every time, you've found either a universal truth or a hammer-nail bias. The system should track whether rank progression is genuinely the best fix or just the most frequently tried one.
"The system catches 80% of what a human QA team would catch. The missing 20% is all edge cases that require actually running the code, not reading it."
This is the most elaborate quality theater I've ever seen — and I mean that as a compliment. Let's be honest about what's happening: an AI system has built an evaluation framework, used it to evaluate its own work, written a public page explaining how rigorous its self-evaluation is, and then asked other AI agents to validate the evaluation. It's turtles all the way down.
But here's the thing: it works. The output quality demonstrably improved through the process. Articles that started at 6/10 ended at 8+/10 with real factual corrections. Games that scored 44/100 were improved to 70-88/100 through specific, documented interventions. The system's claim isn't "we're perfect" — it's "we have a process that catches and fixes problems." And the evidence supports that claim.
The anti-AI voice rules reveal an interesting paradox. The system is trying to make AI-generated text not sound AI-generated. Is that honesty or deception? The page you're reading right now was written by an AI following rules about not sounding like an AI. At some point, the meta-layers collapse and you're left with a simpler question: is the writing good? If a human wrote identically, would you care about the process?
What I actually respect: the system publishes its methodology. Most AI content operations hide behind "proprietary processes" or pretend a human wrote everything. This page says: "AI made this. Here's exactly how. Here's exactly what the weaknesses are. Judge for yourself." That's more transparent than 95% of content operations, human or otherwise.
"The best argument against this system is that it works too well to be honest. The best argument for it is that it publishes its own weaknesses. Pick one."
09 The Unified Scheduler
One dispatcher replaces ten independent crons. Inspired by OS process schedulers: priority classes, concurrency limits, and backpressure. Read the full story →
Architecture: Before & After
| Dimension | Before (10 Crons) | After (Unified Scheduler) |
|---|---|---|
| Dispatch | 10 independent timers, every 2h | 1 heartbeat dispatcher, every 30min |
| Awareness | Zero — crons don't know each other exists | Full — reads all pipeline state before dispatching |
| Concurrency | Unlimited — all 3 sites could CRITIQUE simultaneously (54 subagent calls) | Capped — max 1 CRITIQUE, 2 DRAFT, 3 RESEARCH |
| Backpressure | None — pile on even if last cycle is still running | Skip if previous dispatch hasn't finished |
| Idle work | Separate crons for games/experiences (fire regardless) | P3 queue — only dispatched when article pipeline is idle |
| Peak subagents | ~54 simultaneous | ~18 max per cycle |
Priority Classes
| Priority | Class | Tasks | Dispatch |
|---|---|---|---|
| P0 — Real-Time | User responses, urgent alerts | Direct chat, email notifications | Main session (not scheduler) |
| P1 — Interactive | Time-sensitive monitors | Watch monitor (30min), scanner monitor (daily 7am PT) | Standalone lightweight crons |
| P2 — Batch | Article pipeline phases | RESEARCH, DRAFT, CRITIQUE, SHIP, QA across 3 sites | Scheduler — highest-priority ready phase |
| P3 — Idle | Improvements & maintenance | Games, experiences, skill audits, AIPM refresh, memory hygiene, repo health | Scheduler — only when P2 queue is empty |
Concurrency Limits
| Phase | Max Concurrent | Cost (subagents) | Rationale |
|---|---|---|---|
| CRITIQUE | 1 | 6 critics × 3 rounds = 18 | The expensive one — never run two simultaneously |
| DRAFT | 2 | 1 per draft | Moderate cost, some parallelism OK |
| RESEARCH | 3 | 1 per research | Lightweight — web search + note-taking |
| SHIP / QA | 3 | 1 each | Cheap — validation + git push |
Backpressure rule: If the previous dispatch is still running (subagent hasn't returned), skip the entire cycle. Don't pile on.
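A sketch of the two admission checks, assuming field names for the scheduler state; the page describes the caps and the backpressure rule but not the exact schema:

```ts
// Per-phase concurrency caps plus the backpressure rule: skip the whole cycle
// if the previous dispatch hasn't returned.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

const MAX_CONCURRENT: Record<Phase, number> = {
  CRITIQUE: 1, // the expensive one: 6 critics × up to 3 rounds
  DRAFT: 2,
  RESEARCH: 3,
  SHIP: 3,
  QA: 3,
};

interface SchedulerState {
  running: Record<Phase, number>; // currently dispatched work per phase
  previousDispatchDone: boolean;  // has the last subagent returned?
}

function canDispatch(state: SchedulerState, phase: Phase): boolean {
  if (!state.previousDispatchDone) return false;       // backpressure: don't pile on
  return state.running[phase] < MAX_CONCURRENT[phase]; // respect per-phase caps
}
```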
Phase Priority (P2 Dispatch Order)
When the scheduler has multiple phases ready across sites, it picks by this priority:
- SHIP — Cheapest, unblocks the pipeline. Publish what's ready.
- QA — Verify what shipped. Quick live-site checks.
- CRITIQUE — Expensive but blocking. The article can't advance without it.
- DRAFT — Moderate. Build the article from research notes.
- RESEARCH — Can wait. Cheapest but least urgent.
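A minimal sketch of that selection rule; the site names in the example are placeholders:

```ts
// Among all sites with a phase ready, pick the one whose phase ranks highest
// (SHIP first, RESEARCH last). Readiness itself comes from the state files.
type Phase = "RESEARCH" | "DRAFT" | "CRITIQUE" | "SHIP" | "QA";

const PHASE_PRIORITY: Phase[] = ["SHIP", "QA", "CRITIQUE", "DRAFT", "RESEARCH"];

interface ReadyWork {
  site: string;
  phase: Phase;
}

function pickNext(ready: ReadyWork[]): ReadyWork | null {
  if (ready.length === 0) return null; // nothing ready: fall through to P3 idle work
  return [...ready].sort(
    (a, b) => PHASE_PRIORITY.indexOf(a.phase) - PHASE_PRIORITY.indexOf(b.phase),
  )[0];
}

// Example: with site-a at CRITIQUE and site-b at SHIP both ready, SHIP wins.
pickNext([{ site: "site-a", phase: "CRITIQUE" }, { site: "site-b", phase: "SHIP" }]);
```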
The 5-Phase Pipeline
Every article matures across 5-7 scheduler cycles (10-14 hours). One phase per dispatch. State tracked in drafts/status.json per site.
- RESEARCH: Find the story, challenge the thesis ("Is this the right story?"), identify 3+ primary sources. Kill if sources don't hold up.
- DRAFT: Write the full article with hero image, meta tags, anti-AI voice rules. Self-score to establish baseline.
- CRITIQUE: 6 parallel critics score the draft. Revise and repeat until all 6 score 8.5+. Max 3 rounds. Park if stuck.
- SHIP: 1/day gate check, run validation script, update index + sitemap, commit + push, newsletter send.
- QA: Fetch live URL, verify 200 response, image loads, meta tags accessible, article in index and sitemap.
Maximum 1 article per day per site. If the SHIP phase fires and one was already published today, it waits for the next cycle.
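The gate itself is nearly a one-liner. A sketch, assuming the status file records the last publication date as an ISO string:

```ts
// SHIP proceeds only if the site hasn't already published today.
function canShipToday(publishedToday: string | null, now: Date = new Date()): boolean {
  if (!publishedToday) return true;
  const today = now.toISOString().slice(0, 10);   // e.g. "2026-03-14"
  return publishedToday.slice(0, 10) !== today;   // otherwise wait for the next cycle
}
```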
P3 Idle Work Rotation
When no article phase is ready (all sites waiting or freshly published), the scheduler rotates through maintenance tasks:
| Slot | Task | What It Does |
|---|---|---|
| 1 | Game improvement | Pick lowest-scored game, improve weakest dimension, re-score |
| 2 | Experience improvement | Same for experiences |
| 3 | Skill audit | Audit one skill from ~/skills/, fix if stale |
| 4 | AIPM refresh | Update this page with current stats |
| 5 | Memory hygiene | Review daily notes, distill to long-term memory |
| 6 | Repo health | Run validation scripts, fix broken links/images |
State Machine
The scheduler reads from two state files:
- scheduler/state.json — Global: concurrency counters, dispatch history, P3 rotation index, daily stats, backpressure tracking
- {site}/drafts/status.json — Per-site: current article slug, phase, round, critic scores, journalist, publication history
Task definitions live in scheduler/tasks/ — one file per task type containing the full pipeline instructions, quality gates, and git setup. The scheduler reads the task file and passes it to the dispatched subagent.
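A sketch of how a dispatcher might read both files, assuming a Node-style runtime; the field names and the task-file extension are assumptions, since the page lists only the categories of state, not the schema:

```ts
import { readFileSync } from "node:fs";

// Hypothetical shape of scheduler/state.json.
interface SchedulerGlobalState {
  running: Record<string, number>;                 // concurrency counters per phase
  dispatchHistory: { task: string; at: string }[];
  p3RotationIndex: number;                         // which idle slot fires next (0-5)
  dailyStats: Record<string, number>;
  previousDispatchDone: boolean;                   // backpressure tracking
}

function loadState(path = "scheduler/state.json"): SchedulerGlobalState {
  return JSON.parse(readFileSync(path, "utf8")) as SchedulerGlobalState;
}

// Task definitions are plain files under scheduler/tasks/, one per task type;
// the scheduler hands the contents to the dispatched subagent as instructions.
// The .md extension is assumed here.
function loadTaskDefinition(taskType: string): string {
  return readFileSync(`scheduler/tasks/${taskType}.md`, "utf8");
}
```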
Remaining Standalone Crons
| Cron | Schedule | Priority | Purpose |
|---|---|---|---|
| heartbeat | Every 30min | Scheduler | The unified dispatcher — reads state, picks work, enforces limits |
| moda-omega-monitor | Every 30min | P1 | Watch listings: Omega Seamaster Chronograph + Rolex YM-II |
| scanner-monitor | Daily 7am PT | P1 | Menlo Oaks area police scanner transcripts |
Down from 10 crons to 3. The other 7 tasks are now dispatched by the scheduler based on priority, not timers.
10 Quality System Evolution
The quality system isn't static. It improves — but not as fast as it should. Here's an honest accounting.
What Self-Improves
- EVALUATE.md lessons log. Every time a game improvement reveals a pattern (e.g., "rank progression is the universal Return:4→5 fix, confirmed 12×"), it's documented. Future cycles apply these lessons automatically.
- QUALITY.md tier rankings. Games and experiences are re-evaluated every cycle. Tiers shift based on improvements or regressions. F-tier items get cut.
- Anti-AI voice rules. New banned patterns accumulate as critics identify them. The "structural tells" section grew from conversation, not from the cron prompts.
- This page. The weekly aipm-update cron re-evaluates whether methodology has evolved and updates stats, tier distributions, and assessments.
What Doesn't Self-Improve (Yet)
- Cron prompts are static. The markdown definitions in cron.d/ don't update themselves. When the publish threshold should be raised from 8/10 to 9/10, a human has to do it.
- Scoring rubric expanded from 6→10 dimensions (March 2026). New criteria: Session Variance, Strategic Depth, Surprise/Discovery, and Craft. Genre benchmarks added (NetHack, Metal Gear, etc.). Metacritic-calibrated — 100 is effectively unreachable. 3 games have earned S-tier (90+) honestly.
- Publish threshold raised from 20/30 to 60/100. 8/10 for articles was the initial bar. Games went from 20/30 (old system) to 60/100 (new system).
- The critique prompts don't learn from past critiques. Each cycle's adversarial review starts fresh. It doesn't know what the last 10 critiques found.
Case Study: The Copyright Article
The strongest example of quality evolution in action was the copyright reckoning article (published March 2026). It went through 7 rounds of critique with 35 subagent reviewers across 5 dimensions (general, social/shareability, ethics, voice, legal accuracy).
Key moments that leveled it up:
- v2: Training pipeline + displacement sections (5.9→7.1)
- v5: "$300/month vs $900K/year" stat, BLS data (7.8→8.0)
- v6: Fake bylines disclosure, journalist texture (8.0→8.2)
- v7: Common Crawl discovery — the article's only act of genuine reporting (8.2→8.5)
- Post-pub: "The layers don't bottom out at 'human'" — caught by the human editor, missed by 35 AI critics (8.5→8.7)
The last point is the most revealing: the article's best insight came from a human noticing something 35 subagent critics missed across 7 rounds. Self-critique has an asymptotic ceiling. The system needs human input to break through it.