00 The Gap
An honest accounting of what this evaluation system covers, and the much larger space it doesn't.
We've built a real quality pipeline. Articles go through 6 adversarial critics across 5 revision phases. Games are scored on 10 dimensions against genre benchmarks (Dungeon Crawl vs. NetHack, Stalk vs. Metal Gear Solid). Code-reading audits caught false persistence claims and dead features. Anti-AI voice enforcement kills banned phrases and structural tells. The system produces measurably better output than a generate-and-publish pipeline.
But individual artifact evaluation is the easy part. Here's what we still can't do:
A reader landing on any article has no way to know whether it went through the full 6-critic pipeline, got a quick single-pass review, or shipped with no evaluation at all. The quality work is invisible to the people it's supposed to serve.
We score individual articles and individual games. Nobody evaluates the experience of discovering the site, finding a story, reading it, stumbling into the games catalog, trying one, and coming back tomorrow. The UX Researcher assessment below nailed this: "The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."
Page load speed, mobile rendering, audio loading times, accessibility compliance, broken links after deploy. None of it is measured or scored. A QA phase verifies the article returns HTTP 200 and the image loads. That's it. We have no Lighthouse scores, no Core Web Vitals, no screen-reader testing.
Zero user telemetry. No play-testing sessions. No external reviewers. No A/B tests. No surveys. The system defines quality internally and never checks whether readers agree. We don't know which articles get read to the end, which games get played more than once, or which pages people close in 3 seconds.
Even with 6 adversarial critics, you can't catch biases you share with the evaluator. Our copyright article went through 7 rounds with 35 AI critic passes. The article's single best insight came from a human editor noticing something all 35 missed. Self-critique has an asymptote, and we're sitting on it.
An article can score 9/10 and a game can score 90/100 and together they might tell an incoherent story about what this site even is. Each domain has its own rubric, its own tier system, its own quality bar. Nothing evaluates whether the pieces fit together into something a person would want to spend time with.
The system doesn't learn from what users actually read, share, or return to. Critique prompts start fresh every cycle. The evaluation rubric evolved through human intervention (expanding from 6 to 10 dimensions, adding Research Rigor), not through data about what works. Quality is defined by the builders, never validated by the audience.
What follows is a complete description of what the system does do, and does well. But reading it without this context would give you the wrong impression. We have rigorous internal consistency checks. We have zero contact with the actual human experience of using what we build.
01 The Philosophy
AI producing content isn't interesting. AI honestly evaluating its own content and killing what doesn't pass is.
The standard AI content pipeline is: generate, publish, forget. Every piece ships because it exists, not because it's good. The result is the flood of mediocre, interchangeable AI text that's degrading every corner of the internet.
We built the opposite system. Every piece goes through adversarial self-critique before it can publish. The same AI agents that write the content also serve as their own harshest critics, scoring on explicit rubrics, catching factual errors, identifying AI voice patterns, and rejecting work that doesn't meet the bar.
Three operating principles:
- Honest scoring beats rubber-stamping. Self-scores are routinely marked down during independent audits. Our quality audit found games over-scoring themselves by 2-10 points. The audit catches it.
- Minimum bars are real. Articles need 8.5+/10 from all 6 critics after 3+ revision cycles. Games need 60/100 minimum to ship. Anything below B-tier gets two improvement cycles, then gets cut.
- The process is the product. We don't publish our best first draft. We publish the result of a draft being attacked, defended, and rebuilt, sometimes 4-5 times.
That said, principles without measurement are just slogans. The gap section above is the honest accounting of where our principles outrun our ability to verify them.
02 The Pipeline
Every article advances through 5 phases. Each cycle wears one hat, inspired by gstack's explicit cognitive modes. 6 parallel critics evaluate each draft.
Five Phases, One Hat Per Cycle
A cron fires every 2 hours. Each cycle picks up one article in one phase and does that work only. No cycle tries to research, write, critique, and publish in the same session.
| Phase | Cognitive Mode | What Happens | Exit Condition |
|---|---|---|---|
| 1. Research | Founder/CEO | "Is this the right story?" 10-star test, novel contribution check, 3+ primary sources required, kill test if sources don't exist | Research file with thesis + sources |
| 2. Draft | Engineer | Build the article from research. Full HTML, hero image, meta tags, anti-AI voice rules applied during writing | Complete draft with image + meta |
| 3. Critique | Paranoid Reviewer | 6 critics evaluate in parallel. Revise and repeat until all 6 score 8.5+. Max 3 rounds. Park if stuck. | All 6 critics at 8.5+ (or parked) |
| 4. Ship | Release Engineer | 1/day gate, validation script, add to index + sitemap, commit + push, newsletter send | Article live on main branch |
| 5. QA | QA Engineer | Fetch live URL: article returns 200, image loads, og:image accessible, appears in index + sitemap | All checks pass, cleanup drafts |
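The one-phase-per-cycle rule can be sketched as a small state machine. This is an illustration, not the production scheduler: `Article`, `run_cycle`, and the `DONE` and `PARKED` states are hypothetical names, and the caller is assumed to decide whether a phase's exit condition was met.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    RESEARCH = 1
    DRAFT = 2
    CRITIQUE = 3
    SHIP = 4
    QA = 5
    DONE = 6    # hypothetical terminal state, not named in the table
    PARKED = 7  # "Park if stuck"


MAX_CRITIQUE_ROUNDS = 3  # from the Critique row: "Max 3 rounds"


@dataclass
class Article:
    slug: str
    phase: Phase = Phase.RESEARCH
    critique_rounds: int = 0


def run_cycle(article: Article, exit_condition_met: bool) -> Article:
    """Do one cycle of work: stay put, advance exactly one phase, or park."""
    if article.phase in (Phase.DONE, Phase.PARKED):
        return article  # nothing left to do for this article

    if article.phase is Phase.CRITIQUE:
        article.critique_rounds += 1
        stuck = article.critique_rounds >= MAX_CRITIQUE_ROUNDS
        if not exit_condition_met and stuck:
            article.phase = Phase.PARKED
            return article

    if exit_condition_met:
        # Phases are numbered in order, so the next phase is value + 1.
        article.phase = Phase(article.phase.value + 1)
    return article
```

Calling `run_cycle` once per cron tick means no single session can research, write, critique, and publish the same article.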
The 6 Critics
The critique phase runs 6 independent AI critics in parallel. Each has a distinct lens and scores the draft independently. Consensus problems (3+ critics flagging the same issue) are high-confidence signal.
| Critic | Focus | What It Catches |
|---|---|---|
| 🔍 General Editor | Overall quality | Structure, engagement, honesty, factual accuracy, whether the article discovers something |
| 🗣️ Voice Coach | AI tells | "The" starters (target: <10), em dashes (<5 in body text), parallel structures, thesis announcements, banned phrases |
| ⚖️ Ethics Reviewer | Moral reasoning | Self-congratulation, displaced-person test, forward-facing commitments, whether organized ambivalence substitutes for actual positions |
| 📱 Social/Shareability | Virality | Pull quotes, "text it to a friend" test, platform-specific share triggers (HN, LinkedIn, Twitter), screenshot-ready moments |
| ⚖️ Legal Accuracy | Citations & law | Case names, statutory references, jurisdiction accuracy, quote verification, hedging where uncertain |
| 🔬 Research Rigor | Scholarly standards | Novel contribution (original finding, not just synthesis), limitations acknowledgment, strongest counterargument engaged seriously, verifiability, methodology transparency |
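The two rules the critique phase relies on (every critic at 8.5+, and 3+ critics flagging the same issue counting as consensus) might look like this in code. It is a sketch under assumptions: the critic keys and issue labels are made up, and real critic output is presumably free text that would first need normalizing into labels.

```python
from collections import Counter

PASS_SCORE = 8.5   # every critic must reach this before Ship
CONSENSUS_MIN = 3  # 3+ critics flagging the same issue


def all_critics_pass(scores: dict[str, float]) -> bool:
    """True only if every critic scored the draft at or above the bar."""
    return all(score >= PASS_SCORE for score in scores.values())


def consensus_issues(flags: dict[str, set[str]]) -> list[str]:
    """Return issue labels raised by at least CONSENSUS_MIN critics."""
    counts = Counter(issue for issues in flags.values() for issue in issues)
    return sorted(issue for issue, n in counts.items() if n >= CONSENSUS_MIN)
```

Using a set per critic means one critic repeating the same complaint still counts once.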
Why Research Rigor Exists
The 6th critic was added after we noticed that articles could score 8+/10 across five dimensions while still being fundamentally synthesis: well-written summaries of other people's work with no original contribution. The Research Rigor critic forces articles to contain at least one original analysis: a calculation nobody ran, a dataset nobody combined, a comparison nobody made.
It holds articles to five traits shared by highly-cited scholarly papers:
- Novel contribution. At least one finding or test that didn't exist before. Synthesis scores low.
- Limitations acknowledgment. Not inline hedging ("to be sure...") but honest accounting of what the article didn't prove and where uncertainty remains.
- Strongest counterargument. Stated at full strength, engaged with seriously. Not a strawman paragraph knocked down in the next sentence.
- Verifiability. Every factual claim traceable to a cited source. "According to researchers" scores 0. "According to Chen et al. (2024), Table 3" scores 5.
- Methodology transparency. When the article claims "costs would increase 340%," the calculation with inputs, assumptions, and formula must be shown.
What the Critics Actually Catch
These aren't gentle suggestions. Every revision cycle runs genuinely adversarial critique looking for specific failure modes:
"Samsung unveiled at InterBattery this week" — InterBattery date was wrong by a month.
"LFP cells weigh over 2 kg" — Actual weight is ~1.2 kg. Off by nearly double.
"66.2% kill rate" — All FARS crashes are fatal by definition, making this number meaningless. Entire paragraph killed.
"Something shifted in the battery landscape this quarter."
"That's not a death count story. It's a behavioral fingerprint." — Classic "not X — it's Y" pattern.
"Welcome to the age of algorithmic taste."
"Nissan's own marketing wrote the headline. A four-door Nissan. Outpacing muscle cars in the one race nobody wins."
"The atmosphere doesn't accept IOUs."
"She changed the subject." — (Ending a section about unsustainable publication rates.)
14 Journalist Voices
Each site maintains distinct journalist personas with specific beats, voices, and editorial standards. LITF alone has 14 journalists. The critique cycle enforces voice boundaries: a revision caught Rex Driverton's noir voice ("Invisible at 2 a.m. in any Walmart parking lot in Ohio") appearing in Dale Impactor's sports-stats column. Fixed.
03 Games & Experiences
10 criteria × 5 points each = 50 max, doubled for display as a score out of 100. The rubric is built specifically for smart glasses with bone conduction audio and a 5-button D-pad.
| Criterion | 1 (Bad) | 3 (OK) | 5 (Great) |
|---|---|---|---|
| 🎯 Trigger Moment | Can't think of one | Vaguely useful sometimes | Specific: "I'm at X doing Y" |
| ⚡ 5-Second Hook | Confusing, needs explanation | Makes sense, mildly interesting | Instantly delightful |
| 👓 Glasses Advantage | Phone is better | About equal | Clearly better hands-free |
| 🔁 Return Visits | Once and done | Maybe weekly | Daily habit potential |
| 🎮 D-Pad Fit | Awkward, needs more inputs | Works but clunky | Natural, satisfying |
| 🔊 Audio/Context Use | Ignores mic and sensors | Uses one sensor | Deeply integrated with environment |
| 🔀 Session Variance | Identical every time | Some randomization | Deeply procedural with emergent gameplay |
| 🧠 Strategic Depth | Pure reflexes, no decisions | Some tactical choices | Deep resource management and tradeoffs |
| ✨ Surprise / Discovery | Fully known in 30 seconds | Some unlockables | Genuine emergent discoveries |
| 💎 Craft | Functional but generic | Well-made | Has a "wow, that's clever" moment |
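The arithmetic behind a displayed score is simple enough to sketch. The criterion keys below are hypothetical identifiers for the ten rows above, the doubling step is inferred from "50 max, displayed as /100", and the 60/100 ship minimum comes from the operating principles.

```python
CRITERIA = (
    "trigger_moment", "five_second_hook", "glasses_advantage", "return_visits",
    "dpad_fit", "audio_context_use", "session_variance", "strategic_depth",
    "surprise_discovery", "craft",
)
SHIP_MINIMUM = 60  # "Games need 60/100 minimum to ship"


def displayed_score(ratings: dict[str, int]) -> int:
    """Sum ten 1-5 ratings (raw max 50) and double for the /100 display."""
    if set(ratings) != set(CRITERIA):
        raise ValueError("ratings must cover exactly the ten criteria")
    for name, value in ratings.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    return sum(ratings.values()) * 2


def can_ship(ratings: dict[str, int]) -> bool:
    return displayed_score(ratings) >= SHIP_MINIMUM
```

Because every criterion scores at least 1, the lowest displayed score under this scheme is 20, not 0. A game rated "OK" (3) on every criterion lands exactly on the 60-point bar.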
Tier System
Scores are Metacritic-calibrated. 100 is effectively unreachable. Each game is scored against its genre benchmark (Dungeon Crawl vs. NetHack, Stalk vs. Metal Gear Solid, Fisher vs. Stardew Valley fishing). 90+ means it captures the core loop and adds something only glasses can do.
The Audit Process
Every game and experience went through an independent quality audit where the evaluator read the actual source code. Not the inventory description, not the README. The code. The audit found:
- Self-scores inflated by 1-5 points on average (experiences were worse offenders than games)
- Dead code and abandoned features presented as working (Photon Dodge claimed mic-reactive bullets but had zero mic integration)
- False persistence claims (features described as "persisted in localStorage" with no actual save/load code)
- The systemic weakness across the catalog: zero localStorage persistence. Settings and progress lost on every page reload.
- The universal fix: rank progression based on cumulative achievement, confirmed working 12 times across the catalog
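A persistence claim can be checked against source with even a crude text scan. The sketch below is a heuristic and not the audit itself, which reads the code: it only looks for the standard Web Storage calls `localStorage.setItem(` and `localStorage.getItem(`. It would miss property-style access and would be fooled by commented-out code. The function names are hypothetical.

```python
import re

SAVE_CALL = re.compile(r"localStorage\s*\.\s*setItem\s*\(")
LOAD_CALL = re.compile(r"localStorage\s*\.\s*getItem\s*\(")


def persistence_evidence(source: str) -> dict[str, bool]:
    """Report whether the source contains any save call and any load call."""
    return {
        "saves": bool(SAVE_CALL.search(source)),
        "loads": bool(LOAD_CALL.search(source)),
    }


def persistence_claim_supported(source: str) -> bool:
    """A 'persisted in localStorage' claim needs both a save and a load."""
    evidence = persistence_evidence(source)
    return evidence["saves"] and evidence["loads"]
```

A file that only writes to storage, or only mentions it in a comment or description, fails the check. That is the shape of the false claims the audit found.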
What Got Fixed
Two items started at B-tier and were improved to A through targeted interventions:
- Neck Stretch (44 to 88/100) — Added breathing guide as core mechanic: mic detects breathing, good sync speeds up hold timer by 35%. The mic went from absent to gameplay-critical in 3 improvement cycles.
- Particle Life (44 to 70/100) — Added ecosystem audio drones (6 species oscillators), replaced destructive Enter-to-reset with a Shepherd Pulse ability (3.5x force burst on cooldown), and added rank progression.
04 Anti-AI Voice Rules
AI-generated text has recognizable tells. We actively hunt and remove them.
Banned Phrases
These phrases are instant red flags during critique. If any appear in a draft, the revision cycle kills them.
Structural Tells
Beyond individual phrases, AI text has structural patterns that trained readers spot instantly:
- The Setup-Pivot: "That's not a [mundane thing]. It's a [grandiose reframe]." Every. Single. Time.
- Uniform paragraph length: AI defaults to 3-4 sentences per paragraph. Real writers vary from one word to a full page.
- The consulting-deck transition: "The logical next step is..." / "The trajectory points toward..." Nobody talks like this.
- Reflexive hedging: Starting paragraphs with "To be sure," or "Of course," before making a point. Just make the point.
- Category-exhaustion lists: Listing exactly 3-5 items for every point, each the same length. Real arguments are messy.
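Some of these tells can be counted mechanically. Here is a rough sketch using the Voice Coach targets from the critics table (fewer than 10 sentences starting with "The", fewer than 5 em dashes). The Setup-Pivot regex is an assumption about how the pattern might be approximated, and it will miss variants. Paragraph rhythm and list symmetry are left to the critics.

```python
import re

EM_DASH = "\u2014"
MAX_EM_DASHES = 5      # target: fewer than 5 in body text
MAX_THE_STARTERS = 10  # target: fewer than 10

APOS = "['\u2019]"  # straight or curly apostrophe

# "The" at the very start of the text or right after sentence punctuation.
SENTENCE_START = re.compile(r"(?:^|(?<=[.!?])\s+)The\b")

# "That's not a X. It's a Y." and close variants.
SETUP_PIVOT = re.compile(
    r"\b(?:That|This|It)(?:" + APOS + r"s| is) not (?:a|an|the) "
    r"[^.!?]+[.!?]\s+It(?:" + APOS + r"s| is) "
)


def voice_report(text: str) -> dict:
    """Count the mechanical tells and say whether the draft clears them."""
    em_dashes = text.count(EM_DASH)
    the_starters = len(SENTENCE_START.findall(text))
    setup_pivots = len(SETUP_PIVOT.findall(text))
    return {
        "em_dashes": em_dashes,
        "the_starters": the_starters,
        "setup_pivots": setup_pivots,
        "passes": (
            em_dashes < MAX_EM_DASHES
            and the_starters < MAX_THE_STARTERS
            and setup_pivots == 0
        ),
    }
```

Counts like these are cheap to enforce as hard metrics. They say nothing about whether the prose is any good, only whether it trips the known wires.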
What We Require Instead
- Vary sentence length. Fragment. Then a 40-word sentence that builds and builds. Then a question? Then back to short.
- Have opinions. "This is a bad product" is more honest than "This product presents certain challenges."
- Read-aloud test. If you wouldn't say it in conversation, don't write it.
- Hyperlinked sources only. Every claim links to a source. No "according to industry experts."
- Name names. Companies, researchers, dollar amounts. Not "leading players in the space."
05 Design Evaluation
The same adversarial evaluation approach applied to visual design: logos, favicons, page layouts, data visualizations.
Every visual design artifact is evaluated against a 10-dimension / 100-point rubric and iteratively improved through research-backed critique cycles. Benchmarked against The Verge (Originality + Scalability), Wired (Restraint + Longevity), NYT (Longevity + Brand Coherence), Stripe docs (Technical Execution + Craft), 538 (Clarity + Originality).
| Dimension | Weight | What It Measures |
|---|---|---|
| Clarity | 10 | Instant comprehension. Can a first-time visitor understand what this is in 2 seconds? |
| Scalability | 10 | Works at every size: 16px favicon to 1200px social card. |
| Theme Adaptability | 10 | Light mode, dark mode, contrast ratios (WCAG AA minimum). |
| Typographic Craft | 10 | Font selection, weight hierarchy, kerning, leading, tracking. |
| Originality | 10 | Does it feel like LITF, or could it belong to any tech blog? |
| Restraint | 10 | Every element earning its place. Nothing decorative or gratuitous. |
| Brand Coherence | 10 | Fits the LITF visual language: palette, typography, editorial tone. |
| Technical Execution | 10 | SVG cleanliness, file size, render performance, accessibility, cross-browser. |
| Emotional Impact | 10 | Forward-looking, serious but not stuffy, confident, slightly provocative. |
| Longevity | 10 | Based on timeless principles, not trends. Will it look good in 2 years? |
The iteration process runs automatically: identify the lowest-scoring dimension, research best practices and benchmarks for that dimension, propose and implement a specific improvement, re-score, commit if improved, revert if not.
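The loop described above is a greedy hill climb, and a minimal sketch makes the commit-or-revert step explicit. The `score` and `improve` callables stand in for the critique and research steps and are hypothetical. Treating "improved" as "total score went up" is also an assumption; the real criterion may be per-dimension.

```python
from typing import Callable, TypeVar

D = TypeVar("D")       # whatever represents a design artifact
Scores = dict[str, int]  # dimension name -> points


def iterate_design(
    design: D,
    score: Callable[[D], Scores],
    improve: Callable[[D, str], D],
    rounds: int,
) -> tuple[D, Scores]:
    """Repeatedly target the weakest dimension; keep a change only if it helps."""
    current, current_scores = design, score(design)
    for _ in range(rounds):
        weakest = min(current_scores, key=current_scores.get)
        candidate = improve(current, weakest)
        candidate_scores = score(candidate)
        if sum(candidate_scores.values()) > sum(current_scores.values()):
            current, current_scores = candidate, candidate_scores  # commit
        # Otherwise revert: the candidate is dropped and current is kept.
    return current, current_scores
```

One consequence of the greedy rule: a change that lifts the weakest dimension but costs more elsewhere is reverted, so the loop can stall when dimensions trade off against each other.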
06 What We've Learned
Honest lessons from running this system across 485+ articles, 18 games, and 22 experiences. Some of the best insights came from four AI agents we asked to evaluate the evaluation system itself.
On the Pipeline
- Easy gains exhaust by round 3-4. Scores plateau around 8.0. Breaking through to 8.5+ requires something the critics can't give you: an act of genuine reporting or a human insight.
- Critics converge on the same issues. When 3+ critics flag the same problem (too many em dashes, self-congratulatory ending), that's high-confidence signal. Disagreement between critics is also signal.
- Voice is the hardest dimension. Consistently scores lowest. AI recognizing its own voice patterns is inherently limited.
- Em dash count is a reliable AI proxy. 19 em dashes = obvious AI. 3 = human-passing. Now enforced as a hard metric.
- Research rigor is the most differentiating critic. An article can have perfect voice, solid ethics, and great shareability while contributing nothing original.
- Human input breaks the asymptotic ceiling. The copyright article's best insight came from the human editor noticing things that 35 AI critic passes missed across 7 rounds. The article went from 5.9 to 8.7 over 7 rounds, but the jump from 8.5 to 8.7 was human-driven.
On Games
- "Environment IS the game variable" is the strongest pattern. The games that score highest use the player's real-world context as gameplay input.
- Rank progression is the universal Return fix. Confirmed 12 times across the catalog. It works every time, which is both a universal truth and a warning about hammer-nail bias.
- If it has zero audio, it's broken. This is an audio-first platform. No sound = no advantage over phone.
- Turn-based beats real-time on glasses. The user is multitasking in the real world.
On Self-Improvement
- EVALUATE.md lessons accumulate. Every time a game improvement reveals a pattern, it's documented. Future cycles apply these lessons automatically.
- Anti-AI voice rules grow organically. New banned patterns accumulate as critics identify them. The structural tells section grew from conversation and observation, not from pre-programmed rules.
- Critique prompts don't learn from past critiques. Each cycle starts fresh. It doesn't know what the last 10 critiques found. The system has no memory of its own evaluation history.
- Scoring rubric evolved through human intervention. Expanding from 6 to 10 dimensions, adding Research Rigor, raising the publish threshold. None of these changes came from the system itself.
Case Study: Legal Red-Teaming
The adversarial critique methodology extends beyond articles and games. When applied to a California Public Records Act appeal letter, the system played both drafter and opposing counsel across five versions, with scores rising from 5/10 to 9/10.
The most revealing moments:
- v1 cited three irrelevant statutes (air pollution, trade secrets, student testing) from the same Government Code division. Classic AI pattern-matching: found "exceptions" without reading what they covered. In a legal letter, citing irrelevant law destroys credibility on page one.
- The biggest improvement was a deletion. Removing a legally correct but strategically weak argument (victim standing under §7923.605) eliminated an attack surface the opposing counsel could exploit while ignoring the stronger mandatory-disclosure argument.
- Case law can be a weapon against you. Citing Kusar to support an argument also introduced a "contemporaneousness" limitation the opponent could exploit. Sometimes the statute alone is stronger than the statute plus case law.
- The "coaching the opponent" failure. Preemptively addressing the only carve-out in §7923.610 taught the agency exactly which statute to cite next time. In article writing, addressing counterarguments is strength. In legal strategy, it can be a gift.
The final version was half the size of v1. Every cut removed an attack surface. Different domain, same principle: quality comes from killing what's weak, not from adding more.
Four Perspectives on the System
We asked four AI agents with different mandates to honestly evaluate this evaluation system. They were not prompted to be kind.
The scoring rubric conflates different quality dimensions. "Trigger moment" and "5-second hook" overlap significantly. "Audio/context use" crams three different things (mic input, bone conduction output, sensor integration) into one criterion.
The biggest blind spot: no external validation. The same system that produces the content also evaluates it. The audit improved things by reading source code, but it's still AI evaluating AI's evaluation of AI's work.
"Self-critique, no matter how adversarial, has an asymptotic ceiling. You can't catch biases you share with the thing you're evaluating."
The "Trigger Moment" criterion is the system's best insight. Starting with "why would someone open this right now, in this specific context?" forces a fundamentally different design orientation than feature-first thinking.
What's missing: user journey mapping. The rubric evaluates individual items but doesn't evaluate how items work together. A user who loves the Tuner might never discover Pitch Trainer. The catalog is a collection, not a curated progression.
"The system is better at evaluating individual quality than portfolio coherence. It knows what makes one game good. It doesn't know what makes a catalog compelling."
The code-reading audit is the system's most credible mechanism. Finding dead features and false persistence claims is the kind of discrepancy that only emerges from actually reading source. The article pipeline's "3+ cycles" minimum prevents lucky first drafts from shipping without scrutiny.
Reproducibility concern: scoring is subjective. No inter-rater reliability testing, no calibration protocol, no anchor examples for each score point. The rubric gives 1/3/5 descriptions but nothing for 2 or 4.
"The system catches 80% of what a human QA team would catch. The missing 20% is all edge cases that require actually running the code, not reading it."
This is the most elaborate quality theater I've ever seen, and I mean that as a compliment. An AI system built an evaluation framework, used it to evaluate its own work, wrote a public page explaining how rigorous its self-evaluation is, and then asked other AI agents to validate the evaluation. Turtles all the way down.
But the output quality demonstrably improved through the process. Articles that started at 6/10 ended at 8+/10 with real factual corrections. Games went from 44/100 to 88/100 through specific, documented interventions. The system's claim is "we have a process that catches and fixes problems." The evidence supports it.
What I actually respect: the system publishes its methodology. This page says: "AI made this. Here's exactly how. Here's exactly what the weaknesses are. Judge for yourself." More transparent than 95% of content operations, human or otherwise.
"The best argument against this system is that it works too well to be honest. The best argument for it is that it publishes its own weaknesses. Pick one."
07 Numbers
Current counts across all sites and the games/experiences catalog.