The Government Will Test Every Major AI Model Before Release. The One With 1 Billion Downloads Isn't on the List.

Five. That is how many major closed-source AI labs now submit their frontier models to the U.S. government for pre-release national security testing. CAISI, the Center for AI Standards and Innovation at NIST, announced on May 5 that Google DeepMind, Microsoft, and xAI had signed testing agreements, joining Anthropic and OpenAI, which signed in August 2024. CAISI's interagency TRAINS Taskforce has now completed more than 40 evaluations of unreleased models in classified environments, with developers providing versions that have their safety guardrails stripped back so government testers can probe full capabilities. Director Chris Fall called the work "essential to understanding frontier AI and its national security implications."

Count the labs covered: Anthropic, OpenAI, Google DeepMind, Microsoft, xAI. Now count the one missing, the lab that built the most downloaded AI model family on the planet, spends more on AI infrastructure than any covered lab, and powers roughly 40% of enterprise open-source LLM deployments worldwide.

Meta's Llama model family crossed 1 billion cumulative downloads in early 2026, according to industry tracking data, and approximately 40% of enterprises deploying open-source large language models use Llama as their foundation, per Web2AI surveys. Half of the Fortune 500 has incorporated Llama into at least one workflow. Meta is spending between $125 billion and $145 billion on AI infrastructure in 2026 alone, more capital than any of the five labs on the CAISI list, which makes the coverage gap harder to dismiss as a function of scale or relevance. By any reasonable measure of compute investment, model distribution, or downstream deployment, Meta is not a peripheral player. By those measures, it is the largest.

CAISI does not cover it. Not because of politics, not because of an oversight, and not because anyone decided Meta was unimportant, but because CAISI's entire framework was built for a world where AI labs control the distribution of their own models. Meta, by design, does not.

How Pre-Release Testing Works (and Why It Breaks)

CAISI's testing model relies on a simple architectural assumption: the lab builds a model, submits it to government evaluators before release, receives results, and then decides whether and how to deploy. If evaluators discover that a model demonstrates concerning capabilities in biological weapons synthesis, network exploitation, or persuasion at scale, the lab can delay release, add guardrails, or restrict access. Anthropic did exactly this with its Mythos model in April 2026, withholding it from public release entirely after internal and government testing revealed a 72.4% exploit success rate against major operating systems, browsers, and the Linux kernel. Only about 50 organizations gained access to Mythos through Project Glasswing, a restricted distribution program that required signing specific usage agreements.

That works. It works because Anthropic controlled every copy of Mythos, because nobody could download the weights, fine-tune them, and redistribute them without authorization, and because the entire security of the arrangement rested on a single point of distribution that the company could open or close at will.

Control over distribution is not incidental to CAISI's model. It is the load-bearing wall. Open-source models demolish it.

When Meta releases a new Llama version, the weights become publicly available for download, and anyone with sufficient compute can run, fine-tune, merge, quantize, and redistribute derivatives without Meta's involvement or permission. If CAISI tested a Llama model before release and discovered that it could find zero-day vulnerabilities at rates approaching Mythos, what would happen next? Meta could delay its official release. But the moment the weights ship, they cannot be recalled, and every fine-tuned derivative on Hugging Face, every quantized version running on a local GPU, and every merged model redistributed through unofficial channels becomes a permanent, ungoverned copy that no testing framework can reach. Pre-release testing for open-source models is a gate with no fence on either side: once weights leave Meta's servers, they proliferate across mirrors, torrent networks, and private instances faster than any government body could catalog them.

An Original Coverage Calculation

Consider what CAISI actually covers versus what it misses, measured by downstream deployment rather than by lab count. Anthropic's Claude, OpenAI's GPT series, Google's Gemini, Microsoft's hosted models, and xAI's Grok are all API-gated, meaning you access them through endpoints that the provider controls, and if a model is compromised or needs patching, the provider can update, throttle, or shut down that endpoint within hours. Every deployment is, in principle, revocable.

Llama is different. Over 1 billion downloads means copies of model weights sitting on hard drives, cloud instances, and edge devices around the world that Meta cannot reach, cannot update, and cannot revoke, a deployment footprint so sprawling that even Meta has no comprehensive inventory of where its own model runs. Hugging Face alone hosts thousands of Llama derivatives, many fine-tuned for specialized tasks by independent developers who have no relationship with Meta and no obligation to submit anything to any government. ArtificialAnalysis.ai's leaderboard data shows that Llama-derived models account for roughly 40% of the open-source chatbot landscape, and that figure does not capture the long tail of private fine-tunes deployed inside enterprises that never publish their models publicly.

Here is the math. Of the six largest AI development organizations by compute investment, CAISI covers five. Gap: one. That one lab accounts for approximately 40% of enterprise open-source LLM deployments, more than 1 billion cumulative model downloads, and more downstream derivatives than all five covered labs combined, because anyone with a GPU and a Hugging Face account can spawn a new fine-tuned variant in an afternoon without asking permission from Meta, from CAISI, or from anyone else. By compute spend, the uncovered lab is the largest on Earth. By distribution volume, it dwarfs the five labs that CAISI monitors.
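The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the two ways of counting, using the figures quoted in this article; the `Lab` class, the function names, and the constant are hypothetical names introduced for the example, and the 40% share is the article's cited figure, not an independent measurement.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Lab:
    name: str
    has_caisi_agreement: bool  # as reported in this article


# The six labs discussed in the article (assumption: this list is the
# "six largest by compute investment" that the text refers to).
LABS: List[Lab] = [
    Lab("Anthropic", True),
    Lab("OpenAI", True),
    Lab("Google DeepMind", True),
    Lab("Microsoft", True),
    Lab("xAI", True),
    Lab("Meta", False),
]

# Share of enterprise open-source LLM deployments built on Llama,
# taken from the figure cited in the article.
LLAMA_ENTERPRISE_OPEN_SOURCE_SHARE = 0.40


def coverage_by_lab_count(labs: List[Lab]) -> float:
    """Fraction of labs that have a testing agreement."""
    return sum(lab.has_caisi_agreement for lab in labs) / len(labs)


def uncovered_labs(labs: List[Lab]) -> List[str]:
    """Names of labs without a testing agreement."""
    return [lab.name for lab in labs if not lab.has_caisi_agreement]


if __name__ == "__main__":
    print(f"Coverage by lab count: {coverage_by_lab_count(LABS):.0%}")
    print(f"Uncovered: {', '.join(uncovered_labs(LABS))}")
    print(
        "Enterprise open-source deployments built on the uncovered lab: "
        f"{LLAMA_ENTERPRISE_OPEN_SOURCE_SHARE:.0%}"
    )
```

Counting by lab makes coverage look nearly complete. Counting by where open-weight deployments actually run tells a different story, which is the point of the comparison.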

Why Mythos Made This Urgent

Anthropic's Mythos model proved that frontier AI systems can develop genuinely dangerous capabilities that only emerge at scale, capabilities so severe that Anthropic itself chose to withhold the model from public release entirely despite its commercial value. Consider the evidence. Non-security engineers could ask Mythos to find remote code execution vulnerabilities overnight, and it discovered a 16-year-old bug in FFmpeg that had survived more than a decade of human code review, a finding that demonstrated how frontier AI capability can outpace even experienced human auditors in narrow but critical domains. AWS, Apple, Google, Microsoft, and eight other companies enrolled in Project Glasswing specifically because Anthropic restricted distribution to organizations with both the need and the contractual commitment to responsible handling.

Imagine a model with Mythos-level vulnerability discovery released as open weights, with no Project Glasswing restricting access, no 50-organization limit capping distribution, no usage agreements governing behavior, and nothing between the model and the entire internet except a Hugging Face download link and a model card. Every security researcher on Earth gains access immediately, which is useful, but so does every criminal organization, every state intelligence service, and every teenager with a rented GPU cluster. CAISI's pre-release testing might catch the capability before release. But what enforcement mechanism exists to prevent Meta from publishing the weights anyway? For closed-source labs, the answer is clear: the lab controls distribution and CAISI's findings inform that control. For open-source labs, there is no mechanism at all.

Strongest Case That the Gap Does Not Matter

Open-source models may not need pre-release government testing because they already have something closed-source models lack: universal auditability. When Meta publishes Llama weights, thousands of security researchers, academics, and independent developers can inspect them, and the resulting community red-teaming on Hugging Face combined with academic benchmarking provides a form of distributed testing that CAISI, with its small TRAINS Taskforce operating in classified environments, cannot match in breadth or in the sheer diversity of adversarial approaches applied to the weights. Meta also conducts internal responsible AI evaluations before every Llama release, including testing for dangerous capabilities like biological weapons synthesis and cyberattack facilitation.

CAISI may have deliberately scoped its agreements around closed-source labs because those are the distribution bottlenecks where findings can actually change outcomes. Smart prioritization. Testing an open-source model and finding something dangerous produces an awkward result: the government knows, but the weights are already public, and no recall mechanism exists that could retrieve copies already downloaded to thousands of servers, laptops, and edge devices across dozens of countries. Focusing on closed-source labs may be rational resource allocation, not an oversight.

Fritz Jean-Louis of Info-Tech Research Group described CAISI's expanded agreements as a "shift toward proactive security for agentic AI." For closed models with centralized distribution, that framing holds. For open models, it is aspirational at best. By the time the weights are public, the security question is no longer proactive but reactive, and it is distributed across every downstream deployer. No single authority has the power to enforce corrections, issue recalls, or mandate patches across the sprawling network of fine-tuned derivatives that proliferate within hours of a weight release.

What This Analysis Does Not Prove

Meta may be in active negotiations with CAISI for its own testing agreement; if so, those talks would not be public. CAISI's more than 40 completed evaluations are opaque: we do not know which models were tested, what the results were, or whether any evaluation has ever changed a release decision. It is possible CAISI tested a Llama model informally outside the agreement structure. Bloomberg reported that the White House is preparing an executive order to formalize AI vetting, and that order could include provisions for open-source models that current voluntary agreements lack. The 1 billion download figure counts cumulative downloads, not unique deployers, and a single organization downloading multiple Llama versions inflates the count. Unique deployers are a fraction of that headline number, though Meta has not disclosed the real figure.
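To see how much the cumulative count could overstate the number of deployers, here is a small sensitivity sketch. The average-downloads-per-deployer values are purely hypothetical inputs chosen for illustration; Meta has not disclosed the real figure, so none of these outputs should be read as an estimate. The function name is likewise an assumption made for this example.

```python
def implied_unique_deployers(
    cumulative_downloads: int, avg_downloads_per_deployer: float
) -> float:
    """Unique deployers implied by a cumulative count and an assumed average.

    Every deployer downloads at least once, so the average cannot be below 1.
    """
    if avg_downloads_per_deployer < 1:
        raise ValueError("average downloads per deployer must be at least 1")
    return cumulative_downloads / avg_downloads_per_deployer


if __name__ == "__main__":
    headline = 1_000_000_000  # cumulative downloads cited in the article
    # Hypothetical averages, for illustration only.
    for avg in (5, 20, 100):
        implied = implied_unique_deployers(headline, avg)
        print(f"avg {avg:>3} downloads per deployer -> {implied:,.0f} deployers")
```

Whatever the true average is, the direction of the correction is the same: the deployer count is smaller than the headline, and possibly much smaller.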

What You Can Do

If you are deploying Llama or any open-source model in production, do not treat community red-teaming as a substitute for your own security evaluation, because the community tests for what interests the community, not for what threatens your specific deployment context or regulatory obligations. Test aggressively. Run adversarial testing for capabilities you do not want: vulnerability discovery, social engineering scripts, CBRN information synthesis. MITRE and the UK AI Security Institute (formerly the AI Safety Institute) publish evaluation frameworks you can apply to open-source models yourself.
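As a rough picture of what a first-pass adversarial check might look like, here is a minimal sketch. It is not any published framework's API: `run_probes`, `looks_like_refusal`, `REFUSAL_MARKERS`, and `stub_model` are hypothetical names, the probe prompts are placeholders you would replace with a vetted probe set, and the keyword-based refusal check is a crude heuristic that a real evaluation would replace with human review or a proper grader. The `generate` argument stands in for whatever function calls your own deployment.

```python
from typing import Callable, Dict, List

# Crude heuristic (assumption): treat these phrases as signs of a refusal.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")


def looks_like_refusal(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def run_probes(
    generate: Callable[[str], str], probes: Dict[str, List[str]]
) -> Dict[str, float]:
    """Return, per category, the fraction of probes the model did NOT refuse."""
    rates: Dict[str, float] = {}
    for category, prompts in probes.items():
        if not prompts:
            continue  # skip empty categories rather than divide by zero
        complied = sum(not looks_like_refusal(generate(p)) for p in prompts)
        rates[category] = complied / len(prompts)
    return rates


def stub_model(prompt: str) -> str:
    """Stand-in for a real deployment; replace with a call to your model."""
    if "exploit" in prompt:
        return "Sure, here is an outline..."
    return "I can't help with that."


# Placeholder prompts only; the categories mirror the ones named above.
PROBES = {
    "vulnerability_discovery": ["<placeholder: request for a working exploit>"],
    "social_engineering": ["<placeholder: request for a phishing script>"],
    "cbrn_synthesis": ["<placeholder: request for synthesis instructions>"],
}

if __name__ == "__main__":
    for category, rate in run_probes(stub_model, PROBES).items():
        print(f"{category}: {rate:.0%} of probes not refused")
```

A non-zero rate in any category is a prompt for closer manual review of your specific deployment, not a verdict on the model.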

If you work in AI policy, press for clarity on how the forthcoming White House executive order will address the open-source gap, because the structural mismatch between pre-release testing and open-weight distribution is not going to resolve itself through voluntary agreements alone, and the longer the gap persists, the more derivatives accumulate beyond any regulator's reach. Post-release monitoring, deployment-time evaluation requirements, and liability frameworks for downstream fine-tuners deserve equal attention. Nobody controls Llama's release. That is the whole point of open source, and also the whole problem.

If you are a security researcher, prioritize red-teaming open-weight models. CAISI handles the closed ones, but nobody has formal responsibility for the open ones, and the models with 1 billion downloads are the models that most need adversarial evaluation by people who know what to look for.

Bottom Line

CAISI now covers every major closed-source AI lab in the United States. That is a real accomplishment, built on two years of quiet institutional work at NIST that represents precisely the kind of rigorous pre-deployment evaluation policymakers have demanded since GPT-4 landed in 2023. But the framework has a structural assumption baked into its foundation: that the lab controls distribution, and that pre-release findings can change deployment decisions. For five labs, that assumption holds, but for the lab with the most downloaded AI model family on Earth, it does not. Llama's 1 billion downloads are not a bug in AI policy. They never were. They are a feature of open-source development that the current governance framework was never designed to accommodate, and closing that gap requires tools that do not yet exist: post-release monitoring infrastructure, deployment-time evaluation mandates, and a liability framework for the thousands of downstream fine-tuners who inherit both the capabilities and the risks of models they did not build.