Half of GitHub's Code Is Now Written by AI. It Has 70% More Bugs.

Fifteen million. That is how many developers now use GitHub Copilot, according to GitHub's own disclosure from March 2026. Combine that adoption figure with three others: ninety percent of Fortune 100 companies have deployed the tool; the average Copilot user generates 46% of their committed code through it (rising to 61% for Java developers); and a ScriptWalker/Stanford-MIT analysis of public GitHub commits now puts the overall AI-generation figure at 51% and climbing. The conclusion is simple. Machines write more code than humans on the world's largest code hosting platform.

A JetBrains 2026 developer survey confirms the behavioral shift: 68% of professional developers use an AI coding assistant every single day. Not occasionally. Daily.

What happens when the majority author produces measurably worse output? Five independent research efforts answered that question in the first four months of 2026. They agree.

The Data: Five Studies, One Pattern

CodeRabbit's State of AI vs. Human Code Generation report scanned 470 open-access GitHub repositories and compared pull requests by origin. It found that AI-generated code produced 1.7 times as many bugs overall as human-written code, with the multiplier for critical and major issues ranging from 1.3x to 1.7x depending on severity classification. Logic errors ran 75% higher than the human baseline, at 194 per 100 pull requests in AI output. Concurrency bugs doubled, dependency errors doubled, and error handling failures doubled too, revealing worse null pointer discipline, fewer early returns, and weaker defensive coding patterns across the board.

Readability tells its own story: AI code showed three times as many readability issues, 2.66 times more formatting problems, and double the rate of naming inconsistencies. These are not cosmetic concerns. A function named processData tells a maintenance engineer nothing about what it actually does. Suppose that same opaquely named function also contains a subtle race condition, one the original developer never caught because they accepted the Copilot suggestion with a tab keystroke in 0.4 seconds without reading past the first line. At that point the readability deficit becomes a security deficit that propagates through every system that calls the function.

Performance caught everyone off guard. AI-generated code performs excessive I/O operations eight times as often as human-written code. Eight times. In distributed systems where a single unnecessary database round-trip adds 50 milliseconds of latency per request, that multiplier translates directly into degraded user experience and higher infrastructure costs at scale. It also produces the kind of subtle performance regression that takes months to diagnose, because the code looks correct on visual inspection even though it is hammering the database with queries that a human would never have written.

The OWASP Top 10 study took a different approach: it tested six major language models specifically against the OWASP vulnerability categories. Roughly one in four: 25.7% of the AI-generated code samples contained confirmed security vulnerabilities. GPT-5.2 performed best at 19.5%, while Claude Opus 4.6, DeepSeek V3, and Llama 4 Maverick tied for worst at 29.9% each. Broken Access Control dominated with 65 individual instances.

If that rate holds in practice, a developer who accepts AI-generated code involving authentication, authorization, or session management faces roughly a one-in-four chance that the code ships with a vulnerability from OWASP's list of the ten most critical web application security risks.

Cobalt's 2026 Penetration Test Report extended the analysis beyond code quality into real-world exploitation, and the news gets worse. AI-introduced security flaws are 2.5 times more dangerous than traditional bugs, measured by exploitability and blast radius. Only 38% of high-risk AI-related issues were addressed after disclosure. One in five organizations surveyed had already experienced an LLM-related security incident in the preceding twelve months.

Second Talent's quality metrics study confirmed the pattern from yet another angle: 1.7x more total defects, 1.64x more maintainability errors, 1.57x more security findings. Their most sobering number: the top-performing AI coding models score just 39.6% on SWE-bench, the benchmark that measures performance on real-world software engineering tasks drawn from actual GitHub issues. In other words, the best AI coder in the world fails six out of ten times on the kinds of problems that junior developers solve during their first month on the job. Trust level? Three percent. Only 3% of developers fully trust AI-generated code without review.

The Bug Injection Rate: A Calculation Nobody Published

Each of these studies measured relative defect rates. None modeled the net impact on the entire software supply chain, and nobody ran the multiplication, so here it is.

Start with a pre-AI baseline: all code human-written, normalized bug rate of 1.0 per unit. One hundred units of new code, one hundred bugs.

Now apply the current state: 51% of new code is AI-generated at 1.7x the human bug rate. Human-written code fills the remaining 49% at the baseline rate of 1.0.

Source	Share of Code	Bug Rate	Bugs per 100 Units
AI-generated	51%	1.7x	86.7
Human-written	49%	1.0x	49.0
Total (current)	100%		135.7
Total (pre-AI baseline)	100%		100.0

Net increase: 35.7% more bugs entering the global codebase from the compositional shift alone.
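The multiplication in the table can be reproduced in a few lines. What follows is a minimal sketch, not a published model: the function name blended_bug_rate is made up for illustration, and the 70% and 80% rows assume the 1.7x multiplier stays fixed as AI share rises, which none of the studies establishes.

```python
def blended_bug_rate(ai_share: float, ai_multiplier: float) -> float:
    """Bugs per 100 units of new code, relative to an all-human baseline of 100.

    ai_share      -- fraction of new code that is AI-generated (0.0 to 1.0)
    ai_multiplier -- AI bug rate relative to the human rate of 1.0
    """
    human_share = 1.0 - ai_share
    return 100.0 * (ai_share * ai_multiplier + human_share * 1.0)


# Current state from the table: 51% AI-generated at 1.7x the human bug rate.
current = blended_bug_rate(ai_share=0.51, ai_multiplier=1.7)
net_increase_pct = current - 100.0
print(f"bugs per 100 units: {current:.1f}")     # 135.7
print(f"net increase: {net_increase_pct:.1f}%")  # 35.7%

# Assumption: the multiplier holds at 1.7x if AI share reaches 70% or 80%.
for share in (0.70, 0.80):
    projected = blended_bug_rate(share, 1.7)
    print(f"AI share {share:.0%}: {projected:.1f} bugs per 100 units")
```

Because the model is linear in AI share, each additional percentage point of AI-generated code adds 0.7 bugs per 100 units for as long as the 1.7x multiplier holds.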

For security vulnerabilities specifically, using Second Talent's 1.57x security finding rate:

Source	Share	Security Flaw Rate	Flaws per 100 Units
AI-generated	51%	1.57x	80.1
Human-written	49%	1.0x	49.0
Total (current)	100%		129.1

A 29.1% increase in security vulnerabilities, and that is the conservative estimate, because this calculation assumes the total volume of code produced stays constant. It does not. AI tools generate code faster, so more total lines ship per quarter, and that throughput multiplier amplifies the bug injection rate on top of the per-unit quality decline. Modeling the amplification precisely requires aggregate code volume data that GitHub has not disclosed.
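To see how throughput would compound the per-unit figure, here is a rough sketch. The throughput values below are purely hypothetical placeholders, since the real volume data is undisclosed; only the 51% share and the 1.57x rate come from the studies cited above, and the function name is illustrative.

```python
def flaws_vs_baseline(ai_share: float, ai_multiplier: float,
                      throughput: float = 1.0) -> float:
    """Security flaws shipped, relative to 100 at the pre-AI baseline volume.

    throughput -- total code volume relative to the pre-AI baseline
                  (1.0 means the same amount of code ships as before).
    """
    per_unit = ai_share * ai_multiplier + (1.0 - ai_share) * 1.0
    return 100.0 * per_unit * throughput


# Constant volume reproduces the table: 129.1 flaws per 100 baseline units.
print(f"{flaws_vs_baseline(0.51, 1.57):.1f}")

# HYPOTHETICAL throughput multipliers, for illustration only.
for throughput in (1.10, 1.25):
    total = flaws_vs_baseline(0.51, 1.57, throughput)
    print(f"throughput x{throughput:.2f}: {total:.1f} flaws vs. baseline 100")
```

The two effects multiply rather than add, which is why even a modest rise in code volume would widen the gap beyond the constant-volume estimate.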

Why The Bugs Are Hard to Find

The three-times readability penalty does not exist in isolation. It compounds the bug problem directly.

A human developer who writes a race condition does so within code they structured, named, and formatted according to their own mental model. Another human reading that code can follow the naming conventions and control flow and spot the divergence where the logic breaks. The reviewer might still miss the bug, but they can at least read the code.

AI-generated code is different: it looks superficially clean, but variable names are generic, functions do too many things, and abstractions are borrowed wholesale from training data rather than designed for the specific problem at hand. When a security flaw hides inside code that is also poorly named, inconsistently formatted, and making eight times more I/O calls than necessary, the reviewer has to fight through three layers of opacity before they can even begin to assess whether the logic is correct. By that point attention has eroded, and the subtle null pointer exception on line 847 sails through review untouched.

This is why only 3% of developers fully trust AI output without review, yet 68% use it daily: they know the code needs checking. The question is whether they check every suggestion with sufficient rigor, across every Copilot autocomplete they accept with a tab keystroke, eight hours a day, fifty weeks a year, while context-switching between five Jira tickets and a Slack thread about the production incident from this morning. They do not.

The Counterargument at Full Strength

The strongest case against alarm goes like this: AI coding tools are improving fast, the 2025 measurement baseline may be obsolete within twelve months, and the comparison is structurally unfair.

Here is why: AI code disproportionately handles boilerplate, scaffolding, and repetitive patterns where bugs have lower blast radius. Humans still handle complex logic, architectural decisions, and edge-case reasoning where a single flaw can crash a production system. So the aggregate 1.7x multiplier might overstate real-world impact if AI bugs cluster in low-consequence code paths while human bugs cluster in high-consequence ones, because a null pointer exception in an auto-generated serialization helper is not the same animal as a broken authentication check in a hand-crafted payment flow.
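The blast-radius objection can be put in numbers. The sketch below is a deliberately simplified model and not something any of the studies computed: it assumes human bugs keep an average severity weight of 1.0, assigns AI bugs a hypothetical average weight, and asks how low that weight must be for severity-weighted impact to match the all-human baseline.

```python
AI_SHARE = 0.51       # share of new code that is AI-generated
AI_MULTIPLIER = 1.7   # AI bug rate relative to human

def weighted_impact(ai_severity: float,
                    ai_share: float = AI_SHARE,
                    ai_multiplier: float = AI_MULTIPLIER) -> float:
    """Severity-weighted bug impact relative to an all-human baseline of 1.0.

    ai_severity -- assumed average severity of an AI bug relative to a
                   human bug (hypothetical; 1.0 means equally severe).
    """
    ai_part = ai_share * ai_multiplier * ai_severity
    human_part = (1.0 - ai_share) * 1.0
    return ai_part + human_part


# Solving ai_share * ai_multiplier * w + (1 - ai_share) = 1 for w
# gives w = 1 / ai_multiplier, independent of the AI share.
break_even = 1.0 / AI_MULTIPLIER
print(f"break-even AI severity weight: {break_even:.2f}")
print(f"impact at break-even: {weighted_impact(break_even):.3f}")
print(f"impact at equal severity: {weighted_impact(1.0):.3f}")
```

Under these assumptions, the average AI bug would have to be about 41% less consequential than the average human bug for the net effect to be neutral.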

Fair, and that deserves engagement, but it runs into two walls.

First: the OWASP study tested models specifically on security-critical code paths, including authentication, access control, and injection prevention, and still found a 25.7% vulnerability rate. Bugs are not confined to boilerplate. Second: the 8x I/O excess measured by CodeRabbit affects system-level performance regardless of where the offending function sits in the application hierarchy. A single function making unnecessary database calls in a hot path can dominate the latency profile of an entire request pipeline serving millions of users per hour. The blast-radius argument does not account for this, because it assumes AI bugs stay contained within the code that generated them rather than leaking through performance degradation into every service that depends on that code. Evidence suggests they leak.

Limitations

Constraints abound. The ScriptWalker/Stanford-MIT 51% figure comes from GitHub commit analysis, which may overcount AI involvement, because a developer who accepts a three-line Copilot completion inside a 200-line function that they otherwise wrote entirely by hand is not producing "AI-generated code" in the same sense as someone who prompts Claude for an entire module and pastes the output verbatim. CodeRabbit's 470 repositories are open-access. Enterprise codebases with stricter review processes and mandatory static analysis may show different patterns entirely. Bug detection methods vary across the five studies. Some count style violations that would never cause a production incident.

"AI-generated" exists on a spectrum that runs from minor completions through multi-line suggestions to whole-function generation to entire modules prompted from scratch, and no study cleanly isolates these segments, though the bug rates likely differ substantially across them.

The bug injection rate model assumes a clean replacement, with human code swapped for AI code at the same task distribution. Reality is messier. AI may be generating code for tasks that humans would not have written at all, expanding the total codebase rather than substituting within it, which would make the before-and-after comparison less direct. Without GitHub disclosing aggregate commit volume trends broken down by AI involvement level, the throughput amplification factor remains a known unknown.

What You Can Do

If you manage an engineering team: Mandate review protocols that go beyond visual scanning, run static analysis on every PR regardless of origin, and consider tagging AI-generated code for additional scrutiny wherever your tooling can identify it. Cobalt's 2.5x danger multiplier from pentesting data suggests your existing triage rubric likely underweights AI-introduced flaws.

If you are a developer: Treat every AI suggestion as untrusted input, the same discipline you would apply to user-submitted data in a web form, and read it line by line. If you cannot explain what a function does from its name and structure alone, rewrite it until you can. The 3x readability gap means the code AI gives you is optimized for looking correct at a glance, not for being maintainable eighteen months from now, when the original context has evaporated and a junior engineer is trying to figure out why the function named handleUpdate sometimes drops writes under concurrent load.

If you are a CISO or security lead: The 29.1% net increase in security vulnerabilities is not a forecast. It is a modeled estimate of the current state, built from today's adoption share and defect rates. Audit your organization's AI coding tool adoption rate, cross-reference it with vulnerability discovery metrics from the past twelve months, and look for correlation. If penetration test findings are trending upward while headcount is flat, AI-generated code is a plausible contributing factor. Cobalt's finding that only 38% of high-risk AI issues get remediated suggests the problem compounds quarter over quarter.

If you are evaluating AI coding tools for procurement: Demand vendor-specific bug rate data. The OWASP study showed a 10.4 percentage point spread between the best-performing model and the worst. That gap is the difference between roughly one vulnerable suggestion in five and nearly three in ten, which makes model selection a security decision, not a features decision.
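For anyone checking the procurement arithmetic, a short sketch converts the OWASP study's reported rates into the spread and into "one in N" terms. The dictionary labels are ours; the three percentages are the ones quoted from the study.

```python
# Vulnerability rates as reported in the OWASP Top 10 study cited above.
rates = {
    "best model": 0.195,
    "worst models (tied)": 0.299,
    "all models combined": 0.257,
}

# Spread between best and worst, in percentage points.
spread_points = (rates["worst models (tied)"] - rates["best model"]) * 100
print(f"spread: {spread_points:.1f} percentage points")

# Express each rate as "about one vulnerable sample in N".
for label, rate in rates.items():
    print(f"{label}: {rate:.1%}, about one in {1 / rate:.1f}")
```

The best model works out to about one vulnerable sample in 5.1 and the worst to about one in 3.3, so the choice of model shifts the expected vulnerability count by roughly half.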

The Bottom Line

The software industry adopted AI coding tools faster than it developed the infrastructure to verify their output. That is the story. Fifteen million developers are using a tool that produces code with 70% more bugs. The aggregate effect, on the model above, is a 35.7% increase in defects flowing into the banking systems, medical devices, autonomous vehicles, and critical infrastructure that three and a half billion connected humans rely on every day. That number will grow as AI code share rises from 51% toward the 70% or 80% that current adoption curves suggest within two years. The 3% who refuse to trust AI output without review have the data on their side. Every tab keystroke is a bet. The house edge just got measured.

It favors the bugs.