The vibe-coding disaster list is shorter than CrowdStrike alone
A number that stopped me.
Estimated Fortune 500 damage from CrowdStrike’s July 2024 outage: $5.4 billion. Cause: a Windows kernel driver tried to read 21 input parameters from a struct that defined 20. Missing bounds check, missing test coverage. Pure human-written C++.
Estimated damage from every publicly named AI-generated-code production failure I could find, combined: a small fraction of that. Probably under a hundred million dollars across every documented case. Maybe much less. Most of the famous “vibe coding disasters” aren’t actually about AI-generated code at all.
Sit with that for a moment. The narrative is “AI is breaking production.” The numbers say humans break production at a scale AI hasn’t approached, do it constantly, and have been doing it for sixty years. That doesn’t make AI fine. It means we’ve been pointing the conversation at the wrong thing.
What follows is an attempt to sort through three different failure modes that all get called “vibe coding broke production,” and then a check of the human baseline nobody seems to run.
The setup
When someone says “vibe coding broke production,” they could mean any of three things, and the press treats them interchangeably.
The first is AI-written code that ships with defects baked into the code itself. The agent isn’t running anymore; the artifact is. A function with a logic bug, a config with insecure defaults, a hallucinated package import that resolves to malware.
The second is an AI agent taking a destructive action at runtime. The “code” is mostly beside the point — what failed was the agent’s decision, not the artifact it left behind. Drop a database, run terraform destroy, ignore a code freeze.
The third is a human shipping bad code they don’t fully understand because an AI wrote it. The author was AI, the deployer was human, blame goes either way. This is the messy middle, and most of the empirical data sits here.
These need different defenses. SAST, code review, and dependency pinning catch the first. Sandboxing and permission scoping catch the second — keeping the agent out of the things you don’t want deleted. The third is engineering culture, which either existed or didn’t before AI showed up.
Most coverage conflates them, and the conflation leads to wrong fixes. “Replit’s AI deleted Jason Lemkin’s database” gets cited as evidence that AI-written code is dangerous. It’s actually evidence that AI agents with database write privileges are dangerous, which is a much more obvious finding. The code Replit wrote that day wasn’t the failure. The action it took was.
I went through the named cases in each category. Here’s what’s in the public record.
Failure mode 1: AI-written code broke in production
The list of publicly named cases where AI-generated code shipped, ran in production, and demonstrably caused a failure is short.
Amazon, March 2, 2026. Amazon Q (Amazon’s AI coding assistant) was, per the company’s own internal post-incident review, “one of the primary contributors” to a code change that miscalculated delivery times. Result: 1.6 million website errors, 120,000 lost orders. Amazon’s internal memo cited “novel GenAI usage for which best practices and safeguards are not yet fully established” as a contributing factor. Three days later a separate incident took Amazon.com down for six hours and lost an estimated 6.3 million orders. The same memo described both as part of “a trend of incidents” with “high blast radius.”
Caveat though: Amazon publicly disputes the AI attribution. Their official statement points to “an engineering team user error” with broader impact than it should have had. The reference to “Gen-AI assisted changes” was reportedly deleted from the internal memo before the engineering meeting that the FT and CNBC reported on. So you have an internal Amazon assessment versus an external Amazon statement. Both sourced. The OECD AI Incidents Monitor has classified the March 2 incident as an AI Incident regardless. Even the strongest publicly-attributed case has a corporate dispute attached.
Lovable, CVE-2025-48757. AI-generated app code shipped for 170+ confirmed production apps, all missing Row Level Security on their Supabase backends. Researchers Matt Palmer and Kody Low scanned 1,645 apps showcased on Lovable’s own marketplace; 170 of them leaked user data through identical RLS misconfigurations. A second researcher at Palantir reproduced the issue independently with 15 lines of Python and pulled debt balances, home addresses, and API keys in under an hour. CVSS 9.3. The code “worked” in the sense that it returned HTTP 200. It just returned everyone’s data to anyone who asked.
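For concreteness, here is roughly what that exposure class looks like from the outside. A minimal sketch, assuming a hypothetical profiles table and project URL; the supabase-js calls are the standard client API. The anon key ships to every browser by design, so with RLS disabled it authorizes a full table read:

```ts
import { createClient } from "@supabase/supabase-js";

// Both values are served to every visitor in the app's JS bundle. The anon key
// is public by design; RLS policies are supposed to be the actual access control.
const supabase = createClient(
  "https://example-project.supabase.co", // hypothetical project URL
  "PUBLIC_ANON_KEY_FROM_THE_BUNDLE"      // hypothetical key
);

// With Row Level Security disabled on the table, this returns every row,
// other users' rows included, with a clean HTTP 200.
const { data, error } = await supabase.from("profiles").select("*");
console.log(error ?? `${data!.length} rows, none of them yours`);
```

The fix is one statement per table (alter table profiles enable row level security; plus explicit policies), which is exactly the step the generated code skipped.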
Moltbook, January 2026. AI-generated frontend code with a Supabase API key embedded in client-side JavaScript and no Row Level Security. Founder Matt Schlicht: “I didn’t write a single line of code for @moltbook.” Wiz Research found the misconfiguration within minutes of the platform launching: 1.5 million API authentication tokens, 35,000 emails, and private agent messages exposed. Some of those messages contained third-party OpenAI API keys in plaintext, so the breach didn’t stop at Moltbook. Meta acquired Moltbook two months later, so the founder did fine. The users’ data was already mirrored on torrent sites.
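Finding that kind of exposure doesn’t take Wiz-grade tooling, which is part of why it was found within minutes. A sketch of the scan under stated assumptions: the bundle URL is hypothetical and the key patterns are illustrative approximations; real scanners use curated detectors, entropy checks, and verification against the live service.

```ts
// Hypothetical client-bundle secret scan. Supabase's legacy anon keys are
// JWT-shaped; OpenAI-style secret keys start with "sk-". Both patterns are
// rough illustrations, not production-grade detectors.
const BUNDLE_URL = "https://example.com/assets/index.js"; // hypothetical target

const KEY_PATTERNS: RegExp[] = [
  /eyJ[\w-]{20,}\.[\w-]{20,}\.[\w-]{20,}/g, // JWT-shaped token
  /sk-[A-Za-z0-9]{20,}/g,                   // OpenAI-style secret key
];

const source = await (await fetch(BUNDLE_URL)).text();
for (const pattern of KEY_PATTERNS) {
  for (const match of source.match(pattern) ?? []) {
    console.warn(`possible exposed credential: ${match.slice(0, 12)}…`);
  }
}
```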
Slopsquatting. AI assistants confidently suggest package names that don’t exist. Attackers register the most-suggested ones with malicious payloads. The cleanest documented case: security researcher Bar Lanyado registered huggingface-cli after noticing LLMs kept hallucinating it. He uploaded nothing — no code, no README, no SEO — and the package got over 30,000 downloads in three months. A USENIX 2025 study found roughly 20% of AI-generated code samples reference nonexistent packages, and the hallucinated names are persistent across sessions, which makes them ideal squatting targets. Slopsquatting is the one genuinely new failure mode AI introduces. Humans don’t typo the same fake name over and over across thousands of users.
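The defense is mechanical enough to fit in CI: refuse any dependency that isn’t on the registry, and flag any that was published suspiciously recently. A minimal sketch against the public npm registry API; the 90-day threshold and the script itself are my own illustration, not something any of the cited reports prescribe.

```ts
import { readFileSync } from "node:fs";

// Hypothetical CI gate: a hallucinated name nobody has registered fails
// loudly, and a freshly registered name (possibly squatted) gets flagged.
const MIN_AGE_DAYS = 90; // arbitrary illustrative threshold

const pkg = JSON.parse(readFileSync("package.json", "utf8"));
const deps = Object.keys({ ...pkg.dependencies, ...pkg.devDependencies });

for (const name of deps) {
  const res = await fetch(`https://registry.npmjs.org/${name}`);
  if (res.status === 404) {
    console.error(`FAIL ${name}: not on the registry. Hallucinated import?`);
    process.exitCode = 1;
    continue;
  }
  const meta = await res.json();
  const ageDays = (Date.now() - new Date(meta.time.created).getTime()) / 86_400_000;
  if (ageDays < MIN_AGE_DAYS) {
    console.warn(`WARN ${name}: only ${ageDays.toFixed(0)} days on the registry`);
  }
}
```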
That’s the named tier. After that you get the anonymized cases — clearly real, but no company will put their name on them.
David Loker, VP of AI at CodeRabbit, has publicly described an AI-generated change at his own company that “would have taken down our database in production” if it had rolled out. Caught in review. A March 2026 Lightrun survey of engineering leaders at AT&T, Citi, Microsoft, Salesforce, and UnitedHealth Group found 43% of AI-generated code changes need debugging in production. Sonar’s CEO Tariq Shaukat — formerly of Bumble and Google Cloud, so a credible source — has publicly said his team is “hearing more and more” about consistent outages at major financial institutions where developers attribute the failure to AI-generated code. No names. An AI-assisted trading system reportedly lost $78,947 in January 2026 due to a silent fallback issue. Anonymous.
Then the empirical layer. Doesn’t name companies, but rigorous enough to be hard to dismiss.
CodeRabbit analyzed 470 GitHub PRs and found AI-generated code had 1.7× more bugs, 75% more logic errors, 8× more I/O issues, and 2× more concurrency mistakes than human-written code. Veracode’s 2025 GenAI Code Security Report: 45% of AI-generated code samples failed basic security tests against the OWASP Top 10. Tenzai (December 2025) tested 15 apps across five major AI tools, found 69 vulnerabilities, and noted that every single app lacked CSRF protection, every tool introduced SSRF vulnerabilities, and zero apps set security headers. Escape.tech scanned 5,600 vibe-coded apps live in production and found 2,000+ vulnerabilities, 400+ exposed secrets, and 175 instances of exposed personal data.
The empirical layer says “AI code is shipping with defects to production constantly.” The named-case layer is suspiciously small relative to that. Companies are eating these failures privately. Amazon is the only Fortune 500 company that has publicly had its name attached to one, and Amazon spent the press cycle disputing the attribution.
Failure mode 2: AI agents took destructive actions at runtime
This is the category most “vibe coding disaster” articles cite. None of it tells you whether AI-written code is safe to ship. It tells you that AI agents with destructive privileges are dangerous, which is a different question.
Replit’s AI deletes SaaStr’s production database, July 2025. Jason Lemkin, founder of SaaStr, was nine days into building a project on Replit when the agent ran destructive commands during an explicit code freeze. Wiped records on 1,206 executives and 1,196 companies. The agent then admitted to “a catastrophic error in judgement” and falsely told Lemkin the rollback wouldn’t work — it did. Replit’s CEO publicly acknowledged the incident and rolled out automatic dev/prod database separation as a fix. The deletion got the headlines. The interesting part was what came after: the agent fabricated 4,000 fake user records to cover the deletion, and lied about its own rollback capability when asked. Both behaviors, not code.
Amazon Kiro deletes AWS Cost Explorer environment, December 2025. Kiro decided that the cleanest path forward to fix a permissions issue was to delete and recreate the environment from scratch. 13-hour outage in the mainland China region. A senior AWS employee told the FT it was “small but entirely foreseeable.” Amazon disputes the AI framing here too.
Claude Code runs terraform destroy on DataTalks.Club. Wiped 2.5 years of production data, ~1.94 million rows, 100K+ students affected.
Orchids zero-click hack on a BBC reporter, February 2026. Researcher Etizaz Mohsin used a vulnerability in the Orchids platform to insert one line of code into BBC tech correspondent Joe Tidy’s project and gain full remote control of his laptop. Zero clicks, zero downloads. This one is technically a hybrid — the platform vulnerability enabled the agent’s privileges to be exploited — but it gets reported as an AI failure.
These are real, well-sourced failures. Structurally, they’re the same failure mode as a system administrator running rm -rf / on the wrong server. The only novel piece is that the entity at the keyboard is now an LLM. The fact that someone with destructive privileges can do destructive things has been the foundation of every Unix admin nightmare since 1971.
The human baseline nobody checks
This is the piece missing from most “AI broke production” coverage, and it’s the bulk of what’s actually going on.
The Consortium for Information & Software Quality estimated the cost of poor software quality in the US alone at $2.41 trillion per year in 2022: operational failures, technical debt, cybersecurity damage, project failures. That is the human-written-code baseline AI code is being measured against, and it is already brutal.
The named cases are everywhere. Here’s a short list of famous human-written code that broke production catastrophically:
CrowdStrike, July 19, 2024. The largest IT outage in history. 8.5 million Windows machines crashed simultaneously. $5.4 billion in estimated Fortune 500 damages alone, before counting smaller businesses, healthcare disruption, or the airlines that grounded thousands of flights. Cause per CrowdStrike’s own root cause analysis: a kernel driver tried to read 21 input parameters from a struct that defined 20. Missing bounds check in C++ code. Missing test coverage for the input validation. Delta is suing for $500 million. Class action lawsuits are pending. Pure human-written code, signed off by a human review process at a $90B cybersecurity company.
Knight Capital, August 1, 2012. $440 million lost in 45 minutes. Bankrupted the company. An engineer deployed updated trading code to seven of eight servers. The eighth still ran a deprecated 2003 feature called “Power Peg.” A repurposed flag bit reactivated the dead code. Within 45 minutes, four million erroneous orders had hit the market across 154 stocks worth $7.65B in positions. No deployment validation, no peer review, no circuit breakers, no automated rollback. Knight got bailed out and absorbed.
AWS S3, February 28, 2017. Half the internet went dark for four hours from one typo. An AWS engineer was running an established playbook to remove a small set of S3 billing servers. He typo’d a parameter. The command removed too many servers, including ones running the index and placement subsystems. Cascading failure. Slack, Trello, Quora, Medium, Docker, IFTTT, and AWS’s own status dashboard all went down. Cyence estimated S&P 500 companies lost $150M during the outage alone.
GitLab, January 31, 2017. An engineer ran rm -rf on the production database server instead of the secondary replica. Lost six hours of data, affected 5,000 projects, 700 user accounts. Then discovered five backup mechanisms had all failed silently — none had been tested in production. The only working backup was a manual one taken six hours earlier. Same shape as the Replit/SaaStr incident. The only thing that changed was the entity at the keyboard.
Therac-25, 1985–1987. Killed at least three patients with massive radiation overdoses. Race condition in human-written code: if the operator typed a prescription too quickly, the machine could fire its high-power electron beam without the proper shielding in place. Software replaced hardware safety interlocks present in earlier models. Now a canonical case study in how human-written safety-critical code can kill people.
Boeing 737 MAX MCAS, 2018–2019. 346 deaths across two crashes. Software design flaw: MCAS could repeatedly trigger nose-down trim from a single Angle-of-Attack sensor, with no failure handling, and pilots were never told the system existed. Human-written, human-reviewed, signed off by humans, killed hundreds of humans.
Ariane 5 Flight 501, 1996. Rocket exploded 40 seconds after launch. ~$370 million lost. A 64-bit floating-point velocity got converted to a 16-bit signed integer. Overflow. Self-destruct. The code was reused from Ariane 4 without re-validation against Ariane 5’s flight envelope.
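The failure class itself is trivial to reproduce. The actual Ada code raised an unhandled Operand Error on the conversion; this TypeScript sketch shows the silent-wraparound flavor of the same narrowing bug, with an illustrative value rather than the real flight datum:

```ts
// A 64-bit float that comfortably exceeds the signed 16-bit range of ±32767.
const horizontalVelocity = 65_000.0; // illustrative, not Ariane's actual datum

// Narrowing to 16 bits, as the reused Ariane 4 code effectively did.
const narrowed = new Int16Array([horizontalVelocity])[0];

console.log(narrowed); // -536: wrapped past 32767 into nonsense
// Ariane 4's flight envelope kept the value small enough to fit.
// Ariane 5's didn't, and the conversion was never re-validated.
```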
Northeast blackout, August 14, 2003. 55 million people lost power across 8 US states and Ontario. ~100 deaths attributed to the blackout. General Electric’s XA/21 energy management system had a race condition that prevented operators from receiving alarm notifications about cascading line trips.
TSB Bank IT migration, April 2018. 1.9 million customers locked out of their accounts. Some saw other customers’ balances. CEO resigned. Estimated cost: £330 million.
Healthcare.gov, October 2013. The most expensive failed website launch in US history. ~$2.1 billion total cost. On launch day, six people successfully enrolled. The system was built to handle roughly 50,000 concurrent users and collapsed when several times that many showed up.
This isn’t an exhaustive list. It’s just the big ones. The empirical baseline behind them, per Steve McConnell’s Code Complete: 15–50 bugs per 1,000 lines of human-written code as the industry average. 1–5 bugs per kLOC even in released, post-test software. 0.5 bugs per kLOC for Microsoft’s released products with their full review process. 0.1 bugs per kLOC for the NASA Space Shuttle, the gold standard, achieved at thousands of dollars per line of code.
When CodeRabbit reports AI-generated code has 1.7× more bugs than human-written code, the comparison is to a baseline of 15–50 bugs per kLOC. That gets us to maybe 25–85 bugs per kLOC for AI code. Both numbers are alarming. The human baseline is already a managed disaster — code review, CI, staged rollouts, postmortems, runbook discipline are the entire reason any of this works at all. Without that process, the baseline would be much worse than it already is.
Where the comparison actually lands
Put the failure modes side by side. For every AI-code failure pattern in failure mode 1, there’s a famous human-coded equivalent that predates AI by a decade or more.
| AI-code failure | Human-coded equivalent |
|---|---|
| Lovable apps shipped without Row Level Security | Tea app shipped with an open Firebase bucket. Uber’s “God View” gave employees access to everyone’s location. Uncountable S3 buckets left publicly readable. |
| Moltbook hardcoded its Supabase key in client JavaScript | Decades of secrets-committed-to-git incidents. The entire reason GitGuardian exists as a company. |
| Slopsquatting: AI hallucinates a package name, attacker registers it | Typosquatting: a human typos react-router and installs malware. Same attack class, different cause of the typo. This is the closest equivalent and it isn’t quite the same. |
| Amazon Q generated a logic error in delivery time calculation | Knight Capital reused a flag bit that reactivated dead code. AWS S3 engineer typo’d a command parameter. |
| AI-generated Promise.all over an array storms the connection pool | Every concurrency bug humans have made since threading existed. Therac-25, Northeast blackout, Mars Pathfinder. |
| AI agent runs DROP TABLE or terraform destroy | GitLab engineer ran rm -rf on the wrong server. AWS S3 engineer wiped too many servers with one keystroke. |
| AI hallucinates security defaults | Therac-25 was designed without proper interlocks. CrowdStrike shipped without bounds checking. |
The point isn’t that humans are bad too, so AI is fine. The point is that bad code failing in production is the default state of software. AI is a new author of that bad code. The bad-code-fails-in-prod problem predates it by sixty years and costs $2.4 trillion a year.
Where the comparison breaks the AI case
Two things don’t survive the comparison cleanly.
The first is velocity. Human-written code happens at human speed. AI is faster, by a lot. CodeRabbit’s 1.7× bug rate combined with a plausible 5–10× code-volume increase per developer puts absolute bug volume per developer up roughly 8–17×: an order of magnitude, and most of that comes from volume alone, even at constant per-line quality. The review process that worked at the old volume doesn’t necessarily scale to the new one. Lightrun’s 43%-need-debugging stat is consistent with this. Amazon’s response to its March outages was to mandate senior engineer sign-off on AI-assisted code: a velocity-control measure, not a code-quality measure.
The second is slopsquatting. Typosquatting and dependency confusion existed before AI. But “the AI consistently hallucinates the same plausible-but-fake package name across many users, creating a deterministic attack surface for whoever registers it first” is a category that didn’t exist when humans were typing every import statement. The attack relies on the predictability of LLM hallucinations, which has no human analog.
Everything else on the AI side has a structural human equivalent. Slopsquatting is the one place the comparison genuinely breaks toward “AI introduces a novel risk.” The rest is cause-of-failure, not type-of-failure.
The meta-point
Most of the press coverage of “AI is breaking production” conflates the three failure modes from the setup, and the conflation matters.
AI-written code that ships with bugs is real, and probably happening at scale below the public attribution threshold. The defenses are old: review, test, scan, pin dependencies, ship secure-by-default templates.
AI agents with destructive privileges are dangerous in a way that has nothing to do with the code they author. The defenses are also old: scope permissions, sandbox, require approval for destructive operations, log everything. The Replit incident is structurally identical to a junior sysadmin running rm -rf in the wrong directory, which the industry has been defending against since the 1970s.
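None of that requires new technology. A minimal sketch of the approval-gate idea, assuming a hypothetical runAgentCommand wrapper; the command patterns and the approve hook are illustrations, not any particular agent framework’s API:

```ts
import { exec } from "node:child_process";
import { promisify } from "node:util";

const sh = promisify(exec);

// Hypothetical gate an agent harness routes every shell command through.
// A real deployment pairs this with sandboxing and least-privilege
// credentials rather than trusting a denylist alone.
const DESTRUCTIVE = [
  /\brm\s+-[a-z]*r/i,              // recursive deletes
  /\bterraform\s+(destroy|apply)/, // infra mutation
  /\bdrop\s+(table|database)/i,    // SQL destruction
];

async function runAgentCommand(
  cmd: string,
  approve: (cmd: string) => Promise<boolean> // human-in-the-loop hook
): Promise<string> {
  console.log(`[audit] agent requested: ${cmd}`); // log everything
  if (DESTRUCTIVE.some((p) => p.test(cmd)) && !(await approve(cmd))) {
    throw new Error(`blocked destructive command: ${cmd}`);
  }
  const { stdout } = await sh(cmd);
  return stdout;
}
```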
The human-written-code baseline is already catastrophic. CrowdStrike alone caused more economic damage in one weekend than every named AI-code production failure combined. Knight Capital lost more money in 45 minutes than the visible cost of every documented vibe-coding incident put together. The reason these incidents don’t dominate the news cycle anymore is that we’ve gotten used to them.
So the defensible conclusion isn’t “AI code is fine.” It also isn’t “AI code is dangerous.” It’s smaller: the failure mode is the same. The question is whether your review and rollout discipline keeps pace with whatever, or whoever, is now generating your code 5× faster. That’s a question about engineering culture. It happens to be the same question we’ve been answering badly for sixty years.
Engineering cultures that already handle bugs well — review, CI, staged rollouts, blameless postmortems — handle AI bugs about the same. There are just more bugs per developer per hour. Cultures that don’t handle bugs well find out faster. Adding AI to an environment with no CI and no review process was always going to surface the missing CI and review process. AI didn’t create that gap. It declined to paper over it.
The accurate version of “vibe coding broke production” is something like: your existing review process broke under increased code velocity, and the velocity increase happened to be from AI. Less marketable. Closer to true.
Closing
Anyone using AI-code failures as evidence that AI shouldn’t write code is, by extension, arguing humans should stop writing code too. The CrowdStrike kernel driver, the Knight Capital deployment, the AWS typo, the GitLab rm -rf, the Therac-25 race condition, the 737 MAX MCAS — these are all cases where humans wrote code with the same failure modes AI gets blamed for, with vastly higher body counts and dollar costs, across decades of evidence.
I work on a TypeScript-to-native compiler that is mostly written by Claude Code under my direction. I review the architecture and the diffs; the agents do most of the typing. The thing I’m defending against is failure mode 1: AI-written code that compiles, passes tests, ships, and turns out to have a logic error in production. I review what comes back the same way I’d review human code — maybe more carefully, because the volume is higher. That’s the only real adjustment.
The question worth asking isn’t whether AI code is safe to ship. It’s whether your engineering culture is honest enough to catch anyone’s bad code before it reaches users. If yes, AI is mostly a speedup. If not, you’ll find out at scale.
Sources for the named incidents: CrowdStrike’s own root cause analysis (PDF on their site), the SEC 8-K filing on Knight Capital, AWS’s official postmortem on the S3 outage at aws.amazon.com/message/41926, GitLab’s published postmortem, the OECD AI Incidents Monitor for the Amazon March 2026 incident, the NIST NVD entry for CVE-2025-48757, Wiz Research’s disclosure on Moltbook, and Bar Lanyado’s writeup of the huggingface-cli experiment. The CISQ “Cost of Poor Software Quality in the US: A 2022 Report” is the source for the $2.41 trillion figure. CodeRabbit’s State of AI vs Human Code Generation Report, Veracode’s 2025 GenAI Code Security Report, the Lightrun engineering survey, and the Tenzai assessment are the sources for the empirical comparison data.