Green All the Way Down

Meta gutted the judgment that kept it standing, and every metric stayed green right up to the breach.

Jun 30, 2026

Between 17 April and 31 May, attackers took over as many as 20,225 Instagram accounts by asking an AI support bot Meta calls High Touch Support to send a password reset link to an email address they controlled. The bot sent it without checking that address against the one already on the account. The accounts included the Obama White House. Meta did not notice for six weeks, and it named the cause in its June breach notification to the Maine Attorney General: “due to a bug in a separate code path, the system did not properly verify that the email address provided by the individual requesting a password reset matched the email address associated with that user’s Instagram account.”
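The check that went missing is small enough to sketch. What follows is a minimal illustration, not Meta’s code: the `Account` type, the `may_send_reset` function, and the normalization rule are all hypothetical names and assumptions made for the example. It shows only the invariant the notification describes, that a reset link goes to the address on file and nowhere else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    # Hypothetical record; a real account has far more fields than this.
    username: str
    email_on_file: str


def normalize(email: str) -> str:
    # Assumption: trimming whitespace and case-folding is enough for a sketch.
    # Production systems apply stricter canonicalization rules.
    return email.strip().casefold()


def may_send_reset(account: Account, requested_email: str) -> bool:
    """Allow a reset link only to the address already on the account."""
    return normalize(requested_email) == normalize(account.email_on_file)


if __name__ == "__main__":
    acct = Account("example_user", "owner@example.com")
    print(may_send_reset(acct, "Owner@Example.com "))    # same address, different casing
    print(may_send_reset(acct, "attacker@example.net"))  # address the requester supplied
```

The point of the sketch is how little it takes. The comparison is one line, and nothing about resolution speed would have told anyone it was absent.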

The bot did what it was built to do, which was resolve fast. Resolution speed was a tracked number, and it stayed green. Ownership verification was not tracked, so it failed in the only place an untracked thing can fail, which is production.

This is not a Meta story, and it is not an argument against AI. The same tools, at Spotify and at my own shop, produce the opposite result, because there is something underneath them worth multiplying. Meta is the cleanest specimen of the multiplier running the other way, an org now clearing out the judgment that kept it standing for two decades to make a token count go up. I argued in the last piece that the industry had started dismantling engineering culture on purpose, and Meta is what that looks like once it breaks. Most companies building toward the same thing don’t know they’re building it.

The Machine

The machine has three parts, and they drive each other.

Start with metric substitution. Meta counts token consumption inside performance reviews, so an engineer’s AI usage is now a number in the file that decides the raise. Turn a proxy into a target and people optimize the proxy: token counts go up, while the question of whether those tokens produced anything worth shipping has no metric attached and stops being asked. I traced where that bill lands in Token Race. The part that matters here is upstream of the money: once the number is the goal, the judgment that would question the number turns into overhead. The incentive runs all the way to the absurd, where shipping an outage on unreviewed AI code is survivable but writing a function by hand, without an agent, is the thing that can mark you for the next round of cuts.

Judgment extraction follows. Once output is measured in tokens, the work that makes no tokens stops counting as work. Review makes no tokens, and neither does judgment, so when Meta needed bodies for data labeling it took roughly half of Instagram’s Trust and Safety team. The headcount didn’t drop and the org chart looks the same, but what left the building was the immune response, the people whose job was to notice that a password reset with no ownership check is a breach with a delay timer.

Then the inversion nobody prices in. AI raises the rate at which code enters the system, and nothing raises the rate at which that code earns trust. Verification still runs at human speed, because it is the one job you can’t hand to the thing you’re verifying. The gap between arrival speed and trust speed isn’t lag, it’s blast radius, and it compounds.

You can watch the inversion run on the vendors themselves. Anthropic published that prompt injection against its browser agent succeeds 23.6 percent of the time with no mitigations and 11.2 percent with them, which is one in nine after the fixes, from the company that builds the tools and has every reason to make that number look smaller. The trust problem is not solved, it is reported as a percentage and shipped anyway, and the dashboard that ships it stays green.
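The “one in nine” reading is quick arithmetic, and so is what the rate implies when an attacker gets more than one try. The sketch below uses the two published rates from the paragraph above. The repeated-attempt calculation assumes each attempt succeeds independently at the same rate, which is my simplifying assumption and not something the published figures state.

```python
def one_in(rate: float) -> float:
    """Express a success rate as 'one in N' attempts."""
    return 1.0 / rate


def p_at_least_one(rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` tries succeeds.

    Assumption: attempts are independent and share the same success rate.
    """
    return 1.0 - (1.0 - rate) ** attempts


UNMITIGATED = 0.236  # published rate with no mitigations
MITIGATED = 0.112    # published rate with mitigations

if __name__ == "__main__":
    print(f"mitigated: one in {one_in(MITIGATED):.1f}")
    for n in (1, 5, 10):
        print(f"{n} attempts -> {p_at_least_one(MITIGATED, n):.3f}")
```

Under that independence assumption, a per-attempt rate that sounds tolerable stops sounding tolerable once attempts are cheap to repeat.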

The Green Dashboard

60.2 trillion tokens at Meta in thirty days, 84 percent of Uber’s engineers on agentic tools, every adoption number green, and none of them predicted an incident, because none of them measure the thing that breaks.

The numbers that measure the thing that breaks look different.

Faros AI tracked 22,000 developers across more than 4,000 teams, comparing each organization’s lowest-AI-adoption quarters against its highest. As adoption deepened, throughput rose, but incidents per pull request rose 242.7 percent and bugs per developer rose 54 percent, accelerating from last year’s 9 percent rise. The degradation isn’t steady, it’s getting worse.

The collapse itself is not new to me. I documented its early shape in Quality Collapse, back when it showed up in static code metrics rather than production telemetry, and nothing since has bent the curve the other way.

Amazon is the version measured in dollars. On 5 March, an outage dropped orders across North American marketplaces by nearly 99 percent for six hours, 6.3 million orders gone, and an internal briefing flagged a trend of high-blast-radius incidents involving Gen-AI-assisted changes. Amazon says it was user error and removed the AI reference from the document. It also imposed a ninety-day code safety reset across 335 Tier-1 systems, which is a strange response to user error.

Codex is the version with no meter at all. The problem traces back to a logging change made in February. It surfaced publicly on 14 June, when a developer named Rui Fan filed issue #28224 after his SSD took 37 terabytes of writes in 21 days. The writes were traced to a Codex feedback log that ran at its most verbose setting by default and was recording its own internal events, the logging included. That extrapolates to roughly 640 terabytes a year against a 1 TB consumer drive rated for around 600 terabytes of writes across its whole life, a full endurance budget spent in under twelve months. Someone in the thread posted a short SQLite trigger that drops every insert, a stranger’s patch for the vendor’s default, and OpenAI merged fixes on 22 June that reportedly remove most of the writes. The token bill at least arrives as a number you can read. This one arrives when the drive dies a year early.
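For readers who haven’t seen the trick, a trigger that drops every insert looks roughly like this. It is a sketch of the general SQLite idiom, not the patch from the thread: the `feedback_log` table, its columns, and the function names are hypothetical, and I have not reproduced the actual Codex schema.

```python
import sqlite3


def open_muted_log() -> sqlite3.Connection:
    """Create an in-memory database whose log table silently rejects inserts."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table standing in for a log store.
    conn.execute("CREATE TABLE feedback_log (id INTEGER PRIMARY KEY, event TEXT)")
    # RAISE(IGNORE) inside a BEFORE INSERT trigger abandons the triggering
    # INSERT for that row without raising an error to the caller.
    conn.execute(
        """
        CREATE TRIGGER drop_all_inserts
        BEFORE INSERT ON feedback_log
        BEGIN
            SELECT RAISE(IGNORE);
        END
        """
    )
    return conn


def row_count(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM feedback_log").fetchone()[0]


if __name__ == "__main__":
    conn = open_muted_log()
    conn.executemany(
        "INSERT INTO feedback_log (event) VALUES (?)",
        [("startup",), ("log_written",), ("log_written",)],
    )
    conn.commit()
    print(row_count(conn))  # the writer saw no error, and the table stays empty
```

The writer never learns its inserts went nowhere, which is why a few lines of SQL can work as a stopgap from outside the application.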

Ship bugs freely because the agents patch them fast, and every visible signal improves at once: incidents get caught before they spread, coverage climbs, bug reports fall. Any one of those numbers can end an argument. The one that would end it the other way, the count of people who still understand the system, sits on no dashboard.

Metric improvement and system degradation are the same event here, read at two different times.

The Amplifier

All of that is the machine running in one direction, and one company over it runs the other way. Spotify told investors that almost all of its engineers use AI weekly and that most code is now AI-assisted, and unlike Meta’s numbers, theirs describe something real, because the agent rides on fifteen years of platform engineering and a Fleet Management system that had already automated half their pull requests before Claude existed. I took the full case apart in Honk; the short version is that the AI replaced a twenty-thousand-line migration script, not the engineers, and it had a deep substrate of human judgment to multiply. Spotify is no clean fairy tale: it ran three rounds of layoffs and now pushes senior engineers toward an architect-and-editor model where agents write the routine code. What it didn’t do was reassign the judgment, and that is the whole distance between Spotify and Meta.

We run the small version of the same thing, where the review culture and the spec-first discipline are the substrate the AI rides on, and the output holds up. Spotify and Meta ran the same tools too. One of them still had engineers who could look at a green build and know it was green for the wrong reason, and the other was busy moving those engineers into data labeling.

The substrate isn’t something an engineer can install from below. Spotify’s came from fifteen years of leadership decisions, and Meta’s data-labeling reassignment was a leadership decision too, and so is putting token counts in the performance review. If the people above you have chosen the metric over the judgment, you don’t get to build Spotify from your desk, you get handed Meta, and the dashboard stays green the whole way down.

The Immune System

I still have that substrate, for now, and this month I watched exactly what it buys.

I was building a feature into our boilerplate, the foundation every new client project at my shop gets built on, so a defect there doesn’t ship to one product, it ships to all of them. The feature was a background loader, the kind Claude runs, where the model keeps generating on the server after you leave a chat mid-stream, work the backend had always done silently.

The bug came with that feature. The bot’s reply showed up fine while it streamed. Then the redirect fired, the one that runs when a new chat gets its GUID and jumps to the chat’s detail page, and it blanked the reply, leaving an empty space where the answer had been while the message itself sat in the database the whole time. Nothing about that was obvious, because the loader itself worked, and I caught it because I test everything by hand after I build it.

My AI review agents and E2E tests found nothing. I found the bug, told the model where it was, and watched it fail five fixes in a row with Playwright, the DOM, and the live state all in hand. It got fixed when I stopped watching it flail and read the code myself, in the same lines it had read clean five times over. The gap was never missing data. It was nobody to connect what the product was doing to the line that was doing it.

That connection is the immune response, and it is the input AI burns fastest and refills slowest. Without a human in that loop, the same bug clears every automated gate green and ships into every project on the boilerplate.

I watch the same failure in smaller doses every week. One of my engineers spent two days trying to edit text in a Google Doc from a browser extension, a task that hinges on one undocumented internal call, and told me the model couldn’t crack it. I didn’t take that on faith. I checked myself, already knowing the answer existed. Opus hit the same wall and argued my approach was wrong, until I gave it the prop name, _docs_annotate_canvas_by_ext, and it explained the answer back as if it had known all along. It was certain while it was wrong, and the only thing that closed the gap was someone who already had the answer.

The pipeline that produces the people who catch what the dashboards miss is the one being defunded to pay for the tools that generate the misses, and the new-grad and intern numbers behind it have been collapsing for two years. I laid out the human side of the load in Human Cost and the comprehension side in Comprehension Extinction, and the structural version is one sentence: you’re spending a reserve you stopped refilling.

Meta didn’t run out of engineers, it reassigned the ones who could read the password reset for what it was, and kept the dashboards green while it did.

The Uncomfortable Truth

None of this breaks the incentive structure, because the incentive structure is the cause.

Every dashboard a company builds measures the work it can see, and the work it can see is never the work that takes the system down.

The multiplier doesn’t care which way it runs. Put it over engineering culture and it compounds the work; run it while that culture gets stripped for parts and it compounds the rot. The dashboard reports the same green for both. Every defect it leaves uncounted is a broken window painted green, and enough of those tell everyone watching that no one is keeping the place. The breach is just the first thing big enough to walk through.

