AI Finds the Holes. Only Your Engineers Can Tell Which Ones Are Real.

Jul 02, 2026

Every AI security tool now sold to find your bugs comes with the same admission, usually buried in the fine print: it does not work without a human to check the output.

OpenAI and Trail of Bits put it in plain sight. On June 22 they launched Patch the Planet, which turned frontier models loose on 19 open-source projects in its first week and filed 64 pull requests. Their writeup describes the models producing a “firehose of security findings” that already-stretched maintainers have to sift by hand to tell the real vulnerabilities from the false positives. Trail of Bits engineers triaged every finding before it reached a maintainer, because the expensive part of the work is no longer the finding but everything after it.

That is what every tool in the category sells you: the finder, and the checker stays your problem.

The finders all keep the same scoreboard

By Dustin Childs’s count, Microsoft shipped 208 CVEs on June 9, a record. He has tallied them since 2017 and called it “by far the largest monthly release” in that time. In May, Microsoft credited 16 of its fixes to an internal system running more than 100 AI agents. Mozilla shipped Firefox 150 with 271 vulnerabilities flagged by a single model, folded into 41 CVEs in the advisory. Palo Alto ran frontier models across its own products and surfaced 75 issues in a month that normally brings fewer than ten. Anthropic scanned more than 1,000 open-source projects with its Mythos model and logged 23,019 issues, a number it reports itself.

Every one of those is a discovery count. None of them is a fix count, and none of them is a risk count. VulnCheck tracked exploitation of every 2025 CVE: about 1% were ever used in an attack. Knowing about a hole changes nothing until someone closes it. At zero spare capacity, a vulnerability you have logged and cannot patch leaves you about as exposed as one you never found, with a longer backlog to show for it. The machines are filling a queue, not draining one, and the queue is made of work, not danger.

The one job the machine cannot do

You could read the whole surge as good news. The code did not get more broken, the tools just got better at seeing what was already there. It is a comforting story, and it falls apart the moment you remember who else owns the tools.

The attacker runs the same scanners and the same models against the same code, and gets the same list of holes. I covered that side of it in the security nightmare earlier this year. Better discovery was never a defensive edge, because both sides discover equally. What differs is the position each side works from. The attacker works on one thing and controls the clock: they polish the exploit, confirm it fires, and spend as long as that takes, because a broken exploit is wasted effort. The defender controls none of it. AI writes new code faster than anyone can review it, and the tools flag holes in that code faster than anyone can confirm them. That is where breaches come from: generation outrunning review, on the one side that never gets to set the pace. I traced the same gap in green dashboards.

A finding is a claim: there is a bug here, in this code. That claim needs a name on it.

The model that raised it cannot be the one to sign off. It re-checks its own output, and that helps at the margin, but a system that could clear its own false positives at scale would not be producing them at this volume to begin with.

A security team reading the alert cold sees the pattern, not the function it lives in. Only the engineer who already knows that code can tell you whether the danger is real here.

OpenAI runs on exactly that premise. Its engineers work directly with each project’s maintainers, reproduce the evidence, and confirm findings against the real code before anything reaches the people who own it. Anthropic built the same kind of tool, Claude Code Security, on the same admission: its scanner re-examines each result to filter its own false positives, attaches a confidence rating, and routes every finding to a human to approve, and the documentation tells you to review each proposed patch before you apply it. Two of the largest labs in the world shipped the bug-finder, and both of them made the human who checks it the part the whole thing rests on.

All of that lands on one desk: the engineer who has to look at each machine-generated claim and decide if it is real. Take that person out of the loop and you have not bought faster security. You have wired a high-volume false-positive generator straight into production.

The gap does not close

Lightrun surveyed 200 engineering leaders and found that 43% of AI-generated code changes still needed manual debugging in production after passing QA and staging, and not one of them reported being able to verify an AI fix in a single deploy cycle. The automated pipeline trips over plain functional bugs before it gets anywhere near a security review.

You will hear that this is temporary, that one more model generation will catch its own mistakes and you will not need the human. Veracode tested over 100 models and found 45% of generated code carried a security flaw, and the rate held whether the model was bigger, newer, or more expensive. That number is what the wait-and-see case depends on, and it has not moved.

I traced that loop in the mirror: the model reproduces the broken code it was trained on, then the next tool finds the breakage and bills you for the cleanup.

The job market is cutting exactly that person

So the obvious move would be to grow the one role that makes all of this output usable. The industry is doing the reverse.

The engineer with deep context on a codebase is the most expensive and least scalable input in software, and that is the input being cut, on the theory that AI now replaces it. The vendors already know it is the binding constraint. After scanning a thousand projects, Anthropic reported that the maintainers had become the bottleneck, and that finding the bugs turned out to be the easy part and fixing them the hard one. Of the high-severity findings it stopped to assess, a small share of everything it flagged, more than nine in ten held up as real. Accuracy does not rescue the defender. A true vulnerability nobody has time to triage or patch sits in the same queue as a false one. Its disclosure policy draws the obvious conclusion: it holds back how many findings it sends any single project, down to a rate its maintainers can keep up with.

The scarce resource was never the finding. It was the person who could confirm and fix what was found.

Gartner studied 350 firms in May and found the companies cutting hardest showed no improvement in financial returns. Forrester reported in January that many of the firms announcing AI-driven layoffs have no mature system actually ready to do the work of the people they let go.

The people who can tell a real finding from a false one are made slowly, on real codebases, over years. The tools are generating more work that only they can clear, and they are the ones getting cut.

From the trenches

I run this loop at small scale, and I will tell you what it actually costs.

We have two security review agents. One runs on ordinary changes and mostly returns false positives I clear by hand. The other covers the OWASP LLM Top 10 and runs only when a change touches model code, where the failure modes are different: prompt injection, improper output handling, excessive agency. Neither agent replaces the review. Both add to it. The honest description of AI security tooling in my shop is that it generates more things for a human who knows the code to go and check.

So the move that actually lowered our exposure came from the other direction. Less to find beats a better finder. We have been cutting dependencies hard, front and back. Snyk flagged Axios every single week, so we removed Axios entirely, and the alert stopped because the thing it alerted on was gone. Every dependency you delete is a stream of findings you never have to validate again. The cheapest finding to triage is the one that never enters your tree.

The same instinct governs what our models are allowed to touch. We keep their permissions as narrow as the task allows: only the systems they genuinely need, and read access wherever reading is the whole job. A model that queries a database gets its own credential, scoped to read and nothing else, so there is no write path to misuse. Around that we run a guardrail watching for any attempt to do something other than read, and an audit log of what the model reached for when it tried. A model will eventually try something you never asked for. Narrow scope decides how far it gets. Giving the model only what it has proven it needs is an engineer’s call, and it is the one the model will never make about itself.

The uncomfortable part

The same vendors selling you AI that found 23,000 holes are telling you to cut the engineers who would close them. Sam Altman put it on the record back in 2025: “maybe we do need less software engineers.” Both cannot hold, because a finding is worth nothing until one of those engineers confirms it is real.

You can automate finding the bug. The engineer who knows whether it matters is the one getting a severance package. But, yes, moron, we need fewer engineers.

Share with Sam, he needs to know how the real engineering works.

From the Trenches

Discussion about this post

Ready for more?