Yes. Doubly annoying that it cannot consistently apply general principles in adjacent tasks in the same session context.
“This is okay, but you could have easily generalized this bit here and here to reduce duplication.”
“Good catch! Okay, I've refactored to reduce duplication and improve maintainability.”
Sure, it talks like it understands the concept. But five minutes later, it does it again somewhere else. Doesn't matter how many times you tell it. Put it in AGENTS.md, doesn't matter. Explicitly point out opportunities for generalization in the initial spec prompt, it'll still write virtually the same thing in three places and wait for you to point it out and ask for changes. Or worse it'll write some half-baked abstraction with a bunch of branching to handle each use case explicitly and wait for you to tell it to do it right.
Anyway, I could go on, but /rant. It's just very frustrating. I wouldn't tolerate this kind of “laziness” from an intern. People are usually pretty good at avoiding the same errors once they grok a principle. But then people don't have to read the entire internet to be halfway-competent at anything at all, unlike LLMs. We are just not the same, and the chatbots are not remotely similar to humans in capability. Let alone “superhuman” like the jackass salespeople and hypeclowns keep saying.
This is why I ended up building three separate AI review agents on top of the generation layer. Code simplifier, fullstack enforcer, architect. Three layers of AI fixing what the first AI refused to follow, plus my review on top. And it still doesn't catch everything. The "catch, I'll fix it, then do it again five minutes later" pattern is the one that made me stop believing instruction-following is solvable at the prompt level. The model doesn't learn principles, it pattern-matches tokens, and the next context window is a clean slate.
Multiplying already-expensive API tokens further divides the value derived, that's the problem. Even if some N-council of chatbots could finally produce consistent junior-level code, how much would it cost end-to-end, and how much faster would it be, vs. just hiring a junior? Especially with the pricing of frontier models and their unpredictability in terms of cost thanks to “reasoning.”
As of right now the only model I consider worth the price is DeepSeek, since it delivers ~80% of the value for ~10% of the cost. But I'm not sure the hassle of setting up and babysitting all the infrastructure to do this is worth the effort even then.
It's a 3-year-old with ADHD that writes and sounds like an adult. We might not realize it, but as humans, the frustration of dealing with an adult that is incapable, for many hours of our day, is incredibly draining.
I say this because my son has a friend whose parents are huge and the kid was 6, but looked like a 10-year-old. The kid had a hard time making friends and being liked by parents because you were expecting the behavior of a 10 y/o when you actually had a 6 year old.
My grandma had dementia and I had the worst three months of my life living with her. She seemed coherent some days, only for the next day to (literally) hallucinate stuff.
Spot on. And the industry's answer is tools like mem0 that bolt memory onto a fundamentally stateless system. You're not making it remember. You're making it pretend to remember.
Good article, Denis and hope you are well and safe.
In my opinion generated code is by itself worthless. It can easily be created at will in any desired amount.
The actual “value” comes from someone accurately and precisely expressing intent (which is best done in a formal language with unambiguous interpretation, we used to call this programming) of how a machine should behave and the intent being in itself correct in the sense of properly and reliably solving whatever task/problem that led the programmer to write the code in the first place.
This is why using ai to generate large amounts of code and then attempting to critically review it seems asinine to me. The AI can not read your mind and figure out what your intent is. You have to express it, and if you can properly do so then translating that to code was always trivial.
Not rambling at all, this is exactly right. I came at it from the opposite direction. Thousands of AI supervision sessions taught me the same thing. AI can't code around vagueness, so the work shifts to writing specs precise enough that there's nothing left to interpret. At which point, as you say, the translation step is trivial.
Except there's a second problem on top of that. Even when you write the perfect spec, the AI doesn't follow it. CLAUDE.md, .cursorrules, AGENTS.md, Windsurf rules. Every AI coding company built instruction-following systems precisely because base models ignore project conventions. The proliferation is itself the admission. I wrote about this here: https://techtrenches.dev/p/your-claudemd-is-a-wish-list-not
So the engineer writes the spec, validates that the AI followed it (it didn't), fixes what it ignored, and still owns the outcome when production breaks. Spec-writing, validation, accountability. Only the typing got offloaded. That's not a productivity gain, that's a job description change nobody negotiated.
I strongly feel that we're going in the wrong direction with all of this. It's being mandated by leaders who don't care about burnout and salivate at the idea that we're training our future algorithmic replacements. The industry is being dehumanized and empathy is in short supply. AI could be a useful tool, but it's being used as an authoritarian mandate to clobber morale and prioritize productivity over reason, decency, and common sense.
I could write extensively about all the problems with "AI" but I really haven't had any in particular in coding to any major extent. I have been able to use it in spite of all that might be wrong with it. The reason for this is that I am doing my own projects, and whether to use it or not is an option like with anything else. It's just a tool and it's on me to make proper use of it as defective as it might be. I work out what it can do. The one thing that I do not do is have it code for me. I can do that myself and I am in charge of that. It's like a lot of things really. Nice to have available if something comes up that actually warrants it but you are quite correct that the use should not be imposed.
My case might not be reflective of the industry. One advanced programmer alone is different to a lot of programmer shops where there are a lot of low quality programmers anyway. There might then be a desire to use artificial programmers to replace or enhance juniors with the assumption that it will be the same or better but the burden is not the same. There are many aspects to this but one of them is that the AI is like the Borg collective or a switchboard in its training. There are in a sense many people inside of it rather than a singular coherent person. It's really all over the place. It's like reviewing the code fragments of a million people. It can change all over the place. This is just one example. Then you have its ability to produce too much too easily, similarly to the copy-paste programmer. It'll also never be able to learn certain things where a junior programmer, even one with a lack of talent, can.
The burnout stat is the buried lede here. We celebrate the code volume increase while ignoring the 88% burnout. That’s not a productivity gain. That’s borrowing against your engineers’ health to show a metric that looks good in a board deck. Senior engineers know when they’re being consumed, not multiplied.
A key point about the context switching - I doubt it’s good for anyone but… Neurodivergent people are over-represented in tech. Part of the reason is they thrive in hyper-focus mode. Context switching is the opposite of that. I predict a tsunami of burnout. Perhaps tech companies will keep churning through the excess (laid off) staff like Amazon warehouses churn through low paid workers. I’ve never understood who all these staff-laying-off / price-raising companies think is going to be able to buy their stuff. People in China?
The neurodivergent angle is the part I didn't dig into and should have. This is literally what's killing me personally. Not the volume. The constant switches between fundamentally different cognitive modes. Validation, generation, decision, communication, every few minutes. My brain produces its best work in deep hyper-focus, and the AI workflow is the exact inversion of that. The tsunami you're predicting is already starting. The people who delivered the most are the ones breaking first.
I have the exact same issue with context switching although mine came about through working on and supporting fragmented products. I don’t think AI has hit tech in the UK as much yet.
It's not all that true any more and it's one of the reasons I now only work alone or on benefits. It used to be true. Now it's very patchy and increasingly rare. It's filled with socialites now who insist you have to use this style, that style, this framework, all because it's the current fashion. It's unbearable already for people like myself who actually have some form of high functioning autism. These are the same people who would form gangs when socially developing then want to find someone singled out to violently attack. It's impossible for me to work in a current working environment because I'm always on the cusp of being about to break someone's neck and having to hold myself back. I mean for christ's sake the number of times you have to tell someone to shut up telling you to break up a function that doesn't need it just for their social need to dominate over others and get them to do things to assert their position in their imaginary psychosocial hierarchy or their perverse need to control others. It's just not healthy or safe to be in that situation all day every day of wanting to constantly mutilate those around you. Self employment is really the only viable option at this point. There's no freedom in corporate technology. It's now a mass industry just filled with normal people churned out of the universities or whatever.
I agree with you on “corporate technology” although in my experience tech depts of older large companies have been like that for decades (banking, insurance etc). I suppose it took a while for the younger tech companies to catch up. Small to medium companies are better although product support can still have a lot of context switching.
The difficulty is that you program something that works fine. The code is excellent. It's a masterpiece. There are zero bugs. It is perfection. It is clean. It can be read. It is efficient. The job is done. Yet it is not. Someone will find something wrong with it even though there is nothing wrong with it. They will demand you can't just do one thing. You have to do ten other things for the sake of it. It is nothing to do with the job description, the requirements of the business or any technical concern at all. It's all social and psychological. You can't just do one piece of code that does what it is supposed to. You have to go to this file and that file for no actual reason.
Meanwhile the rest of the team has spent six months on the same task for the same type of device but merely a different vendor and is still not done. For you see they decided they wanted to do it the professional way. They decided to make a microservice. The first thing they did was import a framework that pulls in a million lines for something that can be done without it in a few hundred. Then they decided to use PHP for HTTP services, which would not be a problem except they decided they wanted to learn everything on the job specifications for all the financial software companies in the city that use Java, paying Oracle through the nose for it. The next thing you know they are creating hundreds of thousands of lines by hand because they are unrolling the types and passing them as parameters through interface names or method names, as PHP doesn't have generics. To this day I still don't think they are done. They're still trying to finish all the unit tests first before they can finish the code.
LLMs are really obnoxious but as long as you're in control you can tell them to shut up or turn them off. That always causes the person to react like it's not their fault when I do that to them and I never hear the end of it when they started it. If the LLM isn't confined to a browser tab then that's a big problem right there. That's too much access. It needs to stay in its box.
The sad future of programming is that it won't involve any creation. It will involve working on massive validation systems that don't exist yet, while the AI writes all the "fun" code.
It's inevitable because as time goes on there will be fewer and fewer programmers who have the experience of today's senior devs. After all, where will they get the experience if they are AI coding?
Fairly senior engineer here, at least in the sense that I spend most of my time reviewing other people's code (and have been since well before GPT got popular).
Much of what you say is true and my candle is surely burning at both ends, but I still feel that I'm getting a lot of value out of AI assistance, including assistance with code review and debugging. It's unfortunate that the latter capabilities lag behind what one might call the script kiddy aspect, but they are nonetheless improving, and I have hopes of reaching a better equilibrium.
But we'll see if I still have hope when the next batch of summer interns arrives and wreaks havoc.
The review and debug capabilities lagging behind generation is exactly the inversion of where the industry should be investing. Generation is the cheap part. Understanding what got generated is where the actual value lives, and that's the skill AI is worst at. If the next generation of tooling fixes that asymmetry I'll be the first one happy about it.
On the summer interns, I wrote about this in the comprehension extinction piece. The part that keeps me up isn't what they'll wreak this summer, it's what they'll look like in five years when they're "senior" by title but have never built a mental model from scratch. That's the bill that comes due later.
I am curious if there are any numbers/data available for the physical toll on the brain and if that is broken out by task (learning, code/quality review, etc.). There were a few weeks where I was deep in technical papers for the entire week and that felt more exhausting than other activities.
One metric might be glutamate. One blog that I have seen used that as a metric.
Yeah there's a paper on this. Wiehler et al., Current Biology 2022. They scanned people's brains across a workday and found glutamate buildup in the lateral prefrontal cortex after hard cognitive work. Literally a byproduct piling up in the region you use for control and decisions. Mental fatigue isn't a feeling, it's chemistry.
Dense technical reading is probably the worst case because you're building the model from scratch with nothing to lean on. Code at least gives you syntax. A paper gives you prose and you hold the whole thing in your head.
Nothing clean on cost broken down by task type. If you find something, send it over.
My $.02 on dealing with cognitive overload of AI generated PRs: use AI, generate a readable document that explains key points of proposed changes. It really helps to see a big picture and identify high-level architectural screw ups generated by AI before diving into a sea of code lines.
This is an observation I had personally about a year ago, and it quite surprised me. The AI-structured tools I made at work created more work for me and teams around me like a shockwave.
If code can be generated and tested at roughly a 1:1000 effort ratio compared with before, the bottleneck shifts drastically.
I don’t even think about the work I do now as “coding”; that’s incidental. I am taking very large problems and handing them to Claude Code, and seeing how it solves the problem, not what code it generates to do so.
I wrote something about this a decade ago - complex code which is generated on demand and thrown away when useless. When the cost goes to zero for software, how you organize solving a problem becomes much more important.
Just came across this. Thank you for expressing so well something which I have been feeling since using AI more and more at work.
At the end of the day my brain feels both too empty and too full, an aching vacuum, as if I had spent the entire day eating a humongous meal, but I was still hungry.
I do a lot of strength training, and I can in a way compare the feeling to muscle failure. While at the gym you break down the muscle (in a controlled way, following a program, with a lot of rest as part of the process) to rebuild it stronger, here it feels like my brain is just breaking down, day after day.
I blamed it on the context switch at first, even tho I try my best not to have multiple agents running at the same time.
Then I wondered whether it was the lack of flow: when I code "by hand", it's like I'm staring at a puzzle and I'm trying to make sense of it. Is this the corner piece? Should this go here? And at some point things click, I put music on, and I can code for hours. It's something which gives me a great deal of pleasure (that all elusive flow state!).
With AI, I'm never in flow. I saw people describing context switching over 290 agents, each of them wanting something from the "human orchestrator" as an amazing thing; I just get exhausted.
The amazing comment from Fukitol really hit a different angle: it's akin to being in a bad relationship. Wow. I wish I never had the comparison ready, but yeah, I've been in a relationship in which we were trying to communicate *a lot* but something was not aligning, no matter how many words were exchanged. Sometimes I feel like that. Trying to articulate my intent for a semi black box to capture it, understand it, and sometimes, depending on the gradient, no amount of communication will make the model do what it needs to do.
But they will almost always explain to you why they're right, or hallucinate into what they've done, leaving the burden to you to disprove it, fix it, and deal with the consequences.
A bad relationship. Great take :)
On the flip side, this AI shift is one of the reasons I took up writing here on Substack, recently: I need a creative output to get things out of my head, now that writing code is not that outlet anymore.
The muscle failure analogy is the one I wish I'd thought of. At the gym you break down in a controlled way with a recovery protocol and you come back stronger. This is uncontrolled breakdown with no recovery and no adaptation. Just the same damage again tomorrow.
The flow state part hits close. I used to lose hours in code the way you describe. Puzzle, click, music, gone. With AI I'm never gone. I'm always routing, always switching, always half-reading something a machine wrote while half-thinking about whether it matches what I asked for. That's not engineering anymore. That's project management of a very fast intern who gaslights you when you point out mistakes.
I’ve been circling the same problem from the governance side: AI can scale production faster than organisations can scale judgement. I am not an engineer but a cyber security professional, yet the problem we face is the same: the sheer pace of output that needs review.
The trap is pretending “human review” still means control when the reviewer is exhausted, outpaced, and working from evidence they did not create. At that point the human is not in the loop so much as holding the liability.
I wrote something adjacent on this recently, less from the burnout angle and more from the trust-boundary angle: which loops should humans own, which should machines run, and where does review stop working?
As much as I like the article, here's a passage that got me thinking:
"The SmartBear/Cisco study established numbers everyone ignores: defect detection drops from 87% for PRs under 100 lines to 28% for PRs over 1,000 lines."
If we're talking about the same often-quoted research from 2006 (neither the link nor the linked article leads to the actual paper), I don't think they claimed anything like that. Sounds like a hallucination. They had very few data points with PRs longer than 1,000 lines, and explicitly discarded those longer than 2,000 from the data.
Now, I don't dispute the general conclusion (bigger PRs, worse defect detection), and the research supports that point of view. Yet, the numbers seem made up.
My current client is a small company so, although I'm in a senior role, I only have to deal with PRs from a few juniors. I feel the burn more on my end, since I wear a lot of hats on such a small team and (under protest) have worked in Claude Code in hopes of improving throughput (futile hope, I argue). So probably 50% of the LLM code I review in this ongoing experiment, I prompted.
I also use it on pet/personal projects where the stakes are low. At worst, something only I use does something unfortunate. This is a fine use case for LLMs I think, but because I have been at this 20+ years I naturally check and recheck every line anyway. And that's where it gets weird. That's where I stare into the void.
It is very difficult to model the "mind" of a stochastic text generator. With other (human) programmers, I can have conversations with them, get a feel for their strengths and weaknesses, know when to second guess something that looks odd in their code and when to assume that, if the logic checks out, they had a reason for doing it the way they did.
Not so, at all, with chatbots. They will speak with absolute confidence in great depth and detail on any subject or specialty, then fuck up the most basic things, then spit out some fully functional if inelegant code, then, asked for a small change, rewrite an unrelated half of their own code wrongly.
This is what fries my brain. With them it's contexts within contexts all the way down, all of them changing constantly according to some deranged Rube Goldbergian clockwork of tensors and matrices. Our brains are not made to deal with this fractal insanity. Huge portions of our psyche are built around dealing with other people, or people-like things, that exist within some boundary of predictability that can be discovered with observation and familiarity.
We leverage this subconsciously whenever we talk to our pets or plants, or see gods and spirits at work in the impersonal forces of nature, or wonder why our code or gadgets are misbehaving. But *especially* when something presents as human like chatbots do, this whole evolved subconscious architecture kicks in automatically.
And then the bot breaks it. Over and over. And we get exhausted, as we would in a bad relationship with a person with serious issues. Because no matter how much we tell ourselves logically, consciously, that the bot isn't human and can't be anticipated like one, that's only a single tiny input to the much larger true neural network inside our heads that begs to differ. And yet finds itself confused and disappointed moment to moment with the digital demon with whom we're trying to communicate.
This is the best comment this piece has gotten and you went somewhere I didn't in the article. The theory-of-mind angle is real and I think it's underrated in everything written about working with LLMs.
The thing you're describing about modeling other programmers, knowing when to second-guess and when to assume they had a reason, that's not a soft skill. That's the actual core of senior engineering. You're not reading code in isolation, you're reading it through a model of the person who wrote it. It's how you triage what to read carefully and what to skim. Without that model you have to read everything at full attention because you can't predict where the failure modes are. With LLMs you don't get to build the model because there isn't one to build. Every session is a stranger who lies confidently and inconsistently.
The fractal contexts thing is exactly it. With a human you eventually find the floor, the place where their reasoning bottoms out in something stable. Even bad programmers have a floor. LLMs don't have a floor because there's no continuous self underneath, just whatever the previous tokens happened to be. So your brain keeps drilling down looking for the bedrock and never hits anything.
The bad relationship analogy is going to stay with me. That's the right shape for what this feels like at the end of the day. You're not tired from the work, you're tired from the parasocial mismatch. Your subconscious keeps trying to do what it does with humans and gets nothing back, and that nothing-back is what wears you down.
I might write a follow-up piece on this. If you're ok with it I'd want to quote part of what you wrote here. Either way this is the comment I'm going to keep coming back to.
Feel free to quote, no attribution (or anon) please. I like to keep my substack presence low profile.
Maybe the key is in creating a continuous AI persona so they can learn.
Right now if allowed to continuously learn they decohere, same as when you overtune with LoRA. It would be great if they were more stable, but there seem to be some missing pieces to enable that, and nobody knows what they are yet.
Do you have interaction with people or just with themselves?
In both cases. I mean early on we saw what happened with Microsoft's Tay, and that was on par with GPT 2 iirc. But the problem persists in newer models.
I personally think it's a deficiency in the way training works, not necessarily the models themselves, but it's just a hunch. I have not messed around with training much, let alone done a deep dive on it or written my own. But it seems to me that it's pretty broad spectrum, a shotgun where you need a scalpel, and that may be why the model starts breaking down. As a counterexample, there's a pretty cool "decensoring" system that achieves ~90% reduction in prompt refusals without noticeably impacting other metrics or real world performance, and it does it by isolating and modifying specific weights that activate when the model refuses a prompt. Something similar for correcting or enhancing other behaviors might bear fruit.
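To make the weight-level idea concrete: the system isn't named here, so this is only a toy sketch of one way such an intervention can work, assuming the unwanted behavior shows up as a single direction in activation space (an approach usually called directional ablation). Every function name and number below is made up for illustration, and it is pure Python on tiny lists, not a real model.

```python
# Toy sketch of directional ablation. Hypothetical names, made-up data.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale v to unit length."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def refusal_direction(refused_acts, complied_acts):
    """Assumed 'refusal direction': normalized difference of mean activations."""
    mean_refused = mean_vector(refused_acts)
    mean_complied = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mean_refused, mean_complied)])

def ablate(weight, direction):
    """Return W' = W - d (d^T W).

    `weight` is a list of rows (out_dim x in_dim) that writes into the space
    where `direction` lives. After ablation, W' x has no component along d.
    """
    out_dim, in_dim = len(weight), len(weight[0])
    # d^T W: how strongly each input column writes along the direction.
    along = [sum(direction[i] * weight[i][j] for i in range(out_dim))
             for j in range(in_dim)]
    return [[weight[i][j] - direction[i] * along[j] for j in range(in_dim)]
            for i in range(out_dim)]

def matvec(weight, x):
    return [dot(row, x) for row in weight]

# Made-up activations recorded on refused vs. answered prompts.
refused = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
complied = [[0.0, 0.0, 0.1], [0.2, 0.1, -0.1]]
d = refusal_direction(refused, complied)

W = [[0.5, -1.0], [0.3, 0.7], [1.2, 0.4]]
W_ablated = ablate(W, d)
```

The scalpel part is that only one direction is touched: whatever `W_ablated` outputs, its projection onto `d` is zero, and everything orthogonal to `d` is left as it was. Whether other behaviors are this cleanly separable is exactly the open question.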
Anyway, of course there are practical problems with continuous learning too - you need about 2x VRAM for training vs. inference, so that would impact cost/profitability. Would have to decide whether the value of ongoing learning is worth the price vs. fixed weight models we presently deal with.
This is very well put. Negative prompts are great, but I only put negative prompts as a response when the AI does something retarded that I could never predict - because it would be infinite to try to negatively prompt everything I don't want in advance so you have to wait to see what it does, and every time you fix one problem, it seems to cause two more to appear like the sorcerer's apprentice, multiplying brooms. This time and energy of constantly telling it what not to do like a three-year-old with ADHD becomes the actual job.
"Telling it what not to do like a three-year-old with ADHD becomes the actual job." That's the whole piece I wrote in one sentence. The industry's response was to build instruction-following systems (CLAUDE.md, .cursorrules, AGENTS.md) to hold all these negative prompts in one place. Didn't fix it, just organized the wish list. https://techtrenches.dev/p/your-claudemd-is-a-wish-list-not
Yes. Doubly annoying that it cannot consistently apply general principles in adjacent tasks in the same session context.
“This is okay, but you could have easily generalized this bit here and here to reduce duplication.”
“Good catch! Okay, I've refactored to reduce duplication and improve maintainability.”
Sure, it talks like it understands the concept. But five minutes later, it does it again somewhere else. Doesn't matter how many times you tell it. Put it in AGENTS.md, doesn't matter. Explicitly point out opportunities for generalization in the initial spec prompt, it'll still write virtually the same thing in three places and wait for you to point it out and ask for changes. Or worse it'll write some half-baked abstraction with a bunch of branching to handle each use case explicitly and wait for you to tell it to do it right.
Anyway, I could go on, but /rant. It's just very frustrating. I wouldn't tolerate this kind of “laziness” from an intern. People are usually pretty good at avoiding the same errors once they grok a principle. But then people don't have to read the entire internet to be halfway-competent at anything at all, unlike LLMs. We are just not the same, and the chatbots are not remotely similar to humans in capability. Let alone “superhuman” like the jackass salespeople and hypeclowns keep saying.
This is why I ended up building three separate AI review agents on top of the generation layer. Code simplifier, fullstack enforcer, architect. Three layers of AI fixing what the first AI refused to follow, plus my review on top. And it still doesn't catch everything. The "catch, I'll fix it, then do it again five minutes later" pattern is the one that made me stop believing instruction-following is solvable at the prompt level. The model doesn't learn principles, it pattern-matches tokens, and the next context window is a clean slate.
Multiplying already-expensive API tokens further divides the value derived, that's the problem. Even if some N-council of chatbots could finally produce consistent junior-level code, how much would it cost end-to-end, and how much faster would it be, vs. just hiring a junior? Especially with the pricing of frontier models and their unpredictability in terms of cost thanks to “reasoning.”
As of right now the only model I consider worth the price is deepseek, since it delivers ~80% of the value for ~10% of the cost. But I'm not sure the hassle of setting up and babysitting all the infrastructure to do this is worth the effort even then.
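The arithmetic behind that trade-off, as a rough sketch. The 80%/10% figures are the estimates above; the assumption that every extra review agent costs about as much as one generation pass is mine, purely for illustration.

```python
# Back-of-the-envelope value-per-cost comparison.
# Inputs are relative to a frontier model doing a single pass (value 1, cost 1).

def value_per_cost(relative_value, relative_cost):
    """How much value each unit of spend buys."""
    return relative_value / relative_cost

def pipeline_cost(single_pass_cost, review_agents):
    """Assumption: each review agent re-reads roughly the same tokens,
    so total cost is (1 generation pass + N review passes) * cost per pass."""
    return single_pass_cost * (1 + review_agents)

frontier_single = value_per_cost(1.0, 1.0)
cheap_single = value_per_cost(0.8, 0.1)

# Three review agents stacked on top of generation, as described upthread.
frontier_council = value_per_cost(1.0, pipeline_cost(1.0, 3))
cheap_council = value_per_cost(0.8, pipeline_cost(0.1, 3))

for label, ratio in [
    ("frontier, single pass", frontier_single),
    ("cheap, single pass", cheap_single),
    ("frontier + 3 reviewers", frontier_council),
    ("cheap + 3 reviewers", cheap_council),
]:
    print(f"{label}: {ratio:.2f} value per unit cost")
```

This generously assumes the reviewers add no cost beyond tokens and that output value doesn't drop. It also ignores the setup and babysitting time, which is the part I'm least sure pays off.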
it's way worse than a 3-year-old with ADHD
It's a 3-year-old with ADHD that writes and sounds like an adult. We might not realize it, but as humans, the frustration of dealing with an adult that is incapable, for many hours of our day, is incredibly draining.
I say this because my son has a friend whose parents are huge and the kid was 6, but looked like a 10-year-old. The kid had a hard time making friends and being liked by parents because you were expecting the behavior of a 10 y/o when you actually had a 6 year old.
My grandma had dementia and I had the worst three months of my life living with her. She seemed coherent some days, only for the next day to (literally) hallucinate stuff.
Spot on. And the industry's answer is tools like mem0 that bolt memory onto a fundamentally stateless system. You're not making it remember. You're making it pretend to remember.
Good article, Denis and hope you are well and safe.
In my opinion generated code is by itself worthless. It can easily be created at will in any desired amount.
The actual “value” comes from someone accurately and precisely expressing intent (which is best done in a formal language with unambiguous interpretation, we used to call this programming) of how a machine should behave and the intent being in itself correct in the sense of properly and reliably solving whatever task/problem that led the programmer to write the code in the first place.
This is why using AI to generate large amounts of code and then attempting to critically review it seems asinine to me. The AI cannot read your mind and figure out what your intent is. You have to express it, and if you can properly do so then translating that to code was always trivial.
Sorry for the rambling.
Not rambling at all, this is exactly right. I came at it from the opposite direction. Thousands of AI supervision sessions taught me the same thing. AI can't code around vagueness, so the work shifts to writing specs precise enough that there's nothing left to interpret. At which point, as you say, the translation step is trivial.
Except there's a second problem on top of that. Even when you write the perfect spec, the AI doesn't follow it. CLAUDE.md, .cursorrules, AGENTS.md, Windsurf rules. Every AI coding company built instruction-following systems precisely because base models ignore project conventions. The proliferation is itself the admission. I wrote about this here: https://techtrenches.dev/p/your-claudemd-is-a-wish-list-not
So the engineer writes the spec, validates that the AI followed it (it didn't), fixes what it ignored, and still owns the outcome when production breaks. Spec-writing, validation, accountability. Only the typing got offloaded. That's not a productivity gain, that's a job description change nobody negotiated.
I strongly feel that we're going in the wrong direction with all of this. It's being mandated by leaders who don't care about burnout and salivate at the idea that we're training our future algorithmic replacements. The industry is being dehumanized and empathy is in short supply. AI could be a useful tool, but it's being used as an authoritarian mandate to clobber morale and prioritize productivity over reason, decency, and common sense.
I could write extensively about all the problems with "AI" but I really haven't had any in particular in coding to any major extent. I have been able to use it in spite of all that might be wrong with it. The reason is that I am doing my own projects and whether to use it or not is an option like with anything else. It's just a tool and it's on me to make proper use of it as defective as it might be. I work out what it can do. The one thing that I do not do is have it code for me. I can do that myself and I am in charge of that. It's like a lot of things really. Nice to have available if something comes up that actually warrants it but you are quite correct that the use should not be imposed. My case might not be reflective of the industry. One advanced programmer alone is different to a lot of programmer shops where there are a lot of low quality programmers anyway. There might then be a desire to use artificial programmers to replace or enhance juniors with the assumption that it will be the same or better but the burden is not the same. There are many aspects to this but one of them is that the AI is like the Borg collective or a switchboard in its training. There are in a sense many people inside of it rather than a singular coherent person. It's really all over the place. It's like reviewing the code fragments of a million people. It can change all over the place. This is just one example; then you have its ability to produce too much too easily, much like the copy-paste programmer. It'll also never be able to learn certain things where a junior programmer, even one with a lack of talent, can.
The burnout stat is the buried lede here. We celebrate the code volume increase while ignoring the 88% burnout. That’s not a productivity gain. That’s borrowing against your engineers’ health to show a metric that looks good in a board deck. Senior engineers know when they’re being consumed, not multiplied.
A key point about the context switching - I doubt it’s good for anyone but… Neurodivergent people are over-represented in tech. Part of the reason is they thrive in hyper-focus mode. Context switching is the opposite of that. I predict a tsunami of burnout. Perhaps tech companies will keep churning through the excess (laid off) staff like Amazon warehouses churn through low paid workers. I’ve never understood who all these staff-laying-off / price-raising companies think is going to be able to buy their stuff. People in China?
The neurodivergent angle is the part I didn't dig into and should have. This is literally what's killing me personally. Not the volume. The constant switches between fundamentally different cognitive modes. Validation, generation, decision, communication, every few minutes. My brain produces its best work in deep hyper-focus, and the AI workflow is the exact inversion of that. The tsunami you're predicting is already starting. The people who delivered the most are the ones breaking first.
I have the exact same issue with context switching although mine came about through working on and supporting fragmented products. I don’t think AI has hit tech in the UK as much yet.
It's not all that true any more and it's one of the reasons I now only work alone or on benefits. It used to be true. Now it's very patchy and increasingly rare. It's filled with socialites now who insist you have to use this style, that style, this framework, all because it's the current fashion. It's unbearable already for people like myself who actually have some form of high functioning autism. These are the same people who would form gangs when socially developing then want to find someone singled out to violently attack. It's impossible for me to work in a current working environment because I'm always on the cusp of being about to break someone's neck and having to hold myself back. I mean for christ's sake the number of times you have to tell someone to shut up telling you to break up a function that doesn't need it just for their social need to dominate over others and get them to do things to assert their position in their imaginary psychosocial hierarchy or their perverse need to control others. It's just not healthy or safe to be in that situation all day every day of wanting to constantly mutilate those around you. Self employment is really the only viable option at this point. There's no freedom in corporate technology. It's now a mass industry just filled with normal people churned out of the universities or whatever.
I agree with you on “corporate technology” although in my experience tech depts of older large companies have been like that for decades (banking, insurance etc). I suppose it took a while for the younger tech companies to catch up. Small to medium companies are better although product support can still have a lot of context switching.
The difficulty is that you program something that works fine. The code is excellent. It's a masterpiece. There are zero bugs. It is perfection. It is clean. It can be read. It is efficient. The job is done. Yet it is not. Someone will find something wrong with it even though there is nothing wrong with it. They will demand you can't just do one thing. You have to do ten other things for the sake of it. It is nothing to do with the job description, the requirements of the business or any technical concern at all. It's all social and psychological. You can't just do one piece of code that does what it is supposed to. You have to go to this file and that file for no actual reason.
Meanwhile the rest of the team has spent six months on the same task for the same type of device but merely a different vendor and is still not done. For you see they decided they wanted to do it the professional way. They decide to make a microservice. The first thing they do is import a framework that pulls in a million lines for something that can be done without in a few hundred. Then they decided to use PHP for HTTP services which would not be a problem except they decided they wanted to learn everything on the job specifications for all the financial software companies in the city that use Java paying Oracle through the nose for it. The next thing you know they are creating hundreds of thousands of lines by hand because they are unrolling the types and passing them as parameters through interface names or method names as PHP doesn't have generics. To this day I still don't think they are done. They're still trying to finish all the unit tests first before they can finish the code.
LLMs are really obnoxious but as long as you're in control you can tell them to shut up or turn them off. That always causes the person to react like it's not their fault when I do that to them and I never hear the end of it when they started it. If the LLM isn't confined to a browser tab then that's a big problem right there. That's too much access. It needs to stay in its box.
The sad future of programming is that it won't involve any creation. It will involve working on massive validation systems that don't exist yet, while the AI writes all the "fun" code.
It's inevitable because as time goes on there will be fewer and fewer programmers who have the experience of today's senior devs. After all, where will they get the experience if they are AI coding?
Fairly senior engineer here, at least in the sense that I spend most of my time reviewing other people's code (and have been since well before GPT got popular).
Much of what you say is true and my candle is surely burning at both ends, but I still feel that I'm getting a lot of value out of AI assistance, including assistance with code review and debugging. It's unfortunate that the latter capabilities lag behind what one might call the script kiddy aspect, but they are nonetheless improving, and I have hopes of reaching a better equilibrium.
But we'll see if I still have hope when the next batch of summer interns arrives and wreaks havoc.
The review and debug capabilities lagging behind generation is exactly the inversion of where the industry should be investing. Generation is the cheap part. Understanding what got generated is where the actual value lives, and that's the skill AI is worst at. If the next generation of tooling fixes that asymmetry I'll be the first one happy about it.
On the summer interns, I wrote about this in the comprehension extinction piece. The part that keeps me up isn't what they'll wreak this summer, it's what they'll look like in five years when they're "senior" by title but have never built a mental model from scratch. That's the bill that comes due later.
I am curious if there are any numbers/data available for the physical toll on the brain and if that is broken out by task (learning, code/quality review, etc.). There were a few weeks where I was deep in technical papers for the entire week and that felt more exhausting than other activities.
One metric might be glutamate. One blog that I have seen used that as a metric.
Yeah there's a paper on this. Wiehler et al., Current Biology 2022. They scanned people's brains across a workday and found glutamate buildup in the lateral prefrontal cortex after hard cognitive work. Literally a byproduct piling up in the region you use for control and decisions. Mental fatigue isn't a feeling, it's chemistry.
https://www.cell.com/current-biology/fulltext/S0960-9822(22)01111-3
Dense technical reading is probably the worst case because you're building the model from scratch with nothing to lean on. Code at least gives you syntax. A paper gives you prose and you hold the whole thing in your head.
Nothing clean on cost broken down by task type. If you find something, send it over.
Thank you for the thoughtful articles!
My $.02 on dealing with cognitive overload of AI generated PRs: use AI, generate a readable document that explains key points of proposed changes. It really helps to see a big picture and identify high-level architectural screw ups generated by AI before diving into a sea of code lines.
Our current template:
# Description
## What changed
<!-- Summary of the changes: what was added, modified, or removed -->
## Why
<!-- Motivation and context behind the change. Link to ticket if exists -->
## Type of change
- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Infrastructure change (breaking or non-breaking but not visible to users)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
## Checklist
- [ ] Tests pass locally
- [ ] No new TypeScript/lint errors
- [ ] I have performed a self-review of my code
- [ ] Migration is reversible (if applicable)
- [ ] API changes are backward-compatible (or marked as breaking above)
## Prior Implementation Screenshots (if appropriate, required for bug fixes):
## Implementation Screenshots (must provide):
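A template like this only helps if it actually gets filled in, so here is a minimal sketch of a check that flags descriptions left blank. It assumes the section names from the template above; the function names and the idea of wiring this into CI are hypothetical, not something we run as-is.

```python
import re

# Section headings taken from the template above.
REQUIRED_SECTIONS = ["What changed", "Why", "Type of change", "Checklist"]

def split_sections(markdown):
    """Map each '## ' heading to the text that follows it."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}

def strip_comments(text):
    """Drop HTML comments so untouched template hints don't count as content."""
    return re.sub(r"<!--.*?-->", "", text, flags=re.DOTALL).strip()

def problems(markdown):
    """Return a list of human-readable problems; empty means it looks filled in."""
    sections = split_sections(markdown)
    found = [f"missing section: {name}"
             for name in REQUIRED_SECTIONS if name not in sections]
    for name in ("What changed", "Why"):
        if name in sections and not strip_comments(sections[name]):
            found.append(f"empty section: {name}")
    type_block = sections.get("Type of change")
    if type_block is not None and not re.search(r"- \[[xX]\]", type_block):
        found.append("no type of change selected")
    return found

if __name__ == "__main__":
    untouched = (
        "# Description\n"
        "## What changed\n<!-- Summary of the changes -->\n"
        "## Why\n<!-- Motivation and context -->\n"
        "## Type of change\n- [ ] Bug fix\n- [ ] New feature\n"
        "## Checklist\n- [ ] Tests pass locally\n"
    )
    for problem in problems(untouched):
        print(problem)
```

It deliberately checks only that something was written, not whether the explanation is any good. That part still needs a human reading it.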
This is an observation I had personally about a year ago, and it quite surprised me. The AI-structured tools I made at work created more work for me and teams around me like a shockwave.
If code can be generated and tested at roughly a 1:1000 effort ratio compared with before, the bottleneck shifts drastically.
I don’t even think about work I do now as “coding”, that’s incidental. I am taking very large problems and handing them to Claude code, and seeing how it solves the problem, not what code it generates to do so.
I wrote something about this a decade ago - complex code which is generated on demand and thrown away when useless. When the cost goes to zero for software, how you organize solving a problem becomes much more important.
A friend of mine coined the term: tokenization of work. He's a psychologist. Worth checking out: https://scrum-master-toolbox.org/2026/02/blog/when-boundaries-vanish-the-tokenization-of-work-and-the-wisdom-of-burnout/
Just came across this. Thank you for expressing so well something which I have been feeling since using AI more and more at work.
At the end of the day my brain feels both too empty and too full, an aching vacuum, as if I had spent the entire day eating a humongous meal, but I was still hungry.
I do a lot of strength training, and I can in a way compare the feeling to muscle failure. While at the gym you break down the muscle (in a controlled way, following a program, with a lot rest as part of the process) to rebuild it stronger, here it feels like my brain is just breaking down, day after day.
I blamed it on the context switch at first, even tho I try my best not to have multiple agents running at the same time.
Then I wondered whether it was the lack of flow: when I code "by hand", it's like I'm staring at a puzzle and I'm trying to make sense of it. Is this the corner piece? Should this go here? And at some point things click, I put music on, and I can code for hours. It's something which gives me a great deal of pleasure (that all elusive flow state!).
With AI, I'm never in flow. I saw people describing context switching over 290 agents, each of them wanting something from the "human orchestrator" as an amazing thing; I just get exhausted.
The amazing comment from Fukitol really hit a different angle: it's akin to being in a bad relationship. Wow. I wish I never had the comparison ready, but yeah, I've been in a relationship in which we were trying to communicate *a lot* but something was not aligning, no matter how many words were exchanged. Sometimes I feel like that. Trying to articulate my intent for a semi black box to capture it, understand it, and sometimes, depending on the gradient, no amount of communication will make the model do what it needs to do.
But they will almost always explain to you why they're right, or hallucinate into what they've done, leaving the burden to you to disprove it, fix it, and deal with the consequences.
A bad relationship. Great take :)
On the flip side, this AI shift is one of the reasons I took up writing here on Substack, recently: I need a creative output to get things out of my head, now that writing code is not that outlet anymore.
The muscle failure analogy is the one I wish I'd thought of. At the gym you break down in a controlled way with a recovery protocol and you come back stronger. This is uncontrolled breakdown with no recovery and no adaptation. Just the same damage again tomorrow.
The flow state part hits close. I used to lose hours in code the way you describe. Puzzle, click, music, gone. With AI I'm never gone. I'm always routing, always switching, always half-reading something a machine wrote while half-thinking about whether it matches what I asked for. That's not engineering anymore. That's project management of a very fast intern who gaslights you when you point out mistakes.
Hey, AI needs to handle the production server going down at 3am. Problem solved lol
I’ve been circling the same problem from the governance side: AI can scale production faster than organisations can scale judgement. I am not an engineer but a cyber security professional, yet the problems we face, with the sheer pace of output needing review, are the same.
The trap is pretending “human review” still means control when the reviewer is exhausted, outpaced, and working from evidence they did not create. At that point the human is not in the loop so much as holding the liability.
I wrote something adjacent on this recently, less from the burnout angle and more from the trust-boundary angle: which loops should humans own, which should machines run, and where does review stop working?
https://disinfectedmind.substack.com/p/how-do-we-scale-judgement
As much as I like the article, here's a passage that got me thinking:
"The SmartBear/Cisco study established numbers everyone ignores: defect detection drops from 87% for PRs under 100 lines to 28% for PRs over 1,000 lines."
If we're talking about the same often-quoted research from 2006 (neither the link nor the linked article leads to the actual paper), I don't think they claimed anything like that. Sounds like a hallucination. They had very few data points with PRs longer than 1,000 lines, and explicitly discarded those longer than 2,000 from the data.
Now, I don't dispute the general conclusion (bigger PRs, worse defect detection), and the research supports that point of view. Yet, the numbers seem made up.
Thanks for collecting all the insights. While everyone is focusing on “layoffs anxiety”, we forget about the other anxieties. I identified 7: https://danaaonofriesei.substack.com/p/the-seven-people-in-every-team-meeting?r=1wmkip&utm_medium=ios