The pace of change in the last year has been relentless. I write this post not for any kind of thought-leadership, but instead so I can reference it in a few months or a few years, and see how I feel now.
At home I use Claude Code a reasonable amount. At work I have recently got access to IBM Project Bob, and I use the shell version of it. Claude Code is more advanced, but Bob does the job. In short, I finally have a coding agent at work. Before Bob, IBM disallowed coding agents, so I was limited to using Claude online for tasks like research and coding queries.
Current beliefs
I wrote codeexplorer after seeing how Claude Code got a lot of its features from smart Claude prompts rather than from smarts built into the application using “normal” code. Prior AI coding helpers had focused on being smart about building context; Claude Code let Claude do that work instead.
The valuable thing about writing codeexplorer was how it reinforced just how much you could leave to the model, and so gave me the intuitive understanding that the model is way more important than any individual agent. Even so, one agent’s UX can be better than another’s, and some agents are just better at leveraging model capabilities.
- Belief: I’m really glad I wrote rapport and codeexplorer early last year (2024). While both are quite primitive, each did its job of getting to the core of an LLM interaction model — a chat bot and a (coding) agent.
- Belief: I think the experience of writing codeexplorer was especially valuable, as it gave me an understanding of the criticality of the agent loop (the sketch below shows the shape I mean).
- Belief: Model quality is still paramount. We haven’t yet reached a commodity state where we have a suite of “good enough” models for most tasks from a wide variety of providers, some of them open source. Proprietary frontier models still hold an edge; their advances in coding over the last year are why I believe this.
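For reference when I read this back, here is roughly what I mean by the agent loop. This is a minimal sketch rather than codeexplorer’s actual code; call_model and run_tool are hypothetical stand-ins for a model API call and for tool execution.

```python
# A minimal agent loop sketch. Not codeexplorer's real code:
# call_model and run_tool are hypothetical stand-ins for a model API
# call and for tool execution (read a file, run a command, ...).

def call_model(messages):
    """Send the conversation to an LLM and return a reply.

    Assumed to return a dict like {"text": ..., "tool_call": ...},
    where tool_call is None once the model thinks it is done.
    """
    raise NotImplementedError("wire up your model provider here")


def run_tool(tool_call):
    """Execute a tool request and return the result as text."""
    raise NotImplementedError("wire up your tools here")


def agent_loop(task):
    messages = [{"role": "user", "content": task}]
    while True:
        reply = call_model(messages)
        messages.append({"role": "assistant", "content": reply["text"]})
        if reply["tool_call"] is None:
            # The model has decided it is finished; hand back its answer.
            return reply["text"]
        # Otherwise run the requested tool and feed the result back in.
        result = run_tool(reply["tool_call"])
        messages.append({"role": "user", "content": result})
```

The point is how little there is: the loop itself is trivial, and all the interesting behaviour comes from the model deciding which tools to call and when to stop.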
Inspired by tools.simonwillison.net, I have started using Bob at work to write tools. For me, work seems to throw up more little tools I want than home does. Bob works fine for vibe-coding tools. An example is a simple single-page HTML diff app — I always seem to want to diff against things on my clipboard, and Bob built the app in five minutes. Classic vibe-coded tool.
- Belief: AI is pretty good at frontend, and I only have rudimentary skills. Leveraging AI for hacking out basic frontends is something I look forward to doing more of.
- Belief: I don’t know much frontend coding, and I still think that this lack of domain expertise will limit the quality of frontend I can write with an AI agent. That is: I think expertise in a coding domain helps you get the best out of coding agents.
- Belief: Combining the above two beliefs: I think that for many tools, AI can do “well enough”. People will be able to build themselves personalised tools with relatively primitive UX, but because they are exactly the tool they need, that doesn’t matter too much. Fancy UX will still require some frontend skill and sensibility.
- Belief: Combining all of the above, it feels like there is now a genuine chance for lay people to build their own software customised to their needs. To break out of generic software that ends up shaping us, rather than being shaped to us. But I find it a little hard to see how we get there. Companies are still built around control of the experience, and control of the experience is exactly what people could wrest back with these tools. It’ll be a battle.
Looking forward
I was inspired to write this post by a trio of articles I read this week. All three capture my personal zeitgeist at this point: that we have to figure out the relationship between human coders and AI; that AI agents lead towards burnout; and that AI is getting extremely good at coding. I don’t know where those take us, so I’ll pull out a quote from each piece and add a little commentary.
The relationship
Let’s start with The Final Bottleneck by Armin Ronacher, who wrote flask amongst other things:
Historically, writing code was slower than reviewing code.
It might not have felt that way, because code reviews sat in queues until someone got around to picking it up. But if you compare the actual acts themselves, creation was usually the more expensive part. In teams where people both wrote and reviewed code, it never felt like “we should probably program slower.”
So when more and more people tell me they no longer know what code is in their own codebase, I feel like something is very wrong here and it’s time to reflect.
I’ve been thinking for a while that AI’s ability to produce code faster and faster (and faster) is an issue. Leaving humans the tedious and hard job of reviewing code that looks right but often has subtle issues feels like a pretty big latent risk. We’re definitely better at writing correct code than at validating the correctness of code we didn’t write.
I read somewhere that someone said we wanted a robot to fold clothes so that we could do more art, but instead we got a robot that does “art” and we are still stuck with the laundry. It feels like that with coding agents — we lost the joyful part of coding, but are still stuck with the shitty bit. Yes, yes, writing specs blah blah your job is making products blah — even so, a lot of us just enjoy crafting code.
AI slop PRs are becoming a problem for OSS. This feels like a clear manifestation of our not yet having norms around AI code. I like the idea of Vouch by Mitchell Hashimoto to help with this. Vouch describes the problem well:
Historically, the effort required to understand a codebase, implement a change, and submit that change for review was high enough that it naturally filtered out many low quality contributions from unqualified people. For over 20 years of my life, this was enough for my projects as well as enough for most others.
Unfortunately, the landscape has changed particularly with the advent of AI tools that allow people to trivially create plausible-looking but extremely low-quality contributions with little to no true understanding. Contributors can no longer be trusted based on the minimal barrier to entry to simply submit a change.
Also from Mitchell Hashimoto, I enjoyed My AI Adoption Journey. It’s a slightly calming take on things; perhaps they are not moving too fast. Or maybe they are, but we can still reach a balance.
The addiction
We ended the previous section with some positivity, but frankly I think things will get worse before they get better. I had been feeling my way towards the conclusion of The AI Vampire by Steve Yegge myself, because I can sympathise with the need to do one more tweak with Bob or Claude before closing the laptop.
Agentic software building is genuinely addictive. The better you get at it, the more you want to use it. It’s simultaneously satisfying, frustrating, and exhilarating. It doles out dopamine and adrenaline shots like they’re on a fire sale.
Many have likened it to a slot machine. You pull a lever with each prompt, and get random rewards and sometimes amazing “payouts.” No wonder it’s addictive.
Variable rewards are exactly what you get from AI agents, and the good rewards are really good. Seeing ideas brought to life so quickly is rewarding, and the agents provide more than enough success to keep you coming back.
It’s easy to see why this is addictive. And the damage that comes from addiction is real and dark and nasty.
The endgame
Putting that aside, I think we are at the point where the real question has become: how close are we to the endgame, where AI code no longer needs to be reviewed by a human? Or do we accept the risks and give in to the speed?
While 2023 and 2024 saw mostly modest gains in coding, things have exploded since Claude Code hit the scene in 2025. Models might be advancing slowly in some ways, but because coding is hugely amenable to reinforcement learning in post-training, we’re seeing advances mount up quickly.
So I’ll close with My GPT-5.3-Codex Review — matt shumer.
“But Matt!”, you say. “Judgment is uniquely human!” I am sorry, but no.
It has become increasingly clear that as long as data for a given thing exists, a model trained on that data can do that thing. Human judgment is available in vast amounts of data on the internet. The model companies are paying tons of money for data that will help the model with judgment and taste as well. This is the first model that feels like it has internalized that at a deep level for a specific domain.
When a prompt leaves room for interpretation, GPT-5.3-Codex tends to choose what I would have chosen. It fills in missing context in a way that feels aligned with how I actually think about the problem.
Matt’s view is that GPT-5.3-Codex is another big step forward — a model that can keep going for hours at a time without coming off the rails, because its sense of judgement seems to be a step change from previous models.
Will we keep seeing the jagged frontier that is AI coding advance as quickly as it did in 2025? It shows no signs of slowing so far in 2026. But I think we are beginning to see an ugly side of this, in both the drudgery of endless code review and the emergence of addictive patterns.
Maybe 2026 is when we hit an inflection point in how we work with AI. Do we give up on validation and let AI write code as fast as it can, or do we accept that our software needs to move at human speed, even if we are a bottleneck? What do we do about the slot-machine-like addictiveness? Do we all get burned out?
I look forward to reading this in a year. I expect I will be surprised by at least some of what I and others believed at this point, which is why it’s good to write it down, even if this particular piece is rather slapdash.