AI did not remove review. It moved more work there.
That is the part I keep coming back to. The demos are all about generation: write the component, build the app, fix the bug, scaffold the service. The quieter cost shows up after the code exists. Someone still has to decide whether the code should exist, whether the shape is right, and whether the extra surface area is worth owning.
The frontier teams are already living in that world. OpenAI said GPT-5.3-Codex was “instrumental in creating itself.” Reports around Claude Code and Cowork tell the same story from the Anthropic side: small human teams describe the work, steer the agent, and review the output. That sounds like a productivity win until you follow the burden all the way to the reviewer.
The old bottleneck was getting enough code written. The new bottleneck is proving that a larger amount of code is necessary, understandable, and covered by the right tests.
I think the review question is changing. It used to be “does this work?” That question still matters, but it is no longer enough. The more useful question is “should this much code exist?”
LogRocket had an example that stuck with me: a 29-line implementation turned into 186 lines in Claude Code's version. Same feature. More branches, more defensive handling, more surface area. That is not automatically bad. Sometimes the generated version is catching real edge cases. But someone has to read every line and decide which parts are care and which parts are clutter.
The OCaml maintainers ran into the harsher version of this. They rejected a 13,000-line AI-generated pull request. The issue was not simply whether the code compiled. It was review bandwidth, ownership, copyright risk, and whether future maintainers could explain the code at 2 a.m. when it broke.
That last part matters more than the benchmark charts. Code that works today but is not understood by anyone on the team is not free. It is borrowed confidence.
Cursor buying Graphite, a code review tool, made more sense to me after that. Code generation is crowded. Review is where the pressure lands. Cursor's Agent Trace spec is interesting for the same reason. Knowing which model produced a line, which conversation led to it, and which commit carried it gives teams a better audit trail. It does not prove the code is correct, but it gives reviewers a starting point.
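To make the audit-trail idea concrete, here is a minimal sketch of what a reviewer-facing lookup could look like. This is not the Agent Trace schema. `TraceRecord`, its field names, `provenance_for`, the file path, and the IDs are all hypothetical, chosen only to mirror the three things above: the model, the conversation, and the commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TraceRecord:
    """Hypothetical provenance entry; NOT the actual Agent Trace schema."""
    path: str             # file the attributed lines belong to
    start_line: int       # first attributed line, 1-indexed, inclusive
    end_line: int         # last attributed line, inclusive
    model: str            # model identifier as reported by the tool
    conversation_id: str  # conversation that produced the change
    commit: str           # commit that carried the lines


def provenance_for(
    records: list[TraceRecord], path: str, line: int
) -> Optional[TraceRecord]:
    """Return the last record covering path:line, or None if untraced."""
    match = None
    for record in records:
        if record.path == path and record.start_line <= line <= record.end_line:
            # Later records win, assuming the list is in commit order.
            match = record
    return match


# Made-up sample data for illustration only.
records = [
    TraceRecord("billing/refund.py", 10, 42, "example-model", "conv-001", "a1b2c3d"),
    TraceRecord("billing/refund.py", 30, 35, "example-model", "conv-002", "e4f5a6b"),
]

hit = provenance_for(records, "billing/refund.py", 32)
print(hit.conversation_id if hit else "untraced")
```

A lookup like this tells the reviewer where to start reading. It says nothing about whether line 32 is right.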
Payments makes this less abstract. A sloppy UI branch is annoying. A sloppy payment path is a support queue, a reconciliation problem, a chargeback trail, or an audit conversation. “The AI wrote it and I glanced at the diff” is not a serious answer when money moved incorrectly.
The security numbers are ugly, and I do not want to overstate them, because the exact rates will move around as models and tooling change. The direction is still hard to ignore: AI-generated code is producing more review findings, more logic mistakes, and more security issues than teams expected. If generated code touches auth, money movement, customer data, or accounting state, the burden of proof needs to go up, not down.
This is the uncomfortable part: most developers do not love review. We understand why it exists, but given a choice between building something and auditing someone else’s patch, most of us would rather build. AI makes that tradeoff sharper. Now the patch may be larger, more polished, and less connected to a human author’s intent.
So the engineering job shifts. Less typing, more specification. Less authorship, more verification. The valuable skill is not producing lines quickly. The valuable skill is knowing what correct means before the lines exist, then building enough evidence that the team can trust the result.
That is why tests keep coming back into the center of this for me. Tests are not magic, and a green suite can still miss the important bug. But without tests, AI-assisted development turns into review by vibes. The code compiles. The assistant sounds confident. Everyone is tired. The diff goes in.
The path forward is probably less glamorous than the demos: smaller PRs, clearer ownership, traceable AI use, tests written against the requirement, property checks where invariants matter, and reviewers who are allowed to reject code they cannot explain.
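Here is a small sketch of what "tests written against the requirement" and "property checks where invariants matter" can look like for a payment-shaped problem. Everything in it is an assumption for illustration: the function `split_cents`, the requirement that installments are near-equal with extra cents going to the earliest ones, and the ranges used for random inputs. It uses only the standard library. A property-testing library such as Hypothesis would do the same job with better input generation.

```python
import random


def split_cents(total_cents: int, parts: int) -> list[int]:
    """Split an amount in integer cents into `parts` near-equal installments."""
    if total_cents < 0 or parts <= 0:
        raise ValueError("total_cents must be >= 0 and parts must be > 0")
    base, remainder = divmod(total_cents, parts)
    # The first `remainder` installments carry one extra cent each.
    return [base + 1 if i < remainder else base for i in range(parts)]


def test_requirement_example() -> None:
    # Written from the (assumed) requirement, not from the implementation.
    assert split_cents(1000, 3) == [334, 333, 333]


def test_split_invariants(trials: int = 1000, seed: int = 0) -> None:
    rng = random.Random(seed)  # fixed seed so a failure is reproducible
    for _ in range(trials):
        total = rng.randint(0, 10_000_000)
        parts = rng.randint(1, 60)
        result = split_cents(total, parts)
        assert len(result) == parts
        assert sum(result) == total            # no cent created or lost
        assert min(result) >= 0                # no negative installment
        assert max(result) - min(result) <= 1  # installments stay near-equal


test_requirement_example()
test_split_invariants()
print("all checks passed")
```

The invariants are the part a reviewer can hold onto. If a generated rewrite of `split_cents` grows from a few lines to a few dozen, the question is still whether every cent is accounted for, and the check answers that without anyone trusting the diff on sight.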
I do not think this makes AI coding bad. I think it makes review discipline more important than generation speed. If the team cannot review the work, the speed is fake.
Sources I kept around while editing:
- OpenAI on GPT-5.3-Codex
- Axios on Anthropic Cowork
- LogRocket on AI moving the bottleneck to review
- DevClass on the OCaml AI-generated pull request
- Cursor’s Agent Trace spec
- Cursor on Graphite joining Cursor
- Addy Osmani on proving AI-written code works
- ITPro on AI-generated code security findings
- Cortex on engineering in the age of AI