"Just Reviewing Code"
_Soon, developers will 'just' be reviewing the work that AI tools do._
I hear this a lot lately, and it's worth unpacking. The idea is that as AI tools get better at writing code, the developer's role shifts from writing to reviewing. That might sound like a step back, like reviewing is somehow less of a craft. But I think that misses what reviewing actually is.
To me, reviewing the underlying wires of a product is one of the most impressive and important skills a developer can have. And I think the bar isn't getting lower; it's getting higher.
Code is a translation
Here's the thing about software: code is not the product. It's a translation. Someone has an idea, a need, a goal, a problem they want solved. A developer's job is to take that abstract thing and turn it into something that actually works in the real world: reliably, for real people, over time. It is not just about writing code that runs.
Think about what a developer actually holds in their head when they're doing this job well. It's not just the code itself, it's the translation between three different worlds:
- Business: What we're trying to accomplish, and what "good" looks like at this stage of growth.
- Users: How people are actually using the product, including the ways nobody planned for.
- System: How this piece of code sits inside a larger universe, what it depends on, what depends on it.
Product managers see the surface. Leadership sees the strategy. But the developer is the one who can look at a three-line change and know whether it's going to quietly cause problems six months from now. That's a genuinely unique skill.
Not every change is a big deal. But the more complex the system, the more users, the more critical the feature, the more important it is to have someone who can understand the impact and make a call.
So what does reviewing actually take?
Great code review isn't only about catching typos or enforcing style guides. Ubiquitous language and consistent structure are what let humans and agents navigate the codebase without losing the plot. But at its core, review is about decision-making. Specifically, it's about someone asking: does this align with where we're going, and is it going to hold up over time?
That question touches a lot of things.
Understanding the business. A startup moving fast has different tradeoffs than a company in a regulated industry managing critical systems. Code that's totally reasonable in one context is a liability in the other. If you don't understand what your business actually values right now, you can't evaluate whether what's in front of you is right.
Reading what isn't there. This one is underrated. Some of the most important things in a diff are the things missing from it. The error case that wasn't handled, the edge case nobody thought about, the test that would've caught the bug. The surface might look fine, but deep understanding of the system and the users can reveal the cracks.
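To make that concrete, here's a hypothetical change that would sail through a superficial review (the function and scenario are invented for illustration):

```python
import json

def load_user_settings(path: str) -> dict:
    # Looks complete in the diff. The review question lives in the
    # lines that aren't here: what if the file is missing? What if
    # the JSON is malformed? What defaults does the caller silently
    # expect to be present?
    with open(path) as f:
        return json.load(f)
```

Nothing in those lines is wrong. The problem is everything around them, and only someone who knows how this function gets called will see it.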
Thinking about the future. Breaking changes are brutal, not just technically but in terms of what they cost you operationally. Users put their trust in the product and expect it to keep working with minimal effort or risk on their part. The good news is that much of this pain is preventable, but only if someone asks the right question early: are we okay living with this decision for the lifetime of this product? If a feature is tightly coupled to everything else, untangling it later becomes expensive. Someone needs to catch that on the way in, and it usually hides in the details.
Knowing the ecosystem. How is this thing being monitored? What happens to your dashboards when this ships? Where does something need to be logged so that you, or your agents, can debug it at 2am? What doors has this opened for an attacker? These aren't someone else's concerns. They're part of the same decision. And again, it hides in the details. If you don't know the ecosystem, you won't know what to ask about it.
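As a small, hypothetical sketch of what "debuggable at 2am" means in practice (the names are invented, and a structured log handler downstream is assumed):

```python
import logging

logger = logging.getLogger("billing")

def issue_refund(order_id: str, amount_cents: int, reason: str) -> None:
    # The ecosystem question isn't "is there a log line?" but "can
    # someone reconstruct what happened from it?" Attaching context
    # to the record beats a bare message when you're searching
    # dashboards in the middle of the night.
    logger.info(
        "refund issued",
        extra={
            "order_id": order_id,
            "amount_cents": amount_cents,
            "reason": reason,
        },
    )
```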
Holding tradeoffs. The most secure approach might not be the most user-friendly. Moving fast might mean accumulating debt and bugs. Simplicity now might mean inflexibility down the road. None of these has a universal right answer; it depends on your context, your users, your business stage. Someone has to make a call, or build the right process for delegating tradeoffs efficiently over time.
As teams grow, you need more people who can make those calls, not fewer. You can't route every ambiguous technical decision through a team lead or stakeholders. That doesn't scale. The people closest to the code have to be empowered to think at that level.
There is a shift happening in the role of developers, one that overlaps with things PMs and leadership have traditionally owned. But it's not a shift in responsibility; it's a shift in visibility and scale. Typing is getting easier, but judgment is becoming more important, not less. To scale out decision-making, PMs and leadership need to trust developers to make those calls, and developers need to be ready to own them.
The limits of automation
This is where evals and LLM-as-a-judge come in, and they're worth taking seriously. If the concern is that human review doesn't scale, it seems natural to ask: what if the reviewer was also an AI?
Evals can help. They can catch regressions, enforce consistency, flag things that look out of place. LLM-as-a-judge can surface issues faster than a human reading every diff. These tools have real value, and using them well is becoming its own skill.
But they don't close the loop on their own.
The deeper problem is that someone has to decide what the eval is even measuring. What counts as "correct" isn't self-evident. It depends on your business stage, your users, your risk tolerance, your system's history. An eval that made sense six months ago might be measuring the wrong thing today. And a judge that's never been calibrated to your context will produce confident, consistent, wrong answers.
Building a good evaluation process is itself a form of judgment. You have to decide what workflows to build, what signals actually matter, and how much trust to put in each layer. You have to choose tools and ecosystems that work for your specific needs, and then verify that they're producing results that are correct, consistent, and complete, not just results that look good on the surface. That last part is the hard part. Plausible and correct are not the same thing, and the difference only becomes visible when someone who understands the system deeply enough is paying attention.
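As a minimal sketch of where that judgment lives (the `call_llm` client and the rubric contents are placeholders, not any particular tool's API):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Verdict:
    passed: bool
    reasoning: str

# The rubric is the judgment; everything below it is plumbing.
# These hypothetical criteria encode *this* system's context today,
# and someone has to notice when they stop measuring the right thing.
RUBRIC = """\
- Error paths that matter for this system are handled.
- No breaking change to the public API without a migration note.
- Enough is logged to debug a failure in production.
"""

def judge_diff(diff: str, call_llm: Callable[[str], str]) -> Verdict:
    prompt = (
        "Review this code change against the rubric.\n"
        f"Rubric:\n{RUBRIC}\n"
        f"Diff:\n{diff}\n"
        "Start your answer with PASS or FAIL, then explain why."
    )
    answer = call_llm(prompt)
    return Verdict(
        passed=answer.strip().upper().startswith("PASS"),
        reasoning=answer,
    )
```

The harness is trivial. Choosing and maintaining the rubric, and checking the judge's verdicts against ground truth, is the part that stays human.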
There's also the question of improvement. Evals drift. Models change. The product evolves. The process that's validating your code today has to be maintained and updated as the ground truth shifts. That work doesn't happen automatically. Someone has to own it, understand why it's built the way it is, and have the judgment to know when it needs to change.
So yes, standardizing the review process through automation is important. It reduces noise, it makes the judgment more efficient, it scales. But humans have to be in the loop, both in building those processes and standards and in improving them over time. Not because automation isn't useful, but because the thing being automated is judgment about whether something is good enough, for this system, for these users, today.
What this means right now
Writing code is getting faster. That's just true. AI tools have compressed what used to take hours into something that takes minutes, and that's only going in one direction.
The value was never really in the typing. It was in the judgment. In knowing what should be built. In understanding how the underlying system connects to the value the business is trying to create and to the users. In catching the thing that looks fine on the surface but is actually a slow-burning problem.
We once migrated to a new architecture for scale. One feature from the old implementation had a security issue in the new design. An eval could flag the vulnerability. A PM could ask an LLM to explain the blast radius. But someone first had to know what to ask: which edge cases weren't in the code, what the logs didn't capture, where the LLM's read of the risk was missing runtime context it had never seen. And when customers needed a clear answer on what was affected and why, that same depth was the only thing that made the response honest and complete. You can't ask the right question, or answer it honestly, without already knowing the system well enough to know what's missing.
That judgment doesn't only happen at the level of vision or natural language, which is full of ambiguity. It also happens at the formal level of the platform, the tools, and the system, where nuances and tradeoffs are often hidden in the details. None of this gets easier with scale; those nuances and tradeoffs only become more complex and more important to get right.
Reviewing is deciding
Code review becoming a central aspect of software development highlights something that was true even before LLMs: building reliable systems for real people is fundamentally an act of judgment, not just execution.
The best developers aren't just technically sharp. They're the ones who understand the translation: from business goals to user behavior to system design to implementation details, and back again. They're the ones who notice when something is drifting out of coherence with where the product is headed. They're the ones who ask the uncomfortable question before the uncomfortable consequence.
Reviewing is not a checklist. Not a gatekeeping ritual. It is a form of ownership over something you understand better than almost anyone else in the room.
