Boring Enough to Ignore

You write gcc main.c, hit enter, and look away. Somewhere in there your source becomes a few thousand lines of x86 you will never read. If I told you to sit and review the assembly the compiler emitted, ready to pounce when it picked the wrong register, you would think I had lost the plot.

Now watch me on a normal Tuesday. Four agents are running. Two are writing tests, one is refactoring a module I half-remember, one is chasing a flaky CI job. The dashboard is green. I have eight tabs and a feeling I would describe, if you asked me in the moment, as leverage. I am doing the work of four engineers. The numbers say so.

My body says something else. I am not absorbed in anything. My eyes go to the terminal that just printed a line, then to the one that went quiet too long, then back. I am not building. I am watching for the moment one of them does something wrong, so I can catch it. That is not flow. That is a night shift in a control room, and it leaves me tired in the specific way you are tired after a long drive in fog: alert the whole time, with nothing to show for the alertness.¹

The thing I am calling leverage is vigilance, and the two feel alike from the inside for about the first hour.

Here is the tell. Flow has a direction: you are moving toward something, and the feedback comes from the thing itself getting closer. Vigilance has no direction. You hold still, waiting for a signal that mostly does not come, and when it does you drop everything and context-switch into a problem something else started. The dashboard sells me the first story. My attention is living the second.

And the closer the agents get to working, the worse this gets. Humans are measurably terrible at watching a source that rarely does anything: sustained attention to a low-event display falls apart after about half an hour, and the rarer the event, the worse you are at catching it.² So a green dashboard is not reassuring; it is the problem. The reward for good agents is a worse watchman.

None of this is new, which is the part that should bother you.

In 1983 Lisanne Bainbridge named the irony of automation. You automate the easy parts because a machine does them well, and what you leave the human is the two things humans are worst at: monitoring a system that is fine, and taking over the instant it is not. The monitoring erodes the very skill the takeover needs, so you are kept on call precisely for the case you are now least equipped to handle.³ Swap "process plant" for "agent fleet" and she described my Tuesday forty years early. We just put a nicer dashboard on it.

And the industry's instinct is to make the dashboard nicer still. More observability, more diff review, a human in every loop. Watch the agents better.

The obvious fix is exactly wrong, because it doubles down on the one task we are bad at. Go back to the compiler to see why.

Nobody watches a compiler, and a compiler took over a far more total job than any agent has. We flow right past it. The reason is the opposite of vigilance: not that we watch it well, but that we stopped watching it at all. And the reason we could stop is not that the compiler earned our trust. Trust is a feeling, and feelings are uncalibrated; blind trust and blanket distrust are both documented failures.⁴ Boring is a property. You do not ignore the compiler because you believe in it. You ignore it because the type checker, the test, the crash will tell you if it lied. The job got boring enough to ignore, and your attention floated up a level to the thing you care about.

That, and not better dashboards, is the real craft of orchestrating agents. Not running more of them. Making more of what they do boring enough to ignore.

And boring is buildable, not a vibe. A diff is boring when a failing test will catch what I did not read. A command is boring when it runs in a sandbox that cannot touch anything I care about, so the worst case is a wasted minute. A change is boring when it is small enough to revert without archaeology, when it crosses a typed contract that refuses to compile if it lies. Each of these is a wall. Inside the wall I do not have to watch, because the wall watches.⁵ The trust is an output of that machinery, not an input. Flow is downstream of the machinery, never of willpower.
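One way to picture a wall is as a gate that decides whether a change needs my eyes at all. What follows is a minimal sketch, not a real tool: the `Change` fields, the 200-line cutoff, and the function names are all assumptions I made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical summary of one agent-produced change."""
    lines_changed: int
    tests_passed: bool
    typecheck_passed: bool
    ran_in_sandbox: bool


# Assumed cutoff for "small enough to revert without archaeology".
MAX_BORING_DIFF = 200


def broken_walls(change: Change, max_diff: int = MAX_BORING_DIFF) -> list[str]:
    """Return the names of the walls this change failed to stay inside."""
    broken = []
    if not change.tests_passed:
        broken.append("tests")
    if not change.typecheck_passed:
        broken.append("typed contract")
    if not change.ran_in_sandbox:
        broken.append("sandbox")
    if change.lines_changed > max_diff:
        broken.append("small diff")
    return broken


def needs_my_eyes(change: Change) -> bool:
    """A change is boring only when every wall held."""
    return len(broken_walls(change)) > 0
```

The point of returning the list of broken walls rather than a bare yes or no is that when I do get pulled in, I am told which wall gave way, so I start at the failure instead of rereading the whole diff.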

The test for whether a piece is boring yet is one line: its failure has to be cheap and loud instead of expensive and silent. A wall you cannot feel break is not a wall.
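The difference between loud and silent can be as small as what a check does when it fails. A rough sketch, with hypothetical names throughout (`silent_wall`, `loud_wall`, `WallBroken`, and the always-failing `schema_matches` check are mine, not anyone's API):

```python
import logging

logger = logging.getLogger("walls")


class WallBroken(Exception):
    """Raised when a check fails, so the failure cannot pass unnoticed."""


def silent_wall(check) -> None:
    """Anti-pattern: the failure is logged at debug level and work carries on."""
    try:
        check()
    except Exception as err:
        logger.debug("check failed: %s", err)


def loud_wall(check) -> None:
    """The failure stops the pipeline and names the check that broke."""
    try:
        check()
    except Exception as err:
        raise WallBroken(f"{check.__name__} failed: {err}") from err


def schema_matches() -> None:
    """Hypothetical check that always fails, for demonstration only."""
    raise ValueError("field 'id' changed from int to str")
```

Both functions run the same check. Only `loud_wall` makes the break something I can feel; `silent_wall` leaves me exactly where I started, watching the terminal in case something went wrong.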

Here is where I have to stop selling it. A compiler is deterministic. It earned global silence: valid input in, correct output every time, ignore it everywhere, forever. An agent is probabilistic. It can be right a thousand times and wrong the thousand-and-first in a way no rule forbids, so there may be no point at which you can stop watching all of it.

So you stop watching parts of it.

That is the craft, and it is smaller and more concrete than "orchestration." You do not make an agent trustworthy. You wall off one slice of its output until watching that slice earns you nothing, and you do this ruthlessly, slice by slice. You are not certifying the agent. You are fencing the blast radius of one task.

And the leak runs the other way too. When an abstraction leaks, the hidden layer below grabs your attention back.⁶ With a compiler the rare leak is performance. With an agent the leak is intent, taste, context, security, architecture, the question of who is responsible when it is wrong. Those never seal. They are yours. So the work splits cleanly: the part you fenced, which is boring, and the part that leaks, which is the only part worth your flow.

Most people invert this. They watch the diffs, which a test could watch, and skim the architecture, which only they can. They spend their scarcest attention on the work a wall could do and wave through the work that has no wall and never will. That is the tax the dashboard hides: the slice of attention an agent quietly garnishes from you while you admire the throughput, and attention is the only thing you brought to the desk that was ever scarce.

So the leverage question is wrong from the start. Everyone asks how many agents they can run at once. That number is a vanity metric; it measures how thin you spread, which is how much you watch. Ask the other question. How much of this have I made boring enough to ignore? That is the count of slices where failure is caught without you. Drive it up and you do not get a busier operator. You get a quiet one, back in flow on the part that leaks, while the walls hold the rest. Run only as many agents as you have walls for.

Some of it will not wall off. The judgment, the taste, the call on what is worth building: no test catches a wrong intent. That part you watch forever, and you should. That is not a gap in the method. It is the method telling you where you are still the engineer.

So stop counting agents. Walk your pipeline and mark each piece boring or leaky. For every leaky piece, build the wall that makes it boring, or admit it is yours to watch and stop pretending an agent owns it. The walls are the work now.
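The walk itself fits in a few lines. This is a sketch under stated assumptions: the `Piece` type, the example pipeline, and the cheap/loud judgments attached to each entry are hypothetical, and in practice those two booleans are the hard part, since you have to answer them honestly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """One step of a hypothetical pipeline, judged by how it fails."""
    name: str
    failure_is_cheap: bool
    failure_is_loud: bool


def is_boring(piece: Piece) -> bool:
    """Boring means failure is both cheap and loud."""
    return piece.failure_is_cheap and piece.failure_is_loud


def audit(pieces: list[Piece]) -> tuple[list[str], list[str]]:
    """Split a pipeline into (boring, leaky) piece names, preserving order."""
    boring = [p.name for p in pieces if is_boring(p)]
    leaky = [p.name for p in pieces if not is_boring(p)]
    return boring, leaky


# Assumed example, loosely based on the four agents from my Tuesday.
PIPELINE = [
    Piece("test writing", failure_is_cheap=True, failure_is_loud=True),
    Piece("module refactor", failure_is_cheap=False, failure_is_loud=True),
    Piece("flaky CI triage", failure_is_cheap=True, failure_is_loud=False),
    Piece("architecture call", failure_is_cheap=False, failure_is_loud=False),
]

boring, leaky = audit(PIPELINE)
# One agent per wall: the budget is the boring count, not the tab count.
agent_budget = len(boring)
```

In this made-up pipeline only one piece clears the bar, so the honest number of agents to run unattended is one, not four. Every name in `leaky` is either a wall to build or a thing to keep watching on purpose.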

You will know you have it right when the agents get dull, when you forget they are running, when a thing breaks and a red test, not your tired attention, is what tells you. That is not negligence. That is what flow with a machine has always felt like, the whole time, going back to the first compiler nobody ever watched. The only question left is the one worth sitting with: which parts of this have I made boring enough to stop watching, and which parts have I only managed to stop looking at.

Notes

1. The state I keep contrasting with vigilance is flow, in Mihaly Csikszentmihalyi's sense: total absorption in a task done for its own sake, requiring a challenge-skill balance, clear goals, immediate feedback, and the merging of action and awareness (Flow: The Psychology of Optimal Experience, 1990). It is defined for a doer, not a supervisor, which is the asymmetry the whole essay leans on. Monitoring a fleet supplies almost none of it: the goal is "catch the rare error," the feedback is sparse, and action and awareness split apart. The interruption cost compounds the loss; Peopleware (DeMarco and Lister, 1987) puts re-entry into flow after a break at roughly fifteen minutes, which an agent demanding attention every few minutes never lets you pay back. Flicking between agent panes is a manager's schedule you imposed on yourself by accident (Paul Graham, "Maker's Schedule, Manager's Schedule", 2009).
2. The vigilance decrement: Norman Mackworth's 1948 "Clock Test" found sustained detection of rare signals dropping sharply after about 30 minutes on watch, an effect replicated across sensory modalities since. Watching for infrequent failures is close to a worst case for human attention. Mackworth, N. H. (1948), "The breakdown of vigilance during prolonged visual search," Quarterly Journal of Experimental Psychology 1(1):6-21; overview: Wikipedia, "Vigilance (psychology)".
3. Lisanne Bainbridge, "Ironies of Automation," Automatica 19(6):775-779, 1983. The two ironies I lean on: the designer who automates the easy parts leaves the operator the parts that could not be automated, which are the hardest ones, and a human asked to monitor an automated system loses, through disuse, the manual skill needed to take over when it fails. Mirror: gwern.net PDF.
4. Parasuraman and Riley, "Humans and Automation: Use, Misuse, Disuse, Abuse," Human Factors 39(2):230-253 (1997). Reliance has to be calibrated to actual reliability; automation bias (over-trust) and disuse (under-trust) are symmetric failure modes. My claim is sharper than calibration: walling the work off tries to remove the need to rely on a per-task judgment at all, by making the cost of being wrong inside the wall low enough that reliance is not the load-bearing thing. APA record.
5. The concrete mechanisms (tests, sandboxes, typed contracts, small reversible diffs, rollback, provenance) are not novel; what is new is treating them as attention infrastructure rather than just quality infrastructure. Their job here is to let you not look. This is also why "feels like leverage" is an unreliable signal: in a 2025 randomized trial, 16 experienced open-source developers using AI tools were about 19% slower at their tasks while believing they were about 20% faster. Small sample, mature repos, expert maintainers, one toolset and moment in time, so read it as a sharp illustration of the perception gap, not a universal slowdown. METR, "Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity".
6. Joel Spolsky, "The Law of Leaky Abstractions," 2002: "All non-trivial abstractions, to some degree, are leaky," and when they leak the hidden lower layer suddenly demands the expertise you thought you had abstracted away. With deterministic tools the leak is usually performance. With agents the leak is intent, taste, context, security, architecture, and responsibility, which is why those parts never become boring. joelonsoftware.com. This is also the tension I am not resolving. Sandboxes escape, tests have coverage holes, types encode only some of intent, and "small diff" is a heuristic, not a guarantee. The honest version of "boring enough to ignore" is "boring enough that the residual risk is below your bar," and you still have to set the bar. A deterministic seal can hold globally; a probabilistic one may only ever hold locally, and the place it stops holding is exactly where your judgment leaks back in. I would rather state that than promise an autonomy the shape of the abstraction may not allow.