I want to move on to the next phase of AI programming. All these SKILLS, agentic programming, and whatnot remind me of the era of servlets, RMI, Flash… all of that is obsolete; we have better tools now. Hope we can soon reach the “JSON over HTTP” version of AI: simple but powerful.
Level 4 is where I see the most interesting design decisions get made, and also where most practitioners take a shortcut that compounds badly later.
When the author talks about "codifying" lessons, the instinct for most people is to update the rules file. That works fine for conventions - naming patterns, library preferences, relatively stable stuff. But there's a different category of knowledge that rules files handle poorly: the why behind decisions. Not what approach was chosen, but what was rejected and why the tradeoff landed where it did.
"Never use GraphQL for this service" is a useful rule to have in CLAUDE.md. What's not there: that GraphQL was actually evaluated, got pretty far into prototyping, and was abandoned because the caching layer had been specifically tuned for REST response shapes, and the cost of changing that was higher than the benefit for the team's current scale. The agent follows the rule. It can't tell when the rule is no longer load-bearing.
The place where this reasoning fits most naturally is git history - decisions and rejections captured in commit messages, versioned alongside the code they apply to. Good engineers have always done this informally. The discipline to do it consistently enough that agents can actually retrieve and use it is what's missing, and structuring it for that purpose is genuinely underexplored territory.
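A lightweight way to make that discipline concrete is a trailer convention in commit messages that an agent can parse back out. A hedged sketch in Python (the `Decision:`/`Rejected:` trailer names and the sample message are my own illustration, not an established convention):

```python
# Hypothetical commit message using "Decision:"/"Rejected:" trailers
# to record what was chosen AND what was ruled out, and why.
SAMPLE = """Switch service API back to REST

Decision: keep REST; caching layer is tuned for REST response shapes.
Rejected: GraphQL (prototyped; migration cost exceeded benefit at current scale).
"""

def trailers(message: str, key: str) -> list[str]:
    """Extract the values of 'Key: ...' lines from a commit message body."""
    prefix = f"{key}: "
    return [line[len(prefix):] for line in message.splitlines()
            if line.startswith(prefix)]

# An agent could run `git log --format=%B -- <path>` and feed each
# message through this to recover the rejection history for a file.
print(trailers(SAMPLE, "Rejected"))
```

Git's own `git interpret-trailers` already supports a similar key-value trailer format, which may be a better anchor point for a real convention than anything ad hoc.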
At level 7, this matters more than people expect. Background agents running across sessions with no human-in-the-loop have nothing to draw on except whatever was written down. A stale rules file in that context doesn't just cause mistakes - it produces confident mistakes.
I had a hunch that this comment was LLM-generated, and the last paragraph confirmed it. Kudos for managing to get so many upvotes though.
"Where most [X] [Y]" is an up and coming LLM trope, which seems to have surfaced fairly recently. I have no idea why, considering most claims of that form are based on no data whatsoever.
A good rule, then, would be to capture such reasoning in the commit messages the agent creates, at least when it comes up during the session with the agent.
That’s exactly the direction I went. I'm working on a spec for this and planning to post it here soon:
https://github.com/berserkdisruptors/contextual-commits
As a lowly level 2 who remains skeptical of these software “dark factories” described at the top of this ladder, what I don’t understand is this:
If software engineering is enough of a solved problem that you can delegate it entirely to LLM agents, what part of it remains context-specific enough that it can’t be better solved by a general-purpose software factory product? In other words, if you’re a company that is using LLMs to develop non-AI software, and you’ve built a sufficient factory to generate that software, why don’t you start selling the factory instead of whatever you were selling before? It has a much higher TAM (all of software)
We are not there yet. While there are teams applying dark factory models to specific domains with self-reported success, it has yet to be proven, or to be shown generalizable enough to apply everywhere.
Why sell the factory when you can create automated software cloner companies that make millions off of instantly copying promising startups as soon as they come out of stealth?
If you could get a dark factory working when others don't have one, you can make much more money using it than however much you can make selling it
That’s not true. Even if we assume LLMs can generate the code needed to support the next Facebook, one still has to buy or rent tons of hardware (virtual or bare metal), pour money into marketing, break the network effect, and pay for third-party services for monitoring, alerting, and whatnot. That’s money, and LLMs don’t help with that.
Producing the software is only a small part of the picture when it comes to generating revenue.
So far, we haven’t seen much to suggest that LLMs can (yet) replace sales and most of the related functions.
Too bad they can't.
Also a measly level 2er. I'm curious what kind of project truly needs an autonomous agent team Ralph-looping out 10,000 LOCs per hour? Seems like harness-maxxing is a competitive pursuit in its own right, existing outside the task of delivering software to customers.
Feels like K8s cult, overly focused on the cleverness of _how_ something is built versus _what_ is being built.
Software that is otherwise not feasible for humans to build by hand.
I have the same question about people who sell "get rich with real estate" seminars.
Codex and Claude Code are these (proto)factories you talk about - almost every programmer uses them now.
And when they become fully dark factories, yes, a LOT of software companies will simply disappear, disintermediated by Codex/Claude Code.
Floating what you call levels 6, 7 and 8. I have a strong harness, but manually kick off the background agents which pick up tasks I queue while off my machine.
I've experimented with agent teams. However, the current implementation (in Claude Code) burns tokens: I used one prompt to spin up a team of 9+ agents, and Claude Code used up about 1M output tokens. Granted, it was a long, very long-horizon task (it kept itself busy for almost an hour uninterrupted), but 1M+ output tokens is excessive. What I also find is that for parallel agents, the UI is not good enough yet when you run it in the foreground. My permission management is set up so that I almost never get interrupted, but it took a lot of investment to get it that way. Most users will likely run agent teams in an unsafe fashion. From my point of view, the devex for agent teams does not really exist yet.
I really like your post and agree with most things. The one thing I am not fully sure about:
> Look at your app, describe a sequence of changes out loud, and watch them happen in front of you.
The problem a lot of the time is that either you don't know what you want, or you can't communicate it (and usually you can't communicate it properly because you don't know exactly what you want). I think this is going to be the bottleneck very soon (for some people, it already is). I am curious what your thoughts are on this: where do you see it going, and how do you think we can prepare for and address it? Or do you not see it as an issue?
Reminds me of a colleague who said they don't need to learn to type faster, since they use the time to think what they want to write.
I coded a level 8 orchestration layer in CI for code review, two months before Claude launched theirs.
It's very powerful and agents can create dynamic microbenchmarks and evaluate what data structure to use for optimal performance, among other things.
I also have validation layers that trim hallucinations with handwritten linters.
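One minimal form of such a linter (a hedged sketch of the idea, not the commenter's actual implementation) is an AST pass that flags calls to names the file never imports or defines, a cheap deterministic check against one common hallucination class:

```python
import ast
import builtins

def undefined_calls(source: str) -> list[str]:
    """Flag simple calls to names never imported or defined in the file."""
    tree = ast.parse(source)
    known = set(dir(builtins))  # len, print, range, ...
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            known.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            known.add(node.name)
        elif isinstance(node, ast.Assign):
            known.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return sorted({
        node.func.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Call)
        and isinstance(node.func, ast.Name)
        and node.func.id not in known
    })

# "fetch_user" was never imported or defined, so it gets flagged.
print(undefined_calls("import os\nos.getcwd()\nfetch_user(42)\nlen('x')"))
```

This obviously misses attribute calls, star imports, and names bound in other scopes; the point is that even a crude handwritten check trims hallucinated helper functions cheaply and deterministically.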
I'd love to find people to network with. Right now this is a side project at work on top of writing test coverage for a factory. I don't have anyone to talk about this stuff with so it's sad when I see blog posts talking about "hype".
Do you feel like you are still learning about the programming language(s) and other technologies you are using? Or do you feel like you are already a master at them?
Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.) And if not, whether or not you think that matters.
I'm learning more than ever before. I'm not a master at anything but I am getting basic proficiency in virtually everything.
> Do you ever take the time to validate what one of the agents produces by going to the docs? Or is all debugging/changing of the code done via LLMs/agents?
I divide my work into vibecoding PoC and review. Only once I have something working do I review the code. And I do so through intense interrogation while referencing the docs.
> I'm more like level 2 right now and genuinely curious if you feel like learning continues for you (besides with agentic orchestration, etc.)
Level 8 only works in production for a defined process where you don't need oversight and the final output is easy to trust.
For example, I made a code review tool that chunks a PR and assigns rule/violation combos to agents. This got a 20% reduction in time-to-merge and catches 10x the issues of any other agent because it can pull context. And the output is easy to incorporate since I have a manager agent summarize everything.
Likewise, I'm working on an automatic performance tool right now that chunks code, assigns agents to make microbenchmarks, and tries to find optimization points. The end result should be easy to verify, since the final suggestion would be "replace this data structure with that one; here's a microbenchmark proving it".
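The kind of self-verifying suggestion described is easy to picture with a stdlib microbenchmark; here's a generic example (not the commenter's tool) for the classic list-vs-set membership case:

```python
import timeit

# Membership test on the last element: O(n) scan for a list,
# O(1) hash lookup for a set.
n = 10_000
as_list = list(range(n))
as_set = set(as_list)

t_list = timeit.timeit(lambda: n - 1 in as_list, number=1_000)
t_set = timeit.timeit(lambda: n - 1 in as_set, number=1_000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s  speedup: {t_list / t_set:.0f}x")
```

A human reviewer doesn't have to trust the agent's reasoning, only run the benchmark, which is what makes this output shape easy to verify.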
Got it. This all makes sense to me. Very targeted tooling that is specific to your company's CI platform as opposed to a dark factory where you're creating a bunch of new code no one reads. And it sounds like these level 8 agents are given specific permission for everything they're allowed to do ahead of time. That seems sound from an engineering perspective.
Also would be interested in an example of "validation layers that trim hallucinations with handwritten linters" but understand if that's not something you can share. Either way, thanks for responding!
I got my own level 8 factory working in the last few days and it’s been exhilarating. Mine is based on OpenAI’s Symphony[1], ported to TypeScript.
Would be happy to swap war stories.
<myhnusername>@gmail.com
How much money have you made with this approach
I think the opposite question is more prevalent, how much money have you spent?
... is that the purpose of life? The sole reason for doing anything?
Level 9: agent managers running agent teams
Level 10: agent CEOs overseeing agent managers
Level 11: agent board of directors overseeing the agent CEO
Level 12: agent superintelligence - single entity doing everything
Level 13: agent superagent, agenting agency agentically, in a loop, recursively, mega agent, agentic agent agent agency super AGI agent
Level 14: A G E N T
Level 15 (if not succumbed to fatal context poisoning from malicious agent crime syndicate): Agents creating corporations to code agentic marketplaces in which to gamble their own crypto currencies until they crash the real economy of humans.
Level 16: it’s not level 16, it’s level 17.
The steps are small at the front and huge at the bottom, and the list carries a lot of opinions on the last 2 steps (particularly step 7).
That's a smell for where the author and maybe even the industry is.
Agents don't have any purpose or drive like humans do; they are probabilistic machines, so eventually they are limited by the finite amount of information they carry. Maybe that's what's blocking level 8, or blocking it from working like a large human organization.
Level 8 sounds a lot like the StrongDM AI Dark Software Factory published last month:
https://factory.strongdm.ai/techniques
Techniques covered in-depth + Attractor open source implementations:
https://factory.strongdm.ai/products/attractor#community
https://github.com/search?q=strongdm+attractor&type=reposito...
https://github.com/strongdm/attractor/forks
I'm continuing to study and refine my approach to leverage all this.
The thing blocking level 8 isn't the difficulty of orchestration, it's the cost of validation. The quality of your software is a function of the amount of time you've spent validating it, and if you produce 100x more code in a given time frame, that code is going to get 1/100th as much validation, and your product will be lower quality as a result.
Spec driven development can reduce the amount of re-implementation that is required due to requirements errors, but we need faster validation cycles. I wrote a rant about this topic: https://sibylline.dev/articles/2026-01-27-stop-orchestrating...
> If your repo requires a colleague's approval before merge, and that colleague is on level 2, still manually reviewing PRs, that stifles your throughput. So it is in your best interest to pull your team up.
Until you build an AI oncaller to handle customer issues in the middle of the night (and, depending on your product, an AI who can be fired if customer data is corrupted or lost), no team should be willing to remove the "human reviews code" step.
For a real product with real users, stability is vastly more important than individual IC velocity. Stability is what enables TEAM velocity and user trust.
In my opinion there are 2 levels: human writes the code with AI assist, or AI writes the code with human assist; centaur or reverse-centaur. But this article tries to trace the evolution of the ideas and mistakenly terms them levels (indicating a skill ladder, as other commenters have noted) when they are more like stages the AI ecosystem has evolved through. The article reads better if you think of it that way.
There is another level - AI writes the code with AI assist.
That is just another level of reverse centaur and will eventually have a human ass attached to it.
These are levels of gatekeeping. The items are barely related to each other. Lists like these will only promote toxicity, you should be using the tools and techniques that solve your problems and fit your comfort levels.
I prefer Dan Shapiro's 5 level analogy (based on car autonomy levels) because it makes for a cleaner maturity model when discussing with people who are not as deeply immersed in the current state of the art. But there are some good overall insights in this piece, and there are enough breadcrumbs to lead to further exploration, which I appreciate. I think levels 3 and 4 should be collapsed, and the real magic starts to happen after combining 5 and 6; maybe they should be merged as well.
Yegge's list resonated a little more closely with my progression to a clumsy L8.
I think eventually 4-8 will be collapsed behind a more capable layer that can handle this stuff on its own, maybe I tinker with MCP settings and granular control to minmax the process, but for the most part I shouldn't have to worry about it any more than I worry about how many threads my compiler is using.
I was surprised the author didn’t mention Yegge’s list (or maybe I missed it in my skim).
>"Yegge's list resonated a little more closely with my progression to a clumsy L8."
I thought level 8 was a joke until Claude Code agent teams. Now I can't even imagine being limited to working with a single agent. We will be coordinating teams of hundreds by year's end.
I would not put it on a ladder. A ladder implies that the higher the rank, the better; really, you want to choose the best solution for your needs.
Level 4 is most interesting to me right now. And I would say we as an industry are still figuring out the right ergonomics and UX around these four things.
I spend a great deal of my time planning and assessing/reviewing through various mechanisms. I think I do codify in ways when I create a skill for any repeated assessment or planning task.
> To be clear, planning as a general practice isn't going away. It's just changing shape. For newer practitioners, plan mode remains the right entry point (as described in Levels 1 and 2). But for complex features at Level 7, "planning" looks less like writing a step-by-step outline and more like exploration: probing the codebase, prototyping options in worktrees, mapping the solution space. And increasingly, background agents are doing that exploration for you.
I mean, it's worth noting that a lot of plan modes are shaped to do Socratic discovery before creating plans, at any user level. Advanced users probably put a great deal of effort (or thought) into guiding that process themselves.
> ralph loops (later on)
Ralph loops have been nothing but a dramatic mess for me, honestly. They disrupt the assessment process where humans are needed. Otherwise, don't expect them to craft an extensive PRD without massive issues that are hard to review.
- It would seem that this is a harness problem, in terms of how a harness keeps an agent working and focused on specific tasks (relative to model capability), but not something a user should initiate on their own.
Oceania has always been context engineering. It's been interesting to see this prioritized in the zeitgeist over the last 6 months, displacing the earlier “long context” framing.
Good taxonomy. One thing missing from most discussions at these levels is how agents discover project context — most tools still rely on vendor-specific files (CLAUDE.md, .cursorrules). Would love to see standardization at that layer too.
One of the best articles I've read recently.
> Voice-to-voice (thought-to-thought, maybe?) interaction with your coding agent — conversational Claude Code, not just voice-to-text input — is a natural next step.
Maybe it's just me, but I don't see the appeal in verbal dictation, especially where complexity is involved. I want to think through issues deliberately, carefully, and slowly to ensure I'm not glossing over subtle nuances. I don't find speaking to be conducive to that.
For me, the process of writing (and rewriting) gives me the time, space, and structure to more precisely articulate what I want with a more heightened degree of specificity. Being able to type at 80+ wpm probably helps as well.
The power of voice dictation for me is that I can get out every scrap of nuance and insight I can think of as unfiltered verbal diarrhea. Doing this gives me a solid extra nine in my chance of getting good outputs.
Stream-of-consciousness typing for me is still slower and causes me to buffer and filter more; deliberately crafting a perfect prompt is far slower still.
LLMs are great at extracting the essence of unstructured inputs and voice lets me take best advantage of that.
Voice output, on the other hand, is completely useless unless perhaps it can play at 4x speed. But I need to be able to skim LLM output quickly and revisit important points repeatedly. Can't see why I'd ever want to serialize and slow that down.
>(Re: level 8) "...I honestly don't think the models are ready for this level of autonomy for most tasks. And even if they were smart enough, they're still too slow and too token-hungry for it to be economical outside of moonshot projects like compilers and browser builds (impressive, but far from clean)."
This is increasingly untrue with Opus 4.6. Claude Max gives you enough tokens to run ~5-10 agents continuously, and I'm doing all of my work with agent teams now. Token usage is up 10x or more, but the results are infinitely better and faster. Multi-agent team orchestration will be to 2026 what agents were to 2025. Much of the OP article feels 3-6 months behind the times.
What level is numeric patterns that evolve according to a sequence of arithmetic operations?