I’ve been developing an open-source version of something similar[1] and have used it quite extensively (well over 1k PRs)[2]. I’m definitely a believer in the “prompt to PR” model. Very liberating to not have to think about managing the agent sessions. Seems you have built a lot of useful tooling (e.g., session videos) around this core idea.
Couple of learnings to share that I hope could be of use:
1) Execution sandboxing is just the start. For any enterprise usage you want fairly tight network egress control as well, to limit the chances of accidental leaks or malicious exfiltration if there’s any risk of untrusted material getting into the model context. Speaking as a decision maker at a tech company: we do actually review stuff like this when evaluating tools.
2) Once you have proper network sandboxing, you can secure credentials much better: give the agent only dummy surrogates and swap them for real creds on the way out (rough sketch after this list).
3) Sandboxed agents with automatic provisioning of the workspace from git can be used for more than just development tasks. In fact, it might be easier to find initial traction with more constrained and thus more predictable tasks, e.g., “ask my codebase” or “debug CI failures”.
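To make 2) concrete, here is a minimal sketch of the swap, assuming an egress proxy that the sandbox is forced to route through; the surrogate values, the mapping, and the function name are made up for illustration:

    # Hypothetical egress-proxy hook (Python): the agent inside the sandbox only
    # ever sees surrogate tokens; real credentials are substituted at the boundary.
    SURROGATE_TO_REAL = {
        "surrogate-llm-key-123": "sk-real-key-kept-outside-the-sandbox",
        "surrogate-gh-token-456": "ghp-real-token-kept-outside-the-sandbox",
    }

    def rewrite_outbound_headers(headers: dict[str, str]) -> dict[str, str]:
        """Swap surrogate credentials for real ones on outbound requests."""
        rewritten = {}
        for name, value in headers.items():
            for surrogate, real in SURROGATE_TO_REAL.items():
                value = value.replace(surrogate, real)
            rewritten[name] = value
        return rewritten

    # The agent sends "Bearer surrogate-llm-key-123"; the proxy forwards the real key.
    print(rewrite_outbound_headers({"Authorization": "Bearer surrogate-llm-key-123"}))

The nice property is that a leaked sandbox environment or transcript only ever contains surrogates.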
[1] https://airut.org [2] https://haulos.com/blog/building-agents-over-email/
Willy from Twill here.
I love the idea of emailing agents like we email humans! Thank you for sharing your learnings:
1. Network constraints vary quite a bit from one enterprise customer to another, so right now this is something we handle on a case-by-case basis with them.
2. We came to the same conclusion. For sensitive credentials like LLM API keys, we generate ephemeral keys so the real keys never touch the sandbox.
3. Totally right, we support constrained tasks too (ask mode, automated CI fixes). We've gone back and forth on whether to go vertical-first or stay generic. We're still figuring out where the sweet spot is. The constrained tasks are more reliable today, but the open-ended ones are where teams get the most leverage.
So instead of using my Claude Code subscription, I can pay the vastly higher API rates to you so you can run Claude Code for me?
Anthropic recently killed the ability for third parties to use the Claude Code subscription, and it's assumed they're subsidising that price heavily. Which is fine, but it's a good reminder of the vendor lock-in risk. One policy change and your workflow breaks. Twill is agent-agnostic (Claude Code, Codex CLI, OpenCode), so you're not betting on any single vendor's pricing decisions.
On the cost for solo devs, yeah, if you're one person running one agent at a time on your laptop, the sub is probably the better deal today. No argument there. The cloud agent model starts to make sense when you want to fire off multiple tasks in parallel.
Not sure if you've seen it yourself, but Claude Code can kick off parallel agents working in their own worktrees natively now. I do it all the time.
Yes, the difference is that Twill launches dedicated infra in each task's sandbox. This means you can work on multiple tasks that each require a DB migration, for instance.
Also, you can fire and forget tasks (my favorite) and don't have to keep your laptop running at night.
The agent-agnostic approach is interesting, but I think the bigger architectural question is what happens when you move beyond code generation into domains where the agent's output has real-world consequences — booking a flight, executing a trade, dispensing medical advice.
For code, the worst case is a bad PR that gets caught in review. For domain-specific agents handling real transactions, you need a fundamentally different trust model. The LLM can't be making the decisions — it needs to be constrained to intent parsing while deterministic logic handles execution. Sandboxing the runtime (what you're doing) is necessary but not sufficient. You also need to sandbox the decision space.
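Concretely, I mean something like this. A toy sketch in Python where the intent schema, the allowed airports, and the price cap are all made up; the point is only that the model fills in a structured intent and deterministic code decides whether to act on it:

    # Toy sketch: the LLM is confined to intent parsing; validation and execution
    # are deterministic. All names and limits are hypothetical.
    from dataclasses import dataclass

    ALLOWED_AIRPORTS = {"SFO", "JFK", "LHR"}
    MAX_PRICE_USD = 1500

    @dataclass
    class FlightIntent:
        origin: str
        destination: str
        max_price_usd: int

    def parse_intent(user_message: str) -> FlightIntent:
        # Stand-in for an LLM call that is only allowed to return structured output.
        return FlightIntent(origin="SFO", destination="JFK", max_price_usd=900)

    def validate(intent: FlightIntent) -> None:
        # Deterministic guardrails the model cannot negotiate around.
        if {intent.origin, intent.destination} - ALLOWED_AIRPORTS:
            raise ValueError("airport outside the allowed set")
        if intent.max_price_usd > MAX_PRICE_USD:
            raise ValueError("price cap exceeded")

    def book(intent: FlightIntent) -> None:
        # Deterministic side effect, reached only after validation passes.
        print(f"booking {intent.origin} -> {intent.destination} under ${intent.max_price_usd}")

    intent = parse_intent("Get me to New York next week, under $1000")
    validate(intent)
    book(intent)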
Curious whether you've seen demand for non-SWE agent workloads, or if the "prompt to PR" pattern is where most of the traction is right now.
We’re focused on SWE use cases. Code is nice because there’s already a built-in verification loop: diffs, tests, CI, review, rollback. But you do quickly get to a state where the agent needs to take a risky action (a DB migration, or an infra operation). This is where the permissions features from the agents are handy: allowlists, automode, etc. So you approve/reject only the high-risk actions. And I think this risk model is valid for both technical and non-technical use cases.
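Roughly, the shape of that gate is something like the sketch below (the patterns are illustrative, not our actual allowlist): clearly safe commands auto-run, anything that looks destructive is escalated, and the default is to ask.

    # Illustrative risk gate in Python: auto-run safe commands, escalate risky ones.
    import re

    AUTO_ALLOW = [r"^git (status|diff|log)\b", r"^pytest\b", r"^npm test\b"]
    HIGH_RISK = [r"\bdrop\s+table\b", r"^terraform apply\b", r"\bmigrate\b", r"\brm -rf\b"]

    def requires_approval(command: str) -> bool:
        cmd = command.lower()
        if any(re.search(p, cmd) for p in HIGH_RISK):
            return True  # always escalate known-dangerous actions
        return not any(re.search(p, cmd) for p in AUTO_ALLOW)  # unknown -> ask

    for cmd in ["pytest -q", "python manage.py migrate", "git status"]:
        print(cmd, "->", "needs approval" if requires_approval(cmd) else "auto-run")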
Congrats on the launch, the agentbox-sdk looks interesting, but seeing as the first commit was 3 days ago, I feel a little wary about using it just yet!
One question, do you have plans for any other forms of sandboxing that are a little more "lightweight"?
Also, how do you add more agent types? Do you support just ACP?
Thank you! agentbox-sdk is indeed very recent, so it is not stable just yet!
For the lightweight sandbox, can you give an example?
Currently we support the main coding CLIs; ACP support has not shipped yet.
24/7 running coding agents are pretty clearly the direction the industry is going now. I think we'll need either on-premises or cloud solutions, since obviously if you need an agent to run 24/7 then it can't live on your laptop.
Obviously cloud is better for making money, and some kind of VPC or local cloud solution is best for enterprise, but perhaps for individual devs, a self-hosted system on a home desktop computer running 24/7 (hybrid desktop / server) would be the best solution?
> 24/7 running coding agents are pretty clearly the direction the industry is going now.
This assertion needs some support for those of us that don't have a macro insight into the industry. Are you seeing this from within FAANG shops? As a solo developer? What? Honest question.
I'm speaking from my daily experience. Sometimes I don't want to close my laptop before going to bed because there are still 1-2 tasks ongoing in my AI kanban board, so I just leave my laptop open (locked but not suspended) so that the agents keep working for a while. I don't even have things all that automated.
I anticipate that once I have some more complex agentic scaffolds set up to do things like automatically explore promising directions for the project, then leaving the AI system on overnight becomes a necessity.
For a solo dev running one task at a time, a beefy desktop overnight is totally viable. We see a lot of this with the Mac Mini hype.
Cloud starts to matter when you want to (a) run a swarm of agents on multiple independent tasks in parallel, (b) share agents across a team, or (c) not worry about keeping a machine online.
I would point out that a beefy desktop is probably faster at compiling code than a typical cloud instance simply due to more CPU performance. So maybe up to 10-ish concurrent agents it's faster to use a local desktop than a cloud instance, and then you start to get into the territory where multiple agents are compiling code at the same time, and the cloud setup starts to win. (That's assuming the codebase takes a while to compile and pegs your CPU at 100% while doing so. If the codebase is faster to compile or uses fewer threads, then the breakeven agent count is even higher.)
Other than that, I agree with what you said. I don't know what the tradeoffs for local on-premises and cloud agents are in terms of other areas like convenience, but I do think that scalability in the cloud is a big advantage.
Totally right on the compile time. CI pipelines have the same bottleneck, and the ecosystem is working on fixing this (faster CPUs, better caching) in both coding agents and CI to improve overall velocity.
How does this compare to something like Cursor Cloud Agents with a solid set of skills and tools?
Does it support running Docker images inside the sandbox?
Yes, for instance Twill runs a local Postgres and Redis directly in the sandbox using docker compose when working on our own codebase.
This is what enables Twill to self-verify its work before opening a PR.
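As a rough sketch (placeholders, not our exact pipeline), the self-verification loop is: bring up the task's services, run the checks, and only open the PR if they pass.

    # Rough sketch in Python of a self-verification gate before opening a PR.
    # The compose services, test command, and PR step are placeholders.
    import subprocess

    def self_verify_and_open_pr(branch: str) -> bool:
        subprocess.run(["docker", "compose", "up", "-d"], check=True)  # per-task infra
        try:
            checks = subprocess.run(["pytest", "-q"])  # run the project's checks
            if checks.returncode != 0:
                return False  # failing work never becomes a PR
            subprocess.run(["gh", "pr", "create", "--head", branch, "--fill"], check=True)
            return True
        finally:
            subprocess.run(["docker", "compose", "down", "-v"], check=True)  # clean up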
How does this compare to Claude Managed Agents?
Claude Managed Agents is a general-purpose hosted runtime for Claude, while Twill focuses on SWE tasks.
So the SWE workflow is pre-built (research, planning, verification, PR, proof of work). Twill is also agnostic to the agent, so you can use Codex, for instance. Additionally, you have more flexibility on sandbox sizing with Twill.
> Run the same agent n times to increase success rate.
Are there benchmarks out there that back this claim?
Yes, this is the pass@k metric from code-generation research. The relevant paper is Evaluating Large Language Models Trained on Code (Chen et al., 2021), which introduced the metric.
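For reference, the estimator is tiny; this is essentially the numpy snippet given in the paper, where n is the number of runs and c the number that succeeded:

    import numpy as np

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of the chance that at least 1 of k sampled runs passes."""
        if n - c < k:
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # e.g. 10 runs of the same task, 3 passed: pass@5 ~= 0.92
    print(pass_at_k(n=10, c=3, k=5))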
Interesting, and how does Twill use it in that feature?