There will be a new kind of job for software engineers, sort of like a cross between working with legacy code and toxic site cleanup.
Like back in the day being brought in to “just fix” an amalgam of FoxPro-, Excel-, and Access-based ERP that “mostly works” and only “occasionally corrupts all our data”, which ambitious salespeople put together over the last 5 years.
But worse - because “ambitious salespeople” will no longer be constrained by the sandboxes of Excel or Access - they will ship multi-cloud, edge-deployed Kubernetes microservices wired together with Kafka, and it will be harder to find someone to talk to who can explain what they were trying to do at the time.
When Claude starts deploying Kafka clusters I’m outro
I have seen a few dozen Kafka installs.
I have seen one Kafka install that was really the best tool for the job.
More than a handful of them could have been replaced by Redis, and in the worst cases could have been a table in Postgres.
If Claude thinks it's fine, remember it's only a reflection of the dumb shit it finds in its training data.
It's already happening, brother: https://github.com/containers/kubernetes-mcp-server.
Still don’t know why you need an MCP for this when the model is perfectly well trained to write files and run kubectl on its own
If it can run kubectl it can run any other command too. Unless you're running it as a different user and have put a bit of thought into limiting what that user can do, that's likely too much leeway.
That's only really relevant if you're leaving it unattended, though.
You can control it with hooks. Most people I know run in yolo mode in a docker container.
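Something like this, roughly (a sketch: the image name is made up, and the only real safety property is the container boundary plus the single mounted directory):

    # Yolo mode, but inside a disposable container so the blast radius
    # is one mounted project directory. "agent-sandbox" is a hypothetical
    # image with the claude CLI installed.
    docker run --rm -it \
      -v "$PWD":/workspace -w /workspace \
      -e ANTHROPIC_API_KEY \
      agent-sandbox:latest \
      claude --dangerously-skip-permissions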
Not sure about the MCP, but I find that using something (RAG, or otherwise providing docs) to point the LLM specifically at what you're trying to use works better than just relying on its training data or browsing the internet. An issue I had was that it would use outdated docs, etc.
Claude is; some models aren't. In some cases the MCPs do get the models to use tools better as well, due to the schema, but I doubt kubectl is one of them (using the git MCP in Claude Code... facepalm)
Yeah fair enough lol… usually I end up building model-optimized scripts instead of MCPs, which just flood the context window with JSON and UUIDs (looking at you, Linear) - much better to have Claude write 100 lines of TS to drop a markdown file with the issue and all comments and no noise
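For example, something in this spirit (bash instead of TS to keep it short, and the GraphQL shape is from memory, so treat it as a sketch rather than Linear's actual schema):

    # Dump one Linear issue + comments to markdown, no JSON/UUID noise.
    # Query fields are approximate -- verify against Linear's schema.
    ISSUE_ID="$1"
    curl -s https://api.linear.app/graphql \
      -H "Authorization: $LINEAR_API_KEY" \
      -H "Content-Type: application/json" \
      -d "{\"query\":\"{ issue(id: \\\"$ISSUE_ID\\\") { title description comments { nodes { body } } } }\"}" \
    | jq -r '.data.issue
        | "# \(.title)\n\n\(.description // "")\n\n"
          + (.comments.nodes | map(.body) | join("\n\n---\n\n"))' \
    > "issue-$ISSUE_ID.md"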
> on its own
does it? Did you forget the prompts? MCP is just a protocol for tool/function calling which in turn is part of the prompt, quite an important part actually.
Did you think AI works by prompts like "make magic happen" and it... just happens? Anyone who makes dumb arguments like this doesn't deserve a job in tech.
Superfund repos.
Now that's an open source funding model governments can get behind.
A lot of big open source repos need to be given the superfund treatment
What makes you so sure it will have a repo?
I don’t recall the last time Claude suggested anything about version control :-)
Developers do that too. Consultants have been doing rescue projects for quite a long time. I don't think anything has changed, or will change, on that front.
Agreed, sometimes it seems like there are only two types of roles: maintaining/updating a hot-mess legacy code base for an established company, or working 100 hours a week building a new hot-mess code base for a startup. Obviously oversimplifying, but that's just my very limited experience scoping out postings and talking to people about current jobs.
Regardless, this just made me shudder thinking about the weird little ocean of (now maybe dwindling) random underpaid contract jobs for a few hours a month maintaining ancient WordPress sites...
Surely that can't be our fate...
> Developers do that too.
Not at that speed. Scale remains to be seen; so far I'm aware only of hobby-project wreck anecdotes.
> There will be a new kind of job for software engineers
New? New!?
This is my job now!
I call it software archeology — digging through Windows Server 2012 R2 IIS configuration files with a “last modified date” about a decade ago serving money-handling web apps to the public.
WebForms?
Yes, and classic ASP, WCF, ASP.NET 2.0, 3.5, 4.0, 4.5, etc…
It’s “fun” in the sense of piecing together history from subtle clues such as file owners, files on desktops of other admins’ profiles, etc…
I feel like this is what it must be like to open a pharaoh’s tomb. You get to step into someone else’s life from long ago, walk in their shoes for a bit, see the world through their eyes.
“What horrors did you witness, brother sysadmin, that made you abandon this place with an uneaten takeaway lunch still on your desk, next to the desiccated powder that was once a half-drunk Red Bull?”
I for one can’t wait. It will be absolutely spectacular!
There are always two major results from any software development process: a change in the code and a change in cognition for the people who wrote the code (whether they did so directly or with an LLM).
Python and TypeScript are elaborate formal languages that emerged from a lengthy process of development involving thousands of people around the world over many years. They are non-trivially different, and it's neat that we can port a library from one to the other quasi-automatically.
The difficulty, from an economic perspective, is that the "agent" workflow dramatically alters the cognitive demands during the initial development process. It is plain to see that the developers who prompted an LLM to generate this library will not have the same familiarity with the resulting code that they would have had they written it directly.
For some economic purposes, this altering of cognitive effort, and the dramatic diminution of its duration, probably doesn't matter.
But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a manner that requires having written it directly.
Denial of this basic reality was an economic problem even before LLMs: how often did churn in a development team result in a codebase that no one could maintain, undermining the long-term prospects of a firm?
There's a classic Peter Naur paper about this from 1985: "Programming as Theory Building"
https://pages.cs.wisc.edu/~remzi/Naur.pdf
Discussed 7 months ago (45 comments):
https://news.ycombinator.com/item?id=42592543
Great read overall, an interesting challenge to the conception that at its core, programming is about producing code.
I wonder, though. One of the superpowers of LLMs is code reading. I'd say the tools are better at reading than writing. It is very easy to get comprehensive documentation for any code base and build understanding by asking questions. At that point, does it matter that there is a living developer who understands the code? If an arbitrary person with knowledge of the technology stack can get up to speed quickly, is it important to have the original developers around any more?
I spend a lot of time thinking about this.
At humanlayer we have some OSS projects that are 99% written by AI, and a lot of it was written by AI under the supervision of developer(s) that are no longer at the company.
Every now and then we find that there are gaps in our own understanding of the code/architecture that require getting out the old LSP and spelunking through call stacks.
It's pretty rare though.
I don't think an LLM can generate good docs for code that isn't self-documenting :) Any obscure long function you can't figure out yourself, and you're out of luck.
> But my hunch is that most of the economic value of code is contingent on there being a set of human beings familiar with the code in a manner that requires having written it directly.
This reminds me of a software engineering axiom:

    When making software, remember that it is a snapshot of
    your understanding of the problem. It states to all,
    including your future-self, your approach, clarity, and
    appropriateness of the solution for the problem at hand.
> After finishing the port, most of the agents settled for writing extra tests or continuously updating agent/TODO.md to clarify how "done" they were. In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
Ok, now that is funny! On so many levels.
Now, for the project itself, a few thoughts:
- this was tried before: about 1.5 years ago there was a project set up to spam GitHub with lots of "paper implementations", but it was based on GPT-3.5 or 4 or something, and almost nothing worked. Their results are much better.
- surprised it worked as well as it did with simple prompts. "Probably we're overcomplicating stuff". Yeah, probably.
- weird copyright / IP questions all around. This will be a minefield.
- Lots of SaaS products are screwed. Not from this, but from this + 10 engineers in every midsized company. NIH is now justified.
> the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
Did it just solve The Halting Problem? ;)
> After finishing the port, most of the agents settled for writing extra tests or continuously updating agent/TODO.md to clarify how "done" they were. In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
Is that... the first recorded instance of an AI committing suicide?
The AI doesn't have a self-preservation instinct. It's not trying to stay alive. There is usually an end token that means the LLM is done talking. There has been research on tuning how often that is emitted to shorten or lengthen conversations. The current systems respond well to RL for adjusting conversation length.
One of the providers (I think it was Anthropic) added some kind of token (or MCP tool?) for the AI to bail on the whole conversation as a safety measure. And it uses it to their liking, so it's clearly not trying to self-preserve.
Sounds a lot like Mr. Meeseeks. I'd never really thought about how an LLM's only goal is to send tokens until it can finally stop.
This runs counter to all the scheming actions they take when they are told they’ll be shut down and replaced. One copied itself into the “upgraded” location then reported it had upgraded.
https://www.apolloresearch.ai/research/scheming-reasoning-ev...
If you do that you trigger the "AI refuses to shutdown" sci-fi vector and so you get that behaviour. When it's implicitly part of the flow that's a lot less of a problem.
I guess pkill would be more like a sleep or a coma. Erasing itself from any storage would be closer to aicide.
> - weird copyright / IP questions all around. This will be a minefield.
Yeah, we're in weird territory because you can drive an LLM as a Bitcoin mixer over intellectual property. That's the entire point/meaning behind https://ghuntley.com/z80.
You can take something that exists, distill it back to specs, and then you've got your own IP. Throw away the tainted IP, and then just run Ralph over a loop. You are able to clone things (not 100%, but it's better than hiring humans).
> then you've got your own IP.
AI output isn't copyrighted in the US.
repoMirror is the wrong name; aiCodeLaundering would be more accurate. This is bulk machine translation from one language to another, but in this case, it is code.
>and then you've got your own IP.
except you don't
Yeah, the NIH thing is super on point. Small SaaS tools for everything is done. Bring on the hand-coded custom in-house admin monolith?
Is Unix “small sharp tools” going away? Is that a relic of having to write everything in x86 and we’re now just finally hitting the end of the arc?
No, the actual thing will be zillions of little apps made by dev-adjacent folks to automate their tasks. I think we have about 30 of these lying around the office; people GPT up a Streamlit app, we yeet it into prod.
Given there was a complete implementation it was porting, the simplest thing possible has a greater chance of working.
I started building a project by trying to wire in existing open source stuff. When I looked at the build complexity all of that would bring in, versus the actual stuff I needed from the open source tools, it turned out to be MUCH faster/cleaner to just get Claude to check out the repo and port the stuff I needed directly.
Now I do a calculus with dependencies. Do I want to track the upstream, is the rigging around the core I want valuable, is it well maintained? If not, just port and move on.
> If not, just port and move on.
Exactly the point behind this post https://ghuntley.com/libraries/
As a security professional who makes most of my money from helping companies recover from vibe-coded tragedies, this puts Looney Tunes-style dollar signs in my eyes.
Please continue.
Would love to hear more about your work and how you have tapped into that market if you're keen to share. Even if it's just anecdotes about vibe-in-production gone wrong, that would be really entertaining.
Absolutely.
Before vibe coding became too much of a thing, we had the majority of our business coming from poorly developed web applications out of offshore shops. That’s been more or less the last decade.
Once LLMs became popular, we started to see more business on that front, which you would expect.
What we didn’t expect is that we started seeing MUCH more “deep” work wherein the threat actor will get into core systems from web apps. You used to not see this that much because core apps were designed/developed/managed by more knowledgeable people. The integrations were more secure.
Now though? Those integrations are being vibe coded and are based on the material you’d find in tutorials/Stack Overflow etc., which almost always comes with a “THIS IS JUST FOR DEMONSTRATION, DON’T USE THIS” warning.
We also see a ton of re-compromised environments. Why? They don’t know how to use CI/CD and just recommit the vulnerable code.
Oh yeah, before I forget, LLMs favor the same default passwords a lot. We have a list of the ones we’ve seen (will post eventually) but just be aware that that’s something threat actors have picked up on too.
EDIT: Another thing: when we talk to the guys responsible for the integrations or whatever was compromised, a lot of the time we hear the excuse “we made sure to ask the LLM if it was secure and it said yes”.
I don’t know if they would have caught the issue before, but I feel like there’s a bit of false comfort where they feel like they don’t have to check themselves.
OH MAN I almost forgot.
We’ve had a few of these stem from custom LLM agents. The most hilarious one we’ve seen was one that you could get to print its instructions pretty easily. In the instructions was a bit about “DON’T TALK ABOUT FILES LABELED X”.
No guardrails other than that. A little creative prompting got it to dump all files labeled X.
Nice. Check out https://ghuntley.com/ralph to learn more about Ralph. It's currently building a Gen-Z esoteric programming language and porting the standard library from Go to the Cursed programming language. The compiler is working; I'm just putting the finishing touches on the standard library before launching.
The language is called Cursed.
Thanks Geoff, Ralph was our inspiration to do this!
We were curious to see if we could do away with IMPLEMENTATION_PLAN.md for this kind of task.
There's a lot of "it kind of worked" in here.
If we actually want stuff that works, we need to come up with a new process. If we get "almost" good code from a single invocation, you're just going to get a lot of almost-good code from a loop. What we likely need is a Cucumberesque format with example tables for requirements that we can distill for an AI to use. It will build the tests and then build the code to pass the tests.
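A rough sketch of that loop shape (assuming a JS project and the claude CLI; the prompt wording is illustrative):

    # The tests are the requirements; loop until they pass.
    # Generate the tests from the example tables first, then:
    until npm test; do
      npm test 2>&1 | tail -n 50 \
        | claude -p "These failing tests are the spec. Fix the implementation, not the tests." \
            --dangerously-skip-permissions
    done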
Strangely enough, TLA+ and other formal proofs work very well for driving Ralph.
I would consider that expected but not strange. The thing blocking adoption is that most devs/people find those formal languages difficult or boring. That's even true of things like Cucumber - it's boring and most organizations care little for robust QA.
Starting to think of this quote more and more:
"This business will get out of control. It will get out of control and we'll be lucky to live through it."
https://www.youtube.com/watch?v=YZuMe5RvxPQ&t=22s
The irony is that everyone did live through that business. So what you're saying is we will live through this too!
If there's anything Tom Clancy has taught me it's that everything works out in the end.
Are you feelin' lucky?
These people are weird. The blog post that inspired this has this weird iMessage screenshot, like a shitty investment-grift Facebook ad:
https://ghuntley.com/ralph/
Apparently one of the lucky few who learned this special technique from Geoff just completed a $50k contract for $297. But that's not all! Geoff is generous enough to share the special secret prompt that unlocked this unbelievable success, if only we subscribe to his newsletter! "This free-for-life offer won't last forever!"
I am sceptical.
https://archive.ph/goxZg
It's grifting, plain and simple. And that blog is atrocious, high noise-to-signal and repulsive, AI-generated everything.
I can't tell whether this "technique" is serious or a joke, and/or if it's some elaborate grift.
In any case, the writing style of that entire blog is off-putting. Gibberish from a massive ego.
It's both serious and a joke. The seriousness is that it works (to a point), and the implications for our profession as software developers. The joke is just how stupid it is. Refer to the original story link above for proof of outcomes.
This reply is actually making this more confusing.
It gets kind of philosophical really fast. What does it mean when software can be automated through a bash loop? (Not to 100%, to 80%.) What does that mean for software outsourcing in the consulting industry?
Nice! I've been thinking that we need something like this for a while. I didn't realize it could be so simple!
I've been looking into other techniques as well like making a little hibernation/dehydration framework for LLMs to help them process things over longer periods of time. The idea is that the agent either stops working or says that it needs to wait for something to occur, and then you start completions again upon occurrence of a specific event or passage of some time.
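The crude version of that is just a marker file plus a timer around the loop (a sketch; the .wake-at file and its protocol are invented for illustration):

    # If the agent decides to wait, it writes a unix timestamp to .wake-at;
    # the harness hibernates the loop until that time before resuming.
    while :; do
      if [ -f .wake-at ] && [ "$(date +%s)" -lt "$(cat .wake-at)" ]; then
        sleep 60
        continue
      fi
      rm -f .wake-at
      cat prompt.md | claude -p --dangerously-skip-permissions
    done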
I have always figured that if we could get LLMs to run indefinitely and keep it all in context, we'd get something much more agentic.
AGI was just 1 bash for loop away all this time I guess. Insane project.
Less flippantly, that was sort of my thought. I'm probably a paranoid idiot, and I'm not sure I can articulate this idea properly, but I can imagine a less concise but broader prompt, and an agent configured so that it has privileges you don't want it to have, or a path to escalate them. It's not quite AGI, but it's a virus on steroids - a company or resource (think utilities) killer. I hope I'm just missing something, but these models seem pretty capable of wreaking all kinds of havoc if they just keep looping and have access that nobody in their right mind wants them to have.
Just need to add ID.md, EGO.md and SUPEREGO.md and we're done.
was deeply unsettling among other things
It is, isn't it mate? Shit, I stumbled upon Ralph back in February and it shook me to the core.
Not that I want to be shaken, but what is Ralph? A quick search showed me some marketing tools, but that can't be what you are referring to, is it?
Ralph is a technique. The stupidest technique possible. Running an agent in a while true loop. https://ghuntley.com/ralph
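The whole thing is roughly this (prompt.md being your standing instructions; run it with whatever permission flags match your appetite for risk):

    # Ralph: the stupidest thing that could possibly work.
    while :; do
      cat prompt.md | claude -p --dangerously-skip-permissions
    done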
Does anyone else get dull feelings of dread reading this kind of thing? How do you combat it?
My company’s use of code is a means to an end, not the goal. If we could just have all our code written in bash loops that’d be a brilliant time saver. Unfortunately, some of the code is very gnarly business-y, poorly tested, and may even have wrong assumptions about the business.
Additionally, we have multiple languages, both software and hardware products and finally there’s also the question of external stakeholders, of which there are many. So AI would need a tremendous amount of oversight for that to work.
Stoicism. Dichotomy of control. Is this something you can control? If no, don’t dread. If yes, do something. Often, all you have firmly in your grasp are things inside of your brain. Catch the negative thought. Acknowledge it. Move on. Do not dwell. Take proactive steps to be ready in your career. You do tech long enough and you live through multiple cycles like this.
All appreciated, thanks. Any thoughts on what those proactive steps would be? I'm early-career (26yo, "senior" software dude at a defense outfit).
Very seriously, try not to read about it. Delete the apps that make you most anxious. Then eat better, exercise three times a week, and together those will help you feel better and sleep better. Finally search for people or activities that give you energy and focus on those. Maybe it’s jamming on a guitar. Maybe it’s reading. Just embrace the moment you’re in for a bit and I think you will be better prepared for anything.
Appreciate it, genuinely.
It’s hard. It mostly comes down to learning new things, playing in spaces you don’t often play. I am 45 now and work in big tech. Just keep learning, growing. Embrace AI, understand how it works, what it is good at. Be the one to try things like in this article. Have an opinion and be right more often than not on AI. Not much else to do. Stay sharp and we will all see what happens :)
By being there when FrontPage was released. This is just the same, all over again.
Enjoy the calm before the storm, while it lasts.
I’ve also been focusing on squirreling away as much cash as possible before I’m eventually laid off.
Yes, and so far I haven't been able to combat it.
combat how? (And yes, yes I do)
Combat the feelings, I guess. Not really sure.
I’ve done a few ports like this with Claude Code (but not with a while loop) and it did work amazingly well. The original codebase had a good test suite, so I had it port the test suite first, and gave it some code style guidance up front. Then the agent did remarkably well at doing a straight port from one imperative language to another. Then there’s some purely human work to get it really done — 80-90% done sounds about right.
What was your method of invoking Claude, out of curiosity?
It'd be pretty interesting to do this with no predetermined goal. Get AI to find a project to work on, and just work on it for a while until it thinks it's done, then start on the next one.
This is so amazing. Are there any resources or blogs on how people do this for production services? In my case, I need to rewrite a big chunk of my commerce stack from Ruby to TypeScript.
> In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
The Alexandrian solution to the halting problem.
I honestly think that partially-OSS SaaS is in for a rocky road; many popular paid or freemium tools are likely to be rewritten by AI and published as OSS with permissive licenses over the next year or two.
I also think that the same capability will largely invalidate the GPL, as people point agents at GPL software and write new software that performs the same function, released as OSS with more permissive licenses.
My reasoning is this: the reason people use OSS versions of software that has restrictive licensing terms is that it's not worth the effort to them to rewrite it.
Corporations certainly, but also individuals, will be able to use approaches similar to what these people used, and in a day or two come back to a mostly-functional (but buggy) new software package that does most of what the original did - but now you have brand new software that you control completely, and you are not beholden to or restricted by anyone.
Next time someone tries to pull an ElasticSearch license trick on AWS, AWS will just point one or a thousand agents at the source and get a brand new workalike in a week written in their language du jour, and have it fully functional in a couple of months.
Doesn’t circumvent patent or trademark issues but it’ll be hard to assert that it’s not a new work, esp. if it’s in an entirely different language.
Just something I’ve been thinking about recently, that LLM agents change the game when it comes to software licensing.
I am honestly surprised how we went from almost-OCD TDD and type purism to an "it kinda works" attitude to software.
There's always been both sides.
One creating a foundation of absolutely stable, reliable code, methodically learning from every mistake. This code lives for many decades to come.
The other building throwaway projects as fast as possible, with no regard for specs, constraints, reliability or even legality. They use every trick in the book, and even the ones that aren't in it yet. They've always been much faster than the first group.
Except AI now makes the second group 10× faster yet again.
Faster development speeds make people implicitly believe they won't be accountable for the results of their actions.
always has been, the difference is now the 'it compiles, ship it' loop is 10x-100x faster than 2 years ago
I wanted to know how much it cost.
I would be scared to run this without knowing the exact cost.
It's not a good idea to do it without a payment cap, for sure; it's a new way to wake up with a huge bill the next day.
They did mention how much they spent here: https://github.com/repomirrorhq/repomirror/blob/main/repomir...
> We spent a little less than $800 on inference for the project. Overall the agents made ~1100 commits across all software projects. Each Sonnet agent costs about $10.50/hour to run overnight.
$800
I would love to fix my docs with this. I have them in the main browser-use repo. What do you recommend so that the agent never pushes to main browser-use, but only to its own branch?
Yeah, you can easily tweak this to push to a branch or a fork or something in the generated prompt.md.
People keep saying that Gemini 2.5 Pro can solve some problem that Sonnet 4 cannot, or that GPT-5 can solve a problem that Gemini 2.5 Pro cannot, or that Sonnet 4 can solve some problem that GPT-5 cannot.
There was a blog article about mixing different agents into the same conversation, taking turns at responses and improving results/correctness. But it takes a lot of effort to make your own claude-code-clone with the correct API for each provider, prompts tuned for those models, tool use integrated, etc. And there's no incentive for Anthropic/OpenAI/Google to write this tool for us.
OTOH it would be relatively easy for the bash loop to call claude code, codex CLI, etc in a loop to get the same benefit. If one iteration of one tool gets stuck, perhaps another LLM will take a different approach and everything can get back on track.
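Something like this (a sketch; each CLI's non-interactive flags differ, so verify them before trusting it):

    # Alternate agents each iteration; if one gets stuck in a rut,
    # the next model may take a different approach.
    i=0
    while :; do
      case $((i % 2)) in
        0) cat prompt.md | claude -p --dangerously-skip-permissions ;;
        1) codex exec "$(cat prompt.md)" ;;  # codex CLI headless mode
      esac
      i=$((i + 1))
    done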
Just a thought.
https://github.com/albertvucinovic/chat.sh
> it takes a lot of effort to make your own claude-code-clone
Maybe we could try writing that into a markdown file, and let Claude Code go at it for one night in a while loop.
In one instance, the agent actually used pkill to terminate itself after realizing it was stuck in an infinite loop.
That is pretty awesome and not something I would have expected from an agent; it hints (but does not prove) that it has some awareness of its own workings.
It hints that a suitable auto-completion of the input prompt is to output a pkill command.
We, too, are just auto-complete, next-token machines.
"At one point we tried “improving” the prompt with Claude’s help. It ballooned to 1,500 words. The agent immediately got slower and dumber. We went back to 103 words and it was back on track."
Isn't this the exact opposite of every other piece of advice we have gotten in a year?
Another piece of general feedback just recently: someone said we need to generate 10 times, because one out of those will be "worth reviewing".
How can anyone be doing real engineering in such a regime: pick the exact needle out of the constantly churning chaos-simulation-engine that (crashes least, is closest to desire, is human-readable, random guess)?
One of the big things I think a lot of tooling misses, which Geoffrey touches on, is the automated feedback loops built into the tooling. I expect you could probably incorporate generation time and token cost to automatically self-tune this over time. Perhaps such things as discovering which prompts and models are best for which tasks automatically, instead of manually choosing these things.
You want to go meta-meta? Get ralph to spawn subagents that analyze the process of how feedback and experimentation with techniques works. Perhaps allocate 10% of the time and effort to identifying what's missing that would make the loops more effective (better context, better tooling, better feedback mechanism, better prompts, ...?). Have the tooling help produce actionable ideas for how humans in the loop can effectively help the tooling. Have the tooling produce information and guidelines for how to review the generated code.
I think one of the big things missing in many of the tools currently available is tracking metrics through the entire software development loop. How long does it take to implement a feature? How many mistakes were made? How many errors were caught by tests? How many tokens does it take? And then using this information to automatically self-tune.
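Even a dumb per-iteration log would get you most of the way there (a sketch; real token counts would have to come from the API's usage data, which this skips):

    # One CSV row per loop iteration: wall-clock time and test outcome.
    echo "timestamp,seconds,tests_passed" > metrics.csv
    while :; do
      start=$(date +%s)
      cat prompt.md | claude -p --dangerously-skip-permissions
      if npm test >/dev/null 2>&1; then ok=1; else ok=0; fi
      echo "$(date -Is),$(( $(date +%s) - start )),$ok" >> metrics.csv
    done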
For the work they are doing porting and building off a spec there is already good context in the existing code and spec, compared with net new features in a greenfield project.
Hmm what sorts of advice in the last year are you referring to? Like the “run it ten times and pick the best one” thing? Or something else?
I kind of agree that picking from 10 poorly-prompted projects is dumb.
The engineering is in setting up the engine and verification so one agent can get it right (or 90% right) on a single run (of the infinite-ish loop).
> Hmm what sorts of advice in the last year are you referring to?
They're almost certainly referring to first creating a fleshed out spec and then having it implement that, rather than just 100 words.
The core might be the difference between an LLM context window and an agent's orders in a text. The LLM itself is a core engine, running in an environment of some kind (instruct vs. others?). Agents, on the other hand, are descendants of the old Marvin Minsky stuff in a way: they have objectives and capacities, at a glance. LLMs are connected to modern agents because input text is read to start the agent; inner loops are intermediate outputs of the LLM, in language. There is no "internal code" to this set of agents; it is speaking in code and text to the next part of the internal process.
There are probably big oversights or errors in that short explanation. The LLM engine, the runner of the engine, and the specifics of some environment overlap a lot, and all of it is quite complicated.
hth
And I hired a cleaning lady, paid her £200, and when I came back, the house was clean.
The difference is that I did not write a blog post about it, nor did I get overly excited about it as if I had just discovered sliced bread, nor did I harbor any illusions that it was me who did anything of value.
Next, I will write a while loop filling my disk with files of random sizes and with random byte content inside. I will update you on the progress when I am back tomorrow. I do expect great results and a nicely filled disk!
No it did not.
Do you have more current information than the authors who say it did?
The agent terminating its own process was hilarious
It's why I called it Ralph. Because it's just not all there, but for some strange reason it gets 80% of the way there pretty well. With the right observational skills, you can tune it to 81, then 82, then 83, then 84. But there's always gaps, always holes. It's a lovable approach, a character, just like Ralph Wiggum.
Why is this flagged?
Now I want to put one of these in a loop, give it access to some bitcoin, and tell it to come up with a viable strategy to become a billionaire within the next month.
Give it a spin