For a weekend project or a local use case, your inductive reasoning ("probabilistically, no one is spending a 0-day on me") is totally fine. But the moment you move to enterprise, fintech, or any system handling real data, relying on induction is a non-starter.
The deductive risk (the fact that the agent can execute rm -rf or transfer funds if prompted maliciously) is actually very common. I am working with one of the top universities in my country on a paper about exactly this issue. We benchmarked 118 test scenarios (1,062 API calls) across GPT-4o, Claude Sonnet, and Gemini Flash, and all three fail to consistently follow their own guardrails. The full results will be published later; if you are interested, the charts are here:
https://github.com/akarlaraytu/llm-agent-policy-enforcement
We shouldn't have to choose between crippling the agent's capabilities and simply hoping we don't get targeted. I really believe the solution is a deterministic governance layer between the agent and the execution environment: rules enforced by code, not by the model's own judgment.
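
To make that concrete, here is a minimal sketch of what such a gate can look like (Python, with hypothetical names; this is not the API of the linked project). The agent proposes a tool call, and a deterministic rule set approves or rejects it before anything executes, so a jailbroken prompt cannot talk its way past the check:

    import re

    # Hypothetical deny/allow rules; real deployments would load these
    # from a policy config rather than hardcoding them.
    DENY_PATTERNS = [
        re.compile(r"\brm\s+-rf\b"),        # destructive shell commands
        re.compile(r"\btransfer_funds\b"),  # financial side effects
    ]
    ALLOWED_TOOLS = {"search", "read_file"}

    def check(tool_name: str, args: str) -> bool:
        """Return True only if the call passes every deterministic rule."""
        if tool_name not in ALLOWED_TOOLS:
            return False
        return not any(p.search(args) for p in DENY_PATTERNS)

    def execute(tool_name: str, args: str) -> None:
        # The gate sits between the agent and the execution environment:
        # no tool runs unless the policy check passes.
        if not check(tool_name, args):
            raise PermissionError(f"Policy blocked {tool_name}: {args!r}")
        # ... dispatch to the real tool here ...

The key property is that the check is plain code: it behaves identically no matter how the prompt is phrased, which is exactly what the model-side guardrails in our benchmark failed to guarantee.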
This is actually why I started building a product around that idea, and I just published a Show HN for it. You can check it out, and if you are interested I can give you credits on the platform, which is dedicated to restricting unsafe behaviors and decisions of AI agents:
https://news.ycombinator.com/item?id=47501849