Interesting but I'm a bit lost. You are optimising but how do you know the ground truth of "good" and "bad"? Do you manually run the workflow and then decide based on a predefined metric?
Or do you rely on generic benchmarks?
https://github.com/datarobot/syftr/blob/main/docs/datasets.m...
You need custom QA pairs for custom scenarios.
A new OSS framework uses multi-objective Bayesian Optimization to efficiently search for Pareto-optimal RAG workflows, balancing cost, accuracy, and latency across configurations that would be impossible to test manually.
Useful links:
Github: https://github.com/datarobot/syftr
Paper: https://arxiv.org/abs/2505.20266
I am a member of the syftr team. Please feel free to ask questions.
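For intuition, the search loop is conceptually similar to a multi-objective Optuna study over flow parameters: propose a configuration, run it against a QA benchmark, record accuracy/cost/latency, and keep the Pareto front of non-dominated trials. A toy sketch of that idea (not syftr's actual API or search space; the evaluation and cost numbers are invented):

    import random
    import optuna

    # Toy stand-in for building a RAG flow and scoring it on a QA set.
    # A real evaluation would answer the benchmark questions and judge them.
    def evaluate_flow(llm, top_k, chunk_size):
        base = {"gpt-4o-mini": 0.70, "llama-3-8b": 0.62, "mixtral-8x7b": 0.66}[llm]
        accuracy = min(1.0, base + 0.01 * top_k + random.uniform(-0.02, 0.02))
        cost = 0.002 * top_k * (chunk_size / 256)   # $ per question (invented)
        latency = 0.5 + 0.05 * top_k                # seconds (invented)
        return accuracy, cost, latency

    def objective(trial):
        llm = trial.suggest_categorical("llm", ["gpt-4o-mini", "llama-3-8b", "mixtral-8x7b"])
        top_k = trial.suggest_int("top_k", 1, 20)
        chunk_size = trial.suggest_categorical("chunk_size", [256, 512, 1024])
        return evaluate_flow(llm, top_k, chunk_size)

    # Maximize accuracy, minimize cost and latency. With multiple directions,
    # Optuna keeps the Pareto front of non-dominated trials instead of one "best".
    study = optuna.create_study(directions=["maximize", "minimize", "minimize"])
    study.optimize(objective, n_trials=100)
    for t in study.best_trials:
        print(t.params, t.values)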
This looks super cool! Seems like a stronger statistics-based optimization strategy than https://docs.auto-rag.com/optimization/optimization.html.
I've got a few questions:
Could I ignore all flow runs whose estimated cost is above a certain threshold, so the overall cost of optimization stays lower? Say I pick an acceptable level of accuracy and skip some of the costlier exploration. Is there a risk it misses optimal configurations that would still fall under my cost limit? (There's a rough sketch of what I mean after my questions.)
How does the system deal with long context-length documents that some models can handle and others can't? Does this approach work for custom models?
Suppose I want to create and optimize for my own LLM-as-a-judge metrics like https://mastra.ai/en/docs/evals/textual-evals#available-metr..., how can I do this?
Are you going to flesh out the docs? Looks like the folder only has two markdown files right now.
Any recommendations for creating the initial QA dataset for benchmarking? Maybe create a basic RAG system, use its search results and generations as a baseline, and then have humans check and edit them to be more comprehensive and accurate? Any chance this is on the roadmap?
Cool stuff, I'm hoping this approach is more widely adopted!
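To make the first question concrete, this is roughly the budget gate I have in mind, written as an Optuna-style objective (the cost model and threshold are made up, and I don't know whether syftr exposes anything like it):

    import optuna

    COST_CEILING = 0.01  # $/question I'd tolerate during the search (made-up number)

    def objective(trial):
        top_k = trial.suggest_int("top_k", 1, 20)
        chunk_size = trial.suggest_categorical("chunk_size", [256, 512, 1024])
        estimated_cost = 0.002 * top_k * (chunk_size / 256)  # toy cost estimate
        if estimated_cost > COST_CEILING:
            # Skip the expensive evaluation entirely; report a zero score so the
            # sampler learns to avoid this region of the search space.
            return 0.0, estimated_cost
        accuracy = min(1.0, 0.6 + 0.01 * top_k)  # stand-in for really running the flow
        return accuracy, estimated_cost

    study = optuna.create_study(directions=["maximize", "minimize"])
    study.optimize(objective, n_trials=100)

My worry is whether a gate like this biases the sampler away from regions that still contain cheap-but-good flows it hasn't estimated accurately yet.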
Given section A7 in your paper: https://arxiv.org/pdf/2505.20266
...would it be accurate to say that syftr finds Pareto-optimal choices across cost, accuracy, and latency, where accuracy is decided by an LLM whose assessments are 90% correlated with those of human labelers?
Are there 3 objectives (cost, accuracy, and latency) or 2 (cost and accuracy)?
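For context, my mental model of the LLM-judge accuracy objective is something like the sketch below, just an illustration using the OpenAI client, not the paper's actual judge prompt or setup:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set; the model choice is arbitrary

    JUDGE_PROMPT = """You are grading a RAG system's answer.
    Question: {question}
    Reference answer: {reference}
    Candidate answer: {candidate}
    Reply with exactly one word: CORRECT or INCORRECT."""

    def judge(question: str, reference: str, candidate: str) -> bool:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                question=question, reference=reference, candidate=candidate)}],
        )
        return resp.choices[0].message.content.strip().upper().startswith("CORRECT")

    # Accuracy over a QA set is then just the fraction of answers judged CORRECT.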
It sounds impossible to be Pareto-optimal for complicated problems. How do you know GPT-4o-mini would be optimal? I feel like there's always room left on the table for a potential GPT-5o-mini to be more optimal. The solution space of possible gen AI models is gigantic, so we can only improve our solution over time and never find the most optimal one.
Yes, maybe theoretically. Practically, though, you have to ship your agent with the LLMs that are available today, and you need to pick one. I don't think the authors were trying to solve for “best forever”; that probably wasn't their intent. For that you'd need some kind of proof that a theoretical maximum has been reached, and proofs like that aren't a thing in most applied computer science fields.
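To be concrete about what “Pareto-optimal” means here: it's the set of non-dominated configurations among the ones that were actually evaluated, not a claim about every model that could ever exist. A tiny illustration with invented numbers:

    def dominates(a, b):
        # a dominates b if it is no worse on every objective and strictly better on one.
        # Convention: higher accuracy is better; lower cost and latency are better.
        acc_a, cost_a, lat_a = a
        acc_b, cost_b, lat_b = b
        no_worse = acc_a >= acc_b and cost_a <= cost_b and lat_a <= lat_b
        strictly_better = acc_a > acc_b or cost_a < cost_b or lat_a < lat_b
        return no_worse and strictly_better

    def pareto_front(results):
        return [r for r in results if not any(dominates(o, r) for o in results if o is not r)]

    # (accuracy, $ per 100 calls, seconds) -- invented numbers
    flows = [(0.82, 1.20, 3.1), (0.80, 0.30, 1.2), (0.79, 0.35, 1.5), (0.62, 0.05, 0.8)]
    print(pareto_front(flows))  # (0.79, 0.35, 1.5) drops out: dominated by (0.80, 0.30, 1.2)

When a better model ships, you re-run the search and the front moves; the claim is only ever relative to the candidates you gave it.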
looks interesting!
and exhausting.