The usual benchmarks for language models—Exact Match, F1, and even multi-hop QA datasets—weren’t designed to measure what matters most about persistent AI memory: connecting concepts across time, documents, and contexts.
We just completed our most extensive internal evaluation of cognee to date, using HotPotQA as a baseline. While the results showed strong gains, they also reinforced a growing realization: we need better ways to evaluate how AI memory systems actually perform.
We ran cognee through 45 evaluation cycles on 24 questions from HotPotQA, using GPT-4o for the analysis. Every stage of the pipeline (cognification, answer generation, and answer evaluation) inherits the variance in the model's output, and on small runs we saw especially large swings across metrics, which is why we chose the repeated, end-to-end approach.
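For readers who want the shape of that setup, here is a minimal sketch of a repeated, end-to-end evaluation loop. The three pipeline stages are passed in as plain callables because the concrete cognee calls and the judging prompt are not shown here; treat this as an illustration of the methodology, not our actual harness.

```python
import statistics
from typing import Callable, Sequence


def run_benchmark(
    corpus: Sequence[str],                          # documents to load into memory
    questions: Sequence[str],
    references: Sequence[str],
    build_memory: Callable[[Sequence[str]], None],  # cognification stage (hypothetical hook)
    answer: Callable[[str], str],                   # answer-generation stage
    judge: Callable[[str, str], float],             # answer-evaluation stage, e.g. a judge returning a score in [0, 1]
    cycles: int = 45,
) -> dict:
    """Run the whole pipeline end to end many times and report the spread,
    since every stage inherits variance from the underlying LLM."""
    cycle_scores = []
    for _ in range(cycles):
        build_memory(corpus)  # rebuild memory each cycle so this stage's variance is measured too
        per_question = [judge(answer(q), ref) for q, ref in zip(questions, references)]
        cycle_scores.append(statistics.mean(per_question))
    return {
        "mean": statistics.mean(cycle_scores),
        "stdev": statistics.stdev(cycle_scores),
    }
```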
We compared results using the same questions and setup with:
Mem0
LightRAG
Graphiti
While they are standard in QA, EM and F1 scores reward surface-level overlap and miss the core value proposition of AI memory systems. For example, a syntactically perfect answer can be factually wrong, and a fuzzy-but-correct response can be penalized for missing the reference phrasing.
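To make that failure mode concrete, here is a simplified, SQuAD-style EM and token-level F1 scorer (no punctuation or article normalization) applied to an illustrative answer pair. The exact numbers depend on the normalization used, but the pattern holds: a correct answer phrased in the model's own words gets zero EM and only partial F1.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> int:
    return int(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Arthur's Magazine"
prediction = "Arthur's Magazine was started first, in 1844."  # correct, but freely phrased

print(exact_match(prediction, reference))          # 0 -- penalized despite being right
print(round(token_f1(prediction, reference), 2))   # ~0.44 -- partial credit only
```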
Another issue is that LLMs themselves are inconsistent: the same prompt can yield different answers from run to run, so any single-run score carries noise.
Even HotPotQA assumes all relevant information sits neatly in two paragraphs. That’s not how memory works. Real-world AI memory systems need to link information across documents, conversations, and knowledge domains, and that kind of linking is exactly what traditional QA benchmarks fail to capture.
Consider the difference:
Traditional QA:
“What year was the company that acquired X founded?”
Memory Challenge:
“How do the concerns raised in last month’s security review relate to the authentication changes discussed in the architecture meeting three weeks ago?”
Only one of these tests long-term knowledge, reasoning across sources, and organizational memory—care to guess which one?
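As a rough sketch of what the second question requires, the snippet below builds a tiny cross-document graph with networkx. The node and relation names are invented for illustration, and real memory systems build far richer structures, but the core operation is the same: finding a path between facts that never co-occurred in a single document.

```python
import networkx as nx

# Hypothetical memory graph: nodes come from different documents and meetings.
memory = nx.DiGraph()
memory.add_edge("security_review_2024_05", "token_expiry_concern", relation="raised")
memory.add_edge("architecture_meeting_2024_06", "auth_refactor", relation="discussed")
memory.add_edge("auth_refactor", "token_expiry_concern", relation="addresses")

# The memory question: how does last month's concern relate to the auth changes?
path = nx.shortest_path(
    memory.to_undirected(),
    "security_review_2024_05",
    "architecture_meeting_2024_06",
)
print(" -> ".join(path))
# security_review_2024_05 -> token_expiry_concern -> auth_refactor -> architecture_meeting_2024_06
```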
We are working on a new dataset and benchmarks to measure memory, and would love feedback!