Before building an evaluation environment, we need to define what we’re measuring.
What is agent memory?
At its core, agent memory is any information that persists across interactions and influences future behavior. This includes:
- User preferences — how someone likes to work, what they care about
- Factual knowledge — things learned during conversations (project context, domain knowledge)
- Procedural memory — patterns of behavior that improve over time (coding style, communication preferences)
- Episodic memory — specific past interactions that inform future ones
What makes memory “good”?
A memory system should be evaluated along five dimensions:
- Recall accuracy — when relevant memory exists, does the agent retrieve it?
- Precision — does the agent avoid surfacing irrelevant memories?
- Timeliness — does the agent know when memories are stale or outdated?
- Impact — does having the memory actually improve the agent’s output?
- Efficiency — what’s the cost (latency, tokens, storage) of the memory system?
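Three of these dimensions reduce to simple formulas once we have ground-truth labels. A sketch, assuming memories are identified by string ids and that we can score the same task with and without a given memory:

```python
def recall_accuracy(retrieved: set[str], relevant: set[str]) -> float:
    """Of the memories that were relevant, how many did the agent surface?"""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Of the memories the agent surfaced, how many were relevant?"""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 1.0

def impact(score_with_memory: float, score_without: float) -> float:
    """Improvement in output quality attributable to having the memory."""
    return score_with_memory - score_without

# Example: the agent retrieved m1 and m3, but only m1 and m2 were relevant.
print(recall_accuracy({"m1", "m3"}, {"m1", "m2"}))  # 0.5
print(precision({"m1", "m3"}, {"m1", "m2"}))        # 0.5
```

Timeliness and efficiency need more machinery (timestamps and staleness rules for the former, latency/token/storage accounting for the latter), so they are left out of this sketch.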
The evaluation challenge
The hard part is that memory quality is deeply contextual. A memory that’s invaluable in one conversation is noise in another. We need an evaluation framework that captures this nuance rather than reducing everything to simple retrieval metrics.
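One concrete consequence for the framework: relevance labels must be attached to (memory, context) pairs, not to memories alone. A toy illustration with hypothetical conversation ids:

```python
# The same memory is ground-truth relevant in one conversation and noise in another,
# so labels are keyed on (context, memory) pairs rather than on the memory itself.
memory = "User prefers tabs over spaces"
relevance_labels = {
    ("conv_code_review", memory): True,    # invaluable when formatting code
    ("conv_travel_plans", memory): False,  # pure noise here
}
```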
Next steps
Build a controlled evaluation environment where we can:
- Simulate multi-turn, multi-session agent interactions
- Inject known memories and verify they’re used correctly
- Measure the five dimensions above
- Compare different memory architectures head-to-head
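The environment above might be skeletoned as follows. Everything here is an assumption about the eventual design: the `agent` interface (an `observe` method and a `retrieved` accessor), the `EvalCase` fields, and the restriction to recall and precision are placeholders, not a committed API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str

@dataclass
class Session:
    turns: list[Turn]

@dataclass
class EvalCase:
    """One scripted scenario: sessions to replay, memory ids to inject,
    and the memory ids the agent is expected to use at the end."""
    sessions: list[Session]
    injected_memories: list[str]
    expected_used: set[str]

def run_case(case: EvalCase, agent) -> dict[str, float]:
    """Replay multi-session interactions against a (hypothetical) agent,
    then score which injected memories it actually surfaced."""
    for session in case.sessions:
        agent.observe(session)
    used = agent.retrieved()  # memory ids surfaced in the final reply
    relevant = case.expected_used
    return {
        "recall": len(used & relevant) / len(relevant) if relevant else 1.0,
        "precision": len(used & relevant) / len(used) if used else 1.0,
    }
```

Comparing memory architectures head-to-head then means running the same list of `EvalCase`s against each candidate agent and aggregating the per-case scores.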