Before building an evaluation environment, we need to define what we’re measuring.
What is agent memory?
At its core, agent memory is any information that persists across interactions and influences future behavior. This includes:
- User preferences — how someone likes to work, what they care about
- Factual knowledge — things learned during conversations (project context, domain knowledge)
- Procedural memory — patterns of behavior that improve over time (coding style, communication preferences)
- Episodic memory — specific past interactions that inform future ones
What makes memory “good”?
A memory system should be evaluated along five dimensions:
- Recall accuracy — when relevant memory exists, does the agent retrieve it?
- Precision — does the agent avoid surfacing irrelevant memories?
- Timeliness — does the agent know when memories are stale or outdated?
- Impact — does having the memory actually improve the agent’s output?
- Efficiency — what’s the cost (latency, tokens, storage) of the memory system?
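Three of these dimensions reduce to simple formulas once we have ground-truth labels. A sketch, assuming memories are identified by string ids and that we can score the same task with and without a given memory:

```python
def recall_accuracy(retrieved: set[str], relevant: set[str]) -> float:
    """Of the memories that were relevant, how many did the agent surface?"""
    return len(retrieved & relevant) / len(relevant) if relevant else 1.0

def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Of the memories the agent surfaced, how many were relevant?"""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 1.0

def impact(score_with_memory: float, score_without: float) -> float:
    """Improvement in output quality attributable to having the memory."""
    return score_with_memory - score_without

# Example: the agent retrieved m1 and m3, but only m1 and m2 were relevant.
print(recall_accuracy({"m1", "m3"}, {"m1", "m2"}))  # 0.5
print(precision({"m1", "m3"}, {"m1", "m2"}))        # 0.5
```

Timeliness and efficiency need more machinery (timestamps and staleness rules for the former, latency/token/storage accounting for the latter), so they are left out of this sketch.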
The evaluation challenge
The hard part is that memory quality is deeply contextual. A memory that’s invaluable in one conversation is noise in another. We need an evaluation framework that captures this nuance rather than reducing everything to simple retrieval metrics.
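One concrete consequence for the framework: relevance labels must be attached to (memory, context) pairs, not to memories alone. A toy illustration with hypothetical conversation ids:

```python
# The same memory is ground-truth relevant in one conversation and noise in another,
# so labels are keyed on (context, memory) pairs rather than on the memory itself.
memory = "User prefers tabs over spaces"
relevance_labels = {
    ("conv_code_review", memory): True,    # invaluable when formatting code
    ("conv_travel_plans", memory): False,  # pure noise here
}
```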
Next steps
Build a controlled evaluation environment where we can:
- Simulate multi-turn, multi-session agent interactions
- Inject known memories and verify they’re used correctly
- Measure the five dimensions above
- Compare different memory architectures head-to-head
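The environment above might be skeletoned as follows. Everything here is an assumption about the eventual design: the `agent` interface (an `observe` method and a `retrieved` accessor), the `EvalCase` fields, and the restriction to recall and precision are placeholders, not a committed API.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str

@dataclass
class Session:
    turns: list[Turn]

@dataclass
class EvalCase:
    """One scripted scenario: sessions to replay, memory ids to inject,
    and the memory ids the agent is expected to use at the end."""
    sessions: list[Session]
    injected_memories: list[str]
    expected_used: set[str]

def run_case(case: EvalCase, agent) -> dict[str, float]:
    """Replay multi-session interactions against a (hypothetical) agent,
    then score which injected memories it actually surfaced."""
    for session in case.sessions:
        agent.observe(session)
    used = agent.retrieved()  # memory ids surfaced in the final reply
    relevant = case.expected_used
    return {
        "recall": len(used & relevant) / len(relevant) if relevant else 1.0,
        "precision": len(used & relevant) / len(used) if used else 1.0,
    }
```

Comparing memory architectures head-to-head then means running the same list of `EvalCase`s against each candidate agent and aggregating the per-case scores.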