⭐️This is Day 11 of the JAPAN AI Advent Calendar 2025⭐️

Self Introduction

I’m Avinash, and I lead the JAPAN AI Lab team. Our job is to take what’s happening at the frontier of agentic AI research and figure out what can actually be incorporated into our platform to move the needle for customers.

Today I’d like to talk about memory for AI agents, one of the few levers we can tune without touching model weights, simply by changing what goes into the context window: system prompts, retrieval, tools, memory, and history. Lately, at JAPAN AI Lab, we have focused on what makes a good memory for our agents and how to reuse past experience over time, drawing on ideas from reinforcement learning and neuroscience.

1. Introduction – Why Memory for AI Agents Is Hard

Almost everyone who has deployed an agent in production has seen the same pattern: sharp in the first few turns, increasingly forgetful as conversations and projects get longer. In enterprise settings, where projects span months and business logic changes constantly, an agent that forgets how your organization works creates real risk.

Scaling up context windows and simply appending previous conversations or turns helps, but only to a point. LongMemEval shows that chat assistants can lose around 30% accuracy when operating over realistic multi‑session histories instead of short excerpts (Wu et al., 2024). At the same time, a growing ecosystem of memory systems (Mem0, Zep, MemGPT, ENGRAM, and others) publishes strong benchmark numbers under different evaluation pipelines, making comparisons difficult.

For us at JAPAN AI, this is not abstract. In August 2025, we shipped Agent Memory (“エージェントメモリー”) as a feature of our AI agent products (JAPAN AI, 2025). It lets agents remember conversation contents, thought patterns, and work styles, and accumulate user insights as an organizational asset.

Building this feature forced us to confront three questions:

What do we mean by “memory”? Is it just longer context, or something closer to human episodic, semantic, and procedural memory?
How should we evaluate it? Benchmarks are valuable but fragile: small pipeline choices swing scores by double digits.
How should enterprise platforms actually implement memory? We need systems that are observable, governable, and robust to real‑world constraints.

This post organizes that space: a taxonomy grounded in cognitive science (Section 2), why we focus on memory extraction rather than storage formats (Section 3), reflection‑style methods on a concrete task (Section 4), case‑based reasoning and learnable selectors (Section 5), and where we think agent memory is heading (Section 6).

2. A Simple Lens for Agent Memory

Most systems that claim “memory” are really just larger prompts or RAG over recent turns. We distinguish three timescales—ephemeral (current prompt), task‑level (a single episode), and cross‑task (survives across runs)—and this post focuses on cross‑task memory.
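
To make the three timescales concrete, here is a minimal Python sketch. It is an illustration only, not how Agent Memory is implemented: the names MemoryScope, MemoryItem, and MemoryStore, and the end_turn / end_episode hooks, are hypothetical and chosen just for this example. The point is that each scope differs only in when its contents are discarded.

```python
from dataclasses import dataclass, field
from enum import Enum


class MemoryScope(Enum):
    """Hypothetical labels for the three timescales described above."""
    EPHEMERAL = "ephemeral"    # lives only in the current prompt
    TASK = "task"              # lives for a single episode
    CROSS_TASK = "cross_task"  # survives across runs


@dataclass
class MemoryItem:
    text: str
    scope: MemoryScope


@dataclass
class MemoryStore:
    """Toy in-process store; a real system would persist CROSS_TASK items."""
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, text: str, scope: MemoryScope) -> None:
        self.items.append(MemoryItem(text, scope))

    def end_turn(self) -> None:
        # Ephemeral items do not outlive the prompt they were built for.
        self.items = [m for m in self.items if m.scope is not MemoryScope.EPHEMERAL]

    def end_episode(self) -> None:
        # Only cross-task items survive once the episode is over.
        self.items = [m for m in self.items if m.scope is MemoryScope.CROSS_TASK]

    def texts(self) -> list[str]:
        return [m.text for m in self.items]
```

In this sketch, a larger context window or RAG over recent turns only ever touches the first two scopes. Whatever is still in the store after end_episode() is what this post means by cross‑task memory.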

Figure 1. Four types of cross‑task memory. Episodic stores concrete trajectories, semantic stores de‑contextualized facts, procedural stores reusable playbooks, and associative stores cue–value links (profiles, similarity scores).

Within cross‑task memory, we further distinguish memories by what they contain. Building on Tulving’s taxonomy (Tulving, 1972), we use: