Advanced Technical
LLM Memory Systems
Building persistent memory for LLM applications — short-term vs long-term memory, vector-based recall, summarization memory, and memory-augmented reasoning
Published: 2026-04-23 · Last updated: 2026-04-13
LLM Memory Systems
LLMs are inherently stateless — each request is independent of all others. Building memory systems transforms them from one-shot responders into persistent, context-aware assistants that learn and adapt over time. This guide covers memory architectures from simple context windows to sophisticated multi-layer memory systems.
The Memory Hierarchy
Human-inspired memory layers for LLM systems:
┌─────────────────────────────────────────────┐
│ Sensory Memory (Immediate) │ Current token generation
│ Duration: milliseconds │
├─────────────────────────────────────────────┤
│ Working Memory (Context Window) │ Active conversation
│ Duration: single session │
│ Capacity: 4K - 200K tokens │
├─────────────────────────────────────────────┤
│ Short-Term Memory (Recent History) │ Recent conversations
│ Duration: hours to days │
│ Storage: Redis / in-memory cache │
├─────────────────────────────────────────────┤
│ Long-Term Memory (Semantic Store) │ Facts, preferences, patterns
│ Duration: weeks to months │
│ Storage: Vector database │
├─────────────────────────────────────────────┤
│ Procedural Memory (Learned Behavior) │ System prompts, fine-tuning
│ Duration: permanent │
│ Storage: Config files, model weights │
└─────────────────────────────────────────────┘
Working Memory Management
Context Window Optimization
The most immediate form of memory is the context window itself. Effective management is critical:
class ContextWindowManager:
"""Manages the conversation context within token limits."""
def __init__(self, max_tokens: int = 128_000, reserve_tokens: int = 4096):
self.max_tokens = max_tokens - reserve_tokens # Reserve for response
self.messages: list[dict] = []
self.token_counts: list[int] = []
def add_message(self, role: str, content: str, tokenizer) -> bool:
"""Add a message, trimming if necessary to stay within limits."""
msg_tokens = tokenizer.count_tokens(content)
# If single message exceeds limit, truncate it
if msg_tokens > self.max_tokens:
content = tokenizer.truncate(content, max_tokens=self.max_tokens)
msg_tokens = self.max_tokens
# Trim oldest messages until we fit
total_tokens = sum(self.token_counts) + msg_tokens
while total_tokens > self.max_tokens and self.messages:
removed = self.messages.pop(0)
removed_tokens = self.token_counts.pop(0)
total_tokens -= removed_tokens
self.messages.append({"role": role, "content": content})
self.token_counts.append(msg_tokens)
return True
def get_context(self) -> list[dict]:
"""Return the current context for the LLM."""
return self.messages.copy()
def estimated_tokens(self) -> int:
return sum(self.token_counts)
def remaining_capacity(self) -> int:
return self.max_tokens - self.estimated_tokens()
Sliding Window vs. Summarization
Two main strategies for managing conversation history:
class SlidingWindowMemory:
"""Keep the most recent N messages, drop the rest."""
def __init__(self, window_size: int = 20):
self.window_size = window_size
self.messages: list[dict] = []
def add(self, message: dict):
self.messages.append(message)
if len(self.messages) > self.window_size:
self.messages = self.messages[-self.window_size:]
def get(self) -> list[dict]:
return self.messages
class SummarizationMemory:
"""Summarize old conversation, keep recent messages raw."""
def __init__(self, llm, summarize_threshold: int = 15, keep_recent: int = 5):
self.llm = llm
self.summarize_threshold = summarize_threshold
self.keep_recent = keep_recent
self.messages: list[dict] = []
self.summary: str = ""
async def add(self, message: dict):
self.messages.append(message)
if len(self.messages) > self.summarize_threshold:
# Summarize the oldest messages
to_summarize = self.messages[:self.summarize_threshold - self.keep_recent]
new_summary = await self._summarize(to_summarize)
# Update summary and keep only recent messages
if self.summary:
self.summary = await self._merge_summaries(self.summary, new_summary)
else:
self.summary = new_summary
self.messages = self.messages[self.summarize_threshold - self.keep_recent:]
async def get(self) -> list[dict]:
context = []
if self.summary:
context.append({
"role": "system",
"content": f"Summary of earlier conversation: {self.summary}",
})
context.extend(self.messages)
return context
async def _summarize(self, messages: list[dict]) -> str:
prompt = f"""Summarize this conversation in 3-5 sentences, capturing key facts, decisions, and user preferences:
{messages}
Summary:"""
response = await self.llm.generate(prompt, max_tokens=200)
return response.text.strip()
async def _merge_summaries(self, old_summary: str, new_summary: str) -> str:
prompt = f"""Merge these two conversation summaries into one coherent summary:
Previous summary: {old_summary}
New summary: {new_summary}
Merged summary:"""
response = await self.llm.generate(prompt, max_tokens=300)
return response.text.strip()
Long-Term Memory with Vector Storage
Memory Storage and Retrieval
class LongTermMemory:
"""Persistent memory store using vector embeddings for semantic retrieval."""
def __init__(self, vector_db, embedding_fn, user_id: str):
self.vector_db = vector_db
self.embedding_fn = embedding_fn
self.user_id = user_id
async def store(self, text: str, metadata: dict = None):
"""Store a memory with semantic embedding."""
embedding = self.embedding_fn(text)
doc = {
"embedding": embedding,
"text": text,
"metadata": {
**(metadata or {}),
"user_id": self.user_id,
"timestamp": datetime.utcnow().isoformat(),
},
}
await self.vector_db.insert(doc)
async def retrieve(self, query: str, top_k: int = 5) -> list[dict]:
"""Retrieve relevant memories for the current query."""
query_embedding = self.embedding_fn(query)
results = await self.vector_db.search(
embedding=query_embedding,
top_k=top_k,
filter={"user_id": self.user_id},
)
return [{"text": r["text"], "metadata": r["metadata"]} for r in results]
async def store_conversation(self, messages: list[dict]):
"""Extract and store key facts from a conversation."""
facts = await self._extract_facts(messages)
for fact in facts:
await self.store(fact, metadata={"type": "fact", "source": "conversation"})
async def _extract_facts(self, messages: list[dict]) -> list[str]:
"""Use LLM to extract factual information from conversation."""
prompt = f"""Extract all factual information about the user from this conversation.
Focus on: preferences, personal details, goals, constraints, and learned patterns.
Conversation:
{messages}
Return one fact per line. Only include information that would be useful to remember for future conversations.
If there are no facts, return 'None.'"""
response = await self.llm.generate(prompt, max_tokens=500)
facts = [line.strip() for line in response.text.strip().split("\n") if line.strip() and line.strip() != "None"]
return facts
Memory Consolidation
Like human sleep-based memory consolidation, periodically reorganize and compress memories:
class MemoryConsolidator:
"""Periodically consolidate, deduplicate, and organize memories."""
def __init__(self, vector_db, llm, embedding_fn):
self.vector_db = vector_db
self.llm = llm
self.embedding_fn = embedding_fn
async def consolidate(self, user_id: str):
"""Run memory consolidation for a user."""
# 1. Retrieve all memories
all_memories = await self.vector_db.get_all(user_id=user_id)
# 2. Cluster similar memories
clusters = await self._cluster_memories(all_memories)
# 3. For each cluster, create a consolidated memory
for cluster in clusters:
if len(cluster) > 1:
# Merge multiple similar memories into one
consolidated = await self._merge_memories(cluster)
# Delete the originals
for memory in cluster:
await self.vector_db.delete(memory["id"])
# Store the consolidated memory
await self.vector_db.insert({
"embedding": self.embedding_fn(consolidated),
"text": consolidated,
"metadata": {"type": "consolidated", "user_id": user_id},
})
# 4. Delete stale/low-value memories
await self._prune_stale_memories(user_id)
async def _cluster_memories(self, memories: list[dict]) -> list[list[dict]]:
"""Cluster similar memories using embedding similarity."""
from sklearn.cluster import DBSCAN
import numpy as np
embeddings = np.array([m["embedding"] for m in memories])
clustering = DBSCAN(eps=0.3, min_samples=2, metric="cosine").fit(embeddings)
clusters: dict[int, list] = {}
for memory, label in zip(memories, clustering.labels_):
clusters.setdefault(label, []).append(memory)
return list(clusters.values())
async def _merge_memories(self, cluster: list[dict]) -> str:
"""Merge a cluster of memories into a single consolidated memory."""
texts = [m["text"] for m in cluster]
prompt = f"""Consolidate these related memories into a single, comprehensive statement:
{chr(10).join(f"- {t}" for t in texts)}
Consolidated memory:"""
response = await self.llm.generate(prompt, max_tokens=200)
return response.text.strip()
async def _prune_stale_memories(self, user_id: str, max_age_days: int = 90):
"""Remove memories that are old and haven't been accessed."""
cutoff = (datetime.utcnow() - timedelta(days=max_age_days)).isoformat()
stale = await self.vector_db.query(
f"SELECT id FROM memories WHERE user_id='{user_id}' AND timestamp < '{cutoff}' AND access_count = 0"
)
for memory in stale:
await self.vector_db.delete(memory["id"])
Memory-Augmented Reasoning
Retrieval-Augmented Reasoning
Integrate long-term memory directly into the reasoning process:
class MemoryAugmentedAgent:
"""An agent that uses long-term memory to enhance its reasoning."""
def __init__(self, llm, memory: LongTermMemory, max_memories: int = 5):
self.llm = llm
self.memory = memory
self.max_memories = max_memories
async def respond(self, user_message: str, conversation_history: list[dict] = None) -> str:
"""Generate a response enhanced with relevant memories."""
# Step 1: Retrieve relevant memories
memories = await self.memory.retrieve(user_message, top_k=self.max_memories)
# Step 2: Construct enhanced prompt
system_prompt = self._build_system_prompt(memories)
messages = [
{"role": "system", "content": system_prompt},
]
if conversation_history:
messages.extend(conversation_history)
messages.append({"role": "user", "content": user_message})
# Step 3: Generate response
response = await self.llm.chat(messages)
# Step 4: Store new information from this interaction
await self.memory.store_conversation(
[{"role": "user", "content": user_message}, {"role": "assistant", "content": response}]
)
return response
def _build_system_prompt(self, memories: list[dict]) -> str:
memory_context = ""
if memories:
memory_context = "\nRelevant information you know about this user:\n"
for m in memories:
memory_context += f"- {m['text']}\n"
return f"""You are a personalized assistant that remembers details about the user.{memory_context}
Use these memories to provide more relevant and personalized responses. Only reference memories when they are directly relevant to the user's question.
Do not mention that you are referencing stored memories — just use the information naturally."""
Episodic vs. Semantic Memory
class DualMemorySystem:
"""Separate episodic (event-based) and semantic (fact-based) memory."""
def __init__(self, episodic_db, semantic_db, embedding_fn):
self.episodic_db = episodic_db # Stores specific events
self.semantic_db = semantic_db # Stores general facts
self.embedding_fn = embedding_fn
async def store_experience(self, event: str, metadata: dict = None):
"""Store a specific event in episodic memory."""
embedding = self.embedding_fn(event)
await self.episodic_db.insert({
"embedding": embedding,
"text": event,
"metadata": {"type": "episodic", **(metadata or {})},
})
async def store_fact(self, fact: str, confidence: float = 1.0):
"""Store a general fact in semantic memory."""
embedding = self.embedding_fn(fact)
await self.semantic_db.insert({
"embedding": embedding,
"text": fact,
"metadata": {"type": "semantic", "confidence": confidence},
})
async def retrieve_context(self, query: str) -> dict:
"""Retrieve both episodic and semantic context for a query."""
episodic = await self.episodic_db.search(self.embedding_fn(query), top_k=3)
semantic = await self.semantic_db.search(self.embedding_fn(query), top_k=5)
return {
"episodes": [e["text"] for e in episodic],
"facts": [{"text": s["text"], "confidence": s["metadata"].get("confidence", 1.0)} for s in semantic],
}
async def distill_facts_from_episodes(self):
"""Periodically extract general facts from episodic memories."""
recent_episodes = await self.episodic_db.get_recent(top_k=50)
if not recent_episodes:
return
episode_texts = [e["text"] for e in recent_episodes]
prompt = f"""From these experiences, extract general facts and preferences that would be useful to remember:
{chr(10).join(f"- {t}" for t in episode_texts)}
Extract facts (one per line). Only include information that generalizes beyond the specific event:"""
response = await self.llm.generate(prompt, max_tokens=300)
for line in response.text.strip().split("\n"):
if line.strip():
await self.store_fact(line.strip(), confidence=0.8)
Memory Evaluation
| Metric | Description | Target |
|---|---|---|
| Recall Precision | Retrieved memories are relevant to the query | >70% |
| Recall Coverage | Relevant memories are found among all stored | >50% |
| Memory Freshness | How quickly new information is stored | <5 seconds |
| Consolidation Quality | Merged memories preserve key information | Subjective eval |
| Personalization Score | Responses improve with memory vs. without | >20% improvement |
| Memory Bloat | Redundant/duplicate memories stored | <15% |
Production Architecture
User Request
│
▼
┌───────────────────────┐
│ Memory Retrieval │ Query both episodic and semantic
│ ├─ Episodic (recent) │
│ └─ Semantic (facts) │
└──────────┬────────────┘
▼
┌───────────────────────┐
│ Context Assembly │ Combine memories into prompt
│ ├─ System prompt │
│ ├─ Relevant memories │
│ └─ Conversation hist │
└──────────┬────────────┘
▼
┌──────────────┐
│ LLM │ Generate with memory context
└──────┬───────┘
▼
┌───────────────────────┐
│ Memory Extraction │ Extract facts from interaction
│ ├─ Store new facts │
│ └─ Update episodes │
└───────────────────────┘
class ProductionMemorySystem:
"""Production-ready memory system with all components."""
def __init__(self, config: dict):
self.working_memory = SummarizationMemory(
llm=config["llm"],
summarize_threshold=config.get("summarize_threshold", 15),
keep_recent=config.get("keep_recent", 5),
)
self.long_term_memory = LongTermMemory(
vector_db=config["vector_db"],
embedding_fn=config["embedding_fn"],
)
self.consolidator = MemoryConsolidator(
vector_db=config["vector_db"],
llm=config["llm"],
embedding_fn=config["embedding_fn"],
)
async def process(self, user_id: str, message: str) -> str:
# 1. Retrieve long-term memories
memories = await self.long_term_memory.retrieve(message, top_k=5)
# 2. Add to working memory
await self.working_memory.add({"role": "user", "content": message})
# 3. Build prompt with all memory layers
context = await self.working_memory.get()
context = self._inject_memories(context, memories)
# 4. Generate response
response = await self.llm.chat(context)
# 5. Store new information
await self.working_memory.add({"role": "assistant", "content": response})
await self.long_term_memory.store_conversation(
[{"role": "user", "content": message}, {"role": "assistant", "content": response}]
)
return response
def schedule_consolidation(self, user_id: str, interval_hours: int = 24):
"""Schedule periodic memory consolidation."""
import asyncio
async def consolidate_loop():
while True:
await self.consolidator.consolidate(user_id)
await asyncio.sleep(interval_hours * 3600)
asyncio.create_task(consolidate_loop())
def _inject_memories(self, context: list[dict], memories: list[dict]) -> list[dict]:
"""Inject retrieved memories into the conversation context."""
if not memories:
return context
memory_text = "Relevant context:\n" + "\n".join(f"- {m['text']}" for m in memories)
# Insert as a system message at the beginning
system_msg = context[0] if context[0]["role"] == "system" else None
if system_msg:
system_msg["content"] += f"\n\n{memory_text}"
else:
context.insert(0, {"role": "system", "content": memory_text})
return context
Cross-References
- For agent architectures that use memory, see AI Agent Architectures
- For context window management techniques, see Context Window Management
- For vector databases used in memory storage, see Vector Databases Comparison
- For RAG-based memory retrieval, see RAG — Retrieval-Augmented Generation
Related docs
Vector Databases Comparison
Deep comparison of FAISS, Pinecone, Weaviate, Milvus, Chroma, and pgvector — performance characteristics, scaling guides, and selection guidance
Embeddings & Semantic Search
Building production semantic search systems — embedding model selection, indexing strategies, query processing, relevance tuning, and hybrid search
Data Platform Memory Agent Implementation Guide
Architecture, workflow design, metrics, and rollout guidance for a data platform memory agent in production.
Related agents
Data Platform Memory Agent
Data Platform agent blueprint focused on maintain durable task state, summarize interaction history, and preserve only the context worth carrying forward for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.
Developer Productivity Memory Agent
Developer Productivity agent blueprint focused on maintain durable task state, summarize interaction history, and preserve only the context worth carrying forward for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.
Finance Operations Memory Agent
Finance Operations agent blueprint focused on maintain durable task state, summarize interaction history, and preserve only the context worth carrying forward for finance teams need faster reconciliation, exception review, and policy-aware reporting for recurring operational workflows.