Fundamentals

Context Window and Long-Context Understanding

How context windows work, techniques for extending them, and strategies for managing long documents with LLMs

Published: 2026-04-05 · Last updated: 2026-04-13

Context Window and Long-Context Understanding

The context window is the maximum number of tokens (input + output) an LLM can process in a single forward pass. It's one of the most important practical constraints when working with LLMs.

Context Window Evolution

Model	Context Window	Year
GPT-2	1,024	2019
GPT-3	2,048 → 4,096	2020
Claude 2	100,000	2023
GPT-4 Turbo	128,000	2023
Claude 3	200,000	2024
Gemini 1.5 Pro	1,000,000	2024
Claude 3.5/4	200,000	2024-2025
GPT-4o	128,000	2024
Llama 3.1	128,000	2024
Qwen 2.5	128,000	2024

How Context Windows Work

The Attention Mechanism's Role

Self-attention computes pairwise interactions between all tokens:

# Attention complexity: O(n²) for sequence length n
# Each token computes attention scores with every other token

# For 128K context:
# Attention matrix = 128,000 × 128,000 = 16.4 billion entries
# At float16: 32.8 GB just for the attention matrix

Memory Breakdown for Long Context

For a 70B model with 128K context:

Component	Memory
Model weights (FP16)	140 GB
KV cache (128K tokens)	40-80 GB
Activations	20-40 GB
Total	200-260 GB

This is why long-context inference requires multiple high-memory GPUs.

Extending Context Beyond Training Length

Most models are trained on 4K-8K context but support much longer windows at inference time through position extrapolation.

Linear RoPE Scaling

# RoPE encodes position as rotation angles
# At inference time, scale the angles to extend range

def rope_scaling(positions, dim, base=10000, scaling_factor=8.0):
    """Scale RoPE frequencies to support longer sequences."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    # Apply scaling factor to extend range
    inv_freq = inv_freq / scaling_factor
    return torch.outer(positions, inv_freq)

YaRN (Yet another RoPE extensioN)

Combines RoPE scaling with attention temperature adjustment:

# YaRN: scale positions AND adjust attention temperature
# Supports 128× extrapolation with minimal quality loss

def yarn_attention(Q, K, V, scaling_factor, temperature):
    scaled_positions = torch.arange(seq_len) / scaling_factor
    # Apply scaled RoPE
    Q = apply_rope(Q, scaled_positions)
    K = apply_rope(K, scaled_positions)
    # Adjust temperature for scaled attention
    scores = torch.matmul(Q, K.T) / (temperature * scaling_factor)
    return torch.softmax(scores, dim=-1) @ V

NTK-Aware Scaling

Interpolates between training and target position encodings:

# NTK-aware: dynamically adjust base frequency
def ntk_aware_rope(context_length, trained_length, base=10000):
    scaling = context_length / trained_length
    # NTK-aware adjustment (less aggressive than linear)
    new_base = base * (scaling ** (dim / (dim - 2))) ** (2 / dim)
    return new_base

The "Lost in the Middle" Problem

Research shows that LLMs are less accurate when key information appears in the middle of long contexts:

Beginning of context  ◄──── Strong recall ────►  End of context
                              │
                          Weaker recall
                        (middle section)

Mitigation strategies:

RAG retrieval: Put most relevant chunks first and last
Summarize-then-answer: Compress middle sections
Windowed attention: Process context in overlapping windows

Practical Context Management

Estimating Context Usage

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_context_usage(messages: list[dict], max_tokens: int = 128000) -> dict:
    total = 0
    for msg in messages:
        total += len(enc.encode(str(msg)))
    
    return {
        "tokens_used": total,
        "tokens_remaining": max_tokens - total,
        "percentage_used": f"{(total/max_tokens)*100:.1f}%",
        "approx_words": total // 1.3  # Rough conversion
    }

Truncation Strategies

When context exceeds limits:

Strategy	Description	Best For
First-N	Keep beginning, drop end	Documents where key info is early
Last-N	Keep end, drop beginning	Conversations (recent context matters most)
Sliding Window	Keep most recent N tokens	Streaming conversations
Summarize Old	Compress older content	Long documents
Selective Retrieval	Embed + retrieve relevant chunks	Knowledge bases (RAG)

def sliding_window_context(messages: list[dict], 
                           max_tokens: int, 
                           keep_system: bool = True) -> list[dict]:
    """Keep most recent messages that fit in context."""
    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    
    # Always keep system message
    system_msg = messages[0] if messages[0]["role"] == "system" else None
    remaining = messages[1:] if system_msg else messages[:]
    
    # Count system message tokens
    used = len(enc.encode(str(system_msg))) if system_msg else 0
    
    # Add messages from most recent until we hit limit
    result = []
    for msg in reversed(remaining):
        msg_tokens = len(enc.encode(str(msg)))
        if used + msg_tokens > max_tokens:
            break
        result.append(msg)
        used += msg_tokens
    result.reverse()
    
    if system_msg:
        result = [system_msg] + result
    return result

Long-Context Benchmarks

Benchmark	Task	Measures
Needle in a Haystack	Find specific fact in long text	Retrieval accuracy
RULER	Multiple long-context tasks	Comprehensive
LongBench	6 task categories	Cross-model comparison
InfiniteBench	Up to 2M tokens	Extreme context

Key Takeaways

Context windows have grown from 4K to 1M+ tokens in 5 years
The O(n²) attention complexity makes long context expensive
Position extrapolation (RoPE scaling) extends context beyond training length
"Lost in the middle" effect means information placement matters
RAG is often more practical than raw long-context for large knowledge bases

Transformer Architecture — How attention works under the hood
RAG Systems — Retrieval-based context management
Inference Optimization — Memory management for long contexts

Related docs

LLM Memory Systems

Building persistent memory for LLM applications — short-term vs long-term memory, vector-based recall, summarization memory, and memory-augmented reasoning

Embeddings & Semantic Search

Building production semantic search systems — embedding model selection, indexing strategies, query processing, relevance tuning, and hybrid search

Attention Mechanisms Variants

A deep technical survey of attention variants — from scaled dot-product to FlashAttention, linear attention, and state space alternatives

Related agents

Data Platform agent blueprint focused on maintain durable task state, summarize interaction history, and preserve only the context worth carrying forward for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Data Platform Retrieval Agent

Data Platform agent blueprint focused on find the right internal knowledge quickly and package it into grounded context for downstream responses or actions for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Developer Productivity Memory Agent

Developer Productivity agent blueprint focused on maintain durable task state, summarize interaction history, and preserve only the context worth carrying forward for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.

Context Window and Long-Context Understanding

Related docs

LLM Memory Systems

Embeddings & Semantic Search

Attention Mechanisms Variants

Related agents

Data Platform Memory Agent

Data Platform Retrieval Agent

Developer Productivity Memory Agent