Architecture & Training

Emergent Capabilities and Reasoning

Understanding how complex behaviors emerge at scale — chain of thought, planning, tool use, and the debate over whether LLMs truly reason

Published: 2026-04-10 · Last updated: 2026-04-13

Emergent Capabilities and Reasoning

One of the most remarkable aspects of LLMs is that certain capabilities appear only at specific scale thresholds — they cannot be predicted by extrapolating from smaller models. These "emergent abilities" include chain-of-thought reasoning, multi-step planning, and genuine tool use.

What Are Emergent Capabilities?

Emergent capabilities are behaviors that:

Are essentially absent in smaller models (< 10B parameters)
Appear sharply as model scale increases
Were not explicitly trained for

Capability Accuracy ▲
    │
100%│                              ┌────── Emergent (step function)
    │                            ┌─┘
 80%│                          ┌─┘
    │                        ┌─┘
 60%│                      ┌─┘
    │                    ┌─┘
 40%│  ────────────────┌─┘         ──── Smooth scaling (predictable)
    │
 20%│
    │
  0%└────┬─────┬─────┬─────┬─────┬─────► Model Size
       1B    10B   100B  1T    10T
              ↑
         Emergence threshold

Important debate: Some researchers argue emergence is an illusion caused by measuring accuracy instead of loss. When measured by loss, capabilities may scale smoothly but appear sudden on accuracy metrics.

Key Emergent Capabilities

1. Chain-of-Thought Reasoning

Question: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. 
           Each can has 3 tennis balls. How many does he have now?"

Small model: "5 + 2 = 7" (wrong — doesn't account for 3 balls per can)

Large model with CoT: 
  "Roger starts with 5 balls.
   He buys 2 cans, each with 3 balls.
   2 × 3 = 6 new balls.
   5 + 6 = 11.
   Answer: 11"

Trigger phrase: "Let's think step by step" increases accuracy on math word problems by 10-40% for models > 60B parameters but has negligible effect on smaller models.

2. Multi-Hop Reasoning

Question: "Who was the president of the country that hosted the 2008 Olympics?"

Requires:
1. Identify country: China hosted 2008 Olympics
2. Identify president: Who leads China? → President Xi Jinping
3. Combine: Answer is Xi Jinping

Small model: Answers with "Beijing" (misses the question)
Large model: Correctly chains the reasoning

3. Code Understanding and Generation

# Small model: generates syntactically incorrect or non-functional code
# Large model: generates working code with proper error handling

# User: "Write a function to find the longest common subsequence"

# Large model output (working):
def longest_common_subsequence(s1: str, s2: str) -> int:
    """Find length of LCS using dynamic programming."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    
    return dp[m][n]

4. Theory of Mind

Question: "Sally puts a marble in a basket and leaves. 
           Anne moves the marble to a box. 
           When Sally returns, where will she look for the marble?"

Small model: "In the box" (knows where the marble actually is)
Large model: "In the basket (Sally doesn't know Anne moved it)" (understands false belief)

5. In-Context Learning

# Without explicit training on the task format
Input: "glorft → glorfts, wug → wugs, blick → ?"
Output: "blicks" (correctly infers pluralization rule from 2 examples)

This capability — learning a new task from a few examples in the prompt — emerges around 10-30B parameters.

Enhancing Reasoning at Inference Time

Best-of-N Sampling

Generate N responses, select the best using a verifier:

def best_of_n(model, prompt, n=10, verifier=None):
    """Generate N responses, pick the best."""
    candidates = []
    for _ in range(n):
        response = model.generate(prompt, temperature=0.7)
        score = verifier(response) if verifier else model.self_score(response)
        candidates.append((score, response))
    
    return max(candidates, key=lambda x: x[0])[1]

Trade-off: N× compute cost for ~5-15% accuracy gain on reasoning tasks.

Self-Consistency

Generate multiple reasoning chains, take majority vote:

def self_consistency(model, prompt, n=40):
    """Multiple CoT chains, majority vote on final answer."""
    answers = []
    for _ in range(n):
        response = model.generate(prompt + "\nLet's think step by step.")
        answer = extract_final_answer(response)
        answers.append(answer)
    
    # Majority vote
    return Counter(answers).most_common(1)[0][0]

On GSM8K (math benchmark), self-consistency with 40 samples improves GPT-4 accuracy by ~8%.

Tree of Thoughts

Explore a tree of reasoning paths, backing up from dead ends:

def tree_of_thoughts(model, prompt, max_depth=5, beam_width=3):
    """Systematically explore multiple reasoning paths."""
    # Level 0: generate initial thoughts
    thoughts = model.generate_thoughts(prompt, k=beam_width)
    
    for depth in range(max_depth):
        new_thoughts = []
        for thought in thoughts:
            # Evaluate current path
            score = model.evaluate_state(thought)
            if is_solution(thought):
                return thought
            
            # Expand promising states
            if score > threshold:
                new_thoughts.extend(
                    model.generate_thoughts(thought, k=beam_width)
                )
        
        # Keep top beam_width states
        thoughts = sorted(new_thoughts, key=score, reverse=True)[:beam_width]
    
    return thoughts[0]  # Return best found

Reasoning Models

OpenAI's o1/o3 and DeepSeek's R1 represent a new paradigm: models specifically optimized for reasoning through:

Process supervision: Reward correct reasoning steps, not just correct answers
Extended generation: Models "think" for longer before answering
Self-reflection: Models check and revise their own reasoning

User: "Prove that √2 is irrational."

o1/R1-style model:
[Thinking for 30 seconds / generating 2000+ tokens of reasoning]
"Proof by contradiction:
Assume √2 is rational. Then √2 = a/b where a,b are coprime integers.
Squaring: 2 = a²/b², so 2b² = a².
This means a² is even, so a is even. Let a = 2k.
Then 2b² = 4k², so b² = 2k², meaning b² is even, so b is even.
But if both a and b are even, they share a factor of 2, contradicting coprime.
Therefore √2 must be irrational. □"

The Reasoning Debate

Do LLMs truly reason, or just pattern-match?

Position	Argument	Evidence
They reason	Solve novel problems, chain logic, generalize	Performance on unseen math proofs
They pattern-match	All outputs are interpolation of training data	Fail on simple out-of-distribution tasks
Pragmatic view	The distinction may not matter for applications	Results are what matter

Key Takeaways

Emergent capabilities appear suddenly at scale, particularly reasoning and code generation
Chain-of-thought prompting is the simplest way to enhance reasoning
Best-of-N and self-consistency trade compute for accuracy
Reasoning models (o1, R1) optimize for deep thinking rather than fast responses
Whether LLMs "truly reason" is philosophically interesting but practically secondary to their demonstrated capabilities

Prompt Engineering — CoT and advanced reasoning prompts
Scaling Laws — How capabilities scale with model size
Evaluation Metrics — Measuring reasoning ability

Related docs

AI Agent Architectures

Designing and building agent systems — ReAct, Plan-and-Execute, tool-augmented agents, multi-agent systems, memory architectures, and production patterns

Data Platform Planner Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a data platform planner agent in production.

Developer Productivity Planner Agent Implementation Guide

Architecture, workflow design, metrics, and rollout guidance for a developer productivity planner agent in production.

Related agents

Data Platform Planner Agent

Data Platform agent blueprint focused on break ambiguous work into explicit stages, dependencies, and success checks before any downstream execution happens for analysts and engineers need better query generation, pipeline debugging, and dataset explanation across changing schemas.

Developer Productivity Planner Agent

Developer Productivity agent blueprint focused on break ambiguous work into explicit stages, dependencies, and success checks before any downstream execution happens for engineering teams want reliable help with issue triage, runbook guidance, and change review without obscuring system ownership.

Finance Operations Planner Agent

Finance Operations agent blueprint focused on break ambiguous work into explicit stages, dependencies, and success checks before any downstream execution happens for finance teams need faster reconciliation, exception review, and policy-aware reporting for recurring operational workflows.