Tokenization and Embeddings

Understanding how LLMs convert text into numerical representations — tokenization algorithms, embedding spaces, and vocabulary design

Published: 2026-04-02 · Last updated: 2026-04-13

Before an LLM can process text, it must be converted from raw characters into numerical vectors. This two-step process — tokenization followed by embedding — is foundational to how all modern language models work.

Tokenization

Tokenization splits text into smaller units called tokens. Modern LLMs use subword tokenization, which balances vocabulary size with the ability to represent any word, including unknown ones.

Why Subword Tokenization?

| Approach | Example: "unhappiness" | Vocabulary Size | OOV Problem |
| --- | --- | --- | --- |
| Character-level | u-n-h-a-p-p-i-n-e-s-s (11 tokens) | ~100 | None |
| Word-level | unhappiness (1 token) | 500K+ | Unknown words fail |
| Subword | un-happi-ness (3 tokens) | 30K–200K | Rare |

Common Tokenization Algorithms

Byte-Pair Encoding (BPE)

BPE starts with individual characters and iteratively merges the most frequent adjacent pairs.

# Simplified BPE intuition
text = "low lower lowest"
# Step 1: character-level: l-o-w l-o-w-e-r l-o-w-e-s-t
# Step 2: merge most frequent pair: lo-w lo-w-e-r lo-w-e-s-t
# Step 3: continue: low low-er low-est

Used by: GPT series, RoBERTa, Llama
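
The merge loop above can be sketched in a few lines of Python. This is a toy illustration of BPE training (pair counting and merging only), not a production tokenizer:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# The corpus from the example above, as character tuples with counts
words = {tuple("low"): 1, tuple("lower"): 1, tuple("lowest"): 1}
for _ in range(2):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(pair)  # merges ('l', 'o'), then ('lo', 'w')
```

Real BPE implementations add pre-tokenization and byte-level fallback on top of this loop; libraries like Hugging Face `tokenizers` do it efficiently.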

WordPiece

Similar to BPE but merges based on likelihood rather than frequency.

Used by: BERT, DistilBERT, and many other Google models
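
At inference time, WordPiece segments a word by greedy longest-match-first lookup against the learned vocabulary, marking continuation pieces with `##`. A minimal sketch with a hand-picked toy vocabulary (the likelihood-based training step is not shown):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as used at WordPiece inference time."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        cur = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece          # continuation pieces are prefixed
            if piece in vocab:
                cur = piece
                break
            end -= 1                          # shrink until a match is found
        if cur is None:
            return ["[UNK]"]                  # no piece matched: unknown token
        tokens.append(cur)
        start = end
    return tokens

vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
```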

SentencePiece

Treats input as a raw character stream (whitespace included), so it can train on any language without pre-tokenization. Supports both BPE and unigram language-model algorithms.

Used by: T5, Llama 1 and 2, most modern multilingual models

Practical Tokenization

from transformers import AutoTokenizer

# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B")

# Tokenize text
text = "Tokenization is fundamental to NLP."
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)

print(f"Tokens: {tokens}")
# e.g. ['Token', 'ization', ' is', ' fundamental', ' to', ' N', 'LP', '.']
# (exact splits vary by tokenizer)

print(f"Token IDs: {token_ids}")
# e.g. [12345, 67890, 456, 78901, 234, 56, 789, 12] (illustrative values)

print(f"Token count: {len(token_ids)}")
# 8 content tokens; encode() may also prepend special tokens such as BOS

Understanding Token Counts

# OpenAI's tiktoken for counting
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "LLMs process text through tokens."

tokens = enc.encode(text)
print(f"Characters: {len(text)}, Tokens: {len(tokens)}")
# Characters: 33, Tokens: 7 (exact count depends on the encoding)

# Rule of thumb: ~1 token ≈ 4 characters ≈ 0.75 words (English)

Vocabulary Size Trade-offs

| Vocabulary Size | Pros | Cons |
| --- | --- | --- |
| Small (10K–30K) | Smaller embedding matrix, less memory | More tokens per text, longer sequences |
| Medium (30K–100K) | Balanced sequence length and model size for most languages | Rare words still split into many pieces |
| Large (100K–200K+) | Fewer tokens per text, better for multilingual | Larger embedding matrix, more memory |

Embeddings

After tokenization, each token ID is mapped to a dense vector through an embedding lookup.

The Embedding Layer

import torch
import torch.nn as nn

vocab_size = 50257      # GPT-2 vocabulary size
embedding_dim = 768     # GPT-2 hidden dimension

embedding = nn.Embedding(vocab_size, embedding_dim)

# Token IDs for "Hello world"
token_ids = torch.tensor([15496, 995])

# Lookup embeddings
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([2, 768])

Properties of Learned Embeddings

Embeddings capture semantic relationships through their geometry:

# Conceptual examples (not actual model output)
king   - man    + woman   ≈ queen
paris  - france + germany ≈ berlin
walked - walk   + swim    ≈ swam

These relationships emerge because words appearing in similar contexts get mapped to nearby points in the embedding space.
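
This geometry can be checked directly with cosine similarity. In the sketch below the four 2-D vectors are hand-picked so the analogy works out exactly; real embeddings are learned and have hundreds of dimensions:

```python
import numpy as np

# Toy 2-D "embeddings" chosen by hand to illustrate the analogy geometry
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
}

def nearest(vec, emb, exclude):
    """Return the word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

# king - man + woman lands at [0.9, 0.2], i.e. on "queen"
target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, emb, exclude={"king", "man", "woman"}))  # queen
```

Excluding the three input words mirrors the standard word-analogy evaluation protocol, which otherwise tends to return one of the inputs.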

Positional Encodings

Since Transformers have no inherent notion of word order, positional encodings are added to token embeddings:

import torch
import math

def sinusoidal_positional_encoding(seq_length, d_model):
    """Generate sinusoidal positional encodings (original Transformer)."""
    position = torch.arange(0, seq_length).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * 
                        -(math.log(10000.0) / d_model))
    
    pe = torch.zeros(seq_length, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Modern models often use learned positional embeddings instead
import torch.nn as nn

max_seq_length, d_model = 2048, 768   # example sizes
learned_pos = nn.Embedding(max_seq_length, d_model)

| Positional Encoding | Description | Used By |
| --- | --- | --- |
| Sinusoidal | Fixed sine/cosine patterns | Original Transformer |
| Learned | Trained position embeddings | GPT-2, BERT |
| RoPE | Rotary Position Embedding | Llama, PaLM, most modern models |
| ALiBi | Attention with Linear Biases | MPT, some efficient models |

RoPE (Rotary Position Embedding)

RoPE encodes position through rotation matrices, enabling better length extrapolation:

# RoPE intuition: position is encoded as rotation angle
# token at position 0: [cos(0), sin(0)] = [1, 0]
# token at position 1: [cos(θ), sin(θ)]
# token at position 2: [cos(2θ), sin(2θ)]
# This allows the model to generalize to longer sequences
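
A useful property follows from the rotation: the dot product between a rotated query and key depends only on their relative offset, not their absolute positions. A single-frequency numeric sketch (real RoPE applies a different frequency to each feature pair across the head dimension):

```python
import numpy as np

def rotate_pairs(x, pos, theta=0.5):
    """Rotate each consecutive 2-D feature pair of x by the angle pos * theta."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = c * x1 - s * x2
    out[1::2] = s * x1 + c * x2
    return out

q = np.array([1.0, 0.0, 0.5, 0.5])
k = np.array([0.2, 0.7, 1.0, 0.1])

# Same relative offset (2) at different absolute positions gives the same score
a = rotate_pairs(q, pos=5) @ rotate_pairs(k, pos=3)
b = rotate_pairs(q, pos=9) @ rotate_pairs(k, pos=7)
print(np.isclose(a, b))  # True
```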

Standalone Embedding Models

Beyond LLM token embeddings, specialized embedding models encode entire texts into fixed-size vectors for semantic search:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "Large language models process text through tokens",
    "Neural networks learn representations from data",
    "The weather today is quite pleasant"
]

embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}")  # (3, 384)

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
sim = cosine_similarity([embeddings[0]], [embeddings[1]])
print(f"Similarity (related topics): {sim[0][0]:.3f}")  # ~0.65
sim = cosine_similarity([embeddings[0]], [embeddings[2]])
print(f"Similarity (unrelated): {sim[0][0]:.3f}")       # ~0.15

| Model | Dimensions | Speed | Best For |
| --- | --- | --- | --- |
| text-embedding-3-small (OpenAI) | 1536 | Fast (API) | General purpose |
| text-embedding-3-large (OpenAI) | 3072 | Fast (API) | High accuracy |
| all-MiniLM-L6-v2 | 384 | Very fast | Local semantic search |
| bge-large-en-v1.5 | 1024 | Medium | Retrieval for RAG |
| nomic-embed-text | 768 | Fast | Open-source alternative |
| GTE-Qwen2-7B-instruct | 3584 | Slower | Instruction-aware embedding |

Key Takeaways

  • Tokenization breaks text into subword units; the choice of algorithm affects how the model handles rare words and multilingual text
  • Embeddings map tokens to dense vectors where semantic similarity corresponds to geometric proximity
  • Positional encodings inject sequence order information; RoPE is the modern standard
  • Standalone embedding models are essential for semantic search and RAG pipelines
  • Token counts directly impact cost, latency, and context window usage — always measure tokens, not characters
