Fundamentals
Training Data and Curation
How LLMs are trained on massive datasets — data sources, cleaning pipelines, deduplication, and the evolution of training corpora
Published: 2026-04-04 · Last updated: 2026-04-13
Training Data and Curation
The performance of an LLM is determined not just by its architecture and parameter count, but fundamentally by the quality, diversity, and scale of its training data. This guide explores how training datasets are assembled, cleaned, and curated.
The Data Pipeline
Raw Sources
│
├── Web Crawls (Common Crawl, etc.)
├── Books (Project Gutenberg, licensed)
├── Code (GitHub, StackExchange)
├── Academic Papers (arXiv, PubMed)
└── Curated Content (Wikipedia, etc.)
│
▼
Filtering & Cleaning
│
├── Language Identification
├── Quality Scoring
├── PII Removal
├── Deduplication
└── Toxicity Filtering
│
▼
Tokenization
│
▼
Training
Major Data Sources
Web Crawls
| Dataset | Size | Description |
|---|---|---|
| Common Crawl | 250B+ tokens | Monthly web crawls; raw, noisy |
| C4 (Colossal Clean Crawled) | 156B tokens | Cleaned Common Crawl (T5 training) |
| RefinedWeb | 600B+ tokens | High-quality filtered Common Crawl (Falcon) |
| FineWeb | 15T+ tokens | Aggregated and deduplicated web corpus |
| DCLM | 2.8T tokens | Domain-classified and filtered |
Code
| Dataset | Languages | Tokens |
|---|---|---|
| The Stack | 350+ | 3.1T |
| StarCoderData | 100+ | 0.5T |
| CodeParrot | Python | 180B |
| SlimPajama | Multi | 627B (code subset) |
Books and Long-form
| Dataset | Books | Notes |
|---|---|---|
| Books3 | ~200K | Controversial provenance |
| Project Gutenberg | 70K+ | Public domain |
| OpenWebText2 | — | Reddit-upvoted links |
Multilingual
| Dataset | Languages | Notes |
|---|---|---|
| mC4 | 101 | Multilingual C4 |
| Wikipedia dumps | 300+ | Varies in quality |
| CulturaX | 167 | Cleaned multilingual |
Data Cleaning Pipeline
1. Language Identification
from fasttext import load_model
model = load_model("lid.176.bin")
def filter_by_language(text, target_lang="en", threshold=0.8):
predictions = model.predict(text.replace("\n", " "), k=1)
lang, confidence = predictions[0][0], predictions[1][0]
return lang == f"__label__{target_lang}" and confidence > threshold
2. Quality Filtering
Heuristics-based quality scoring:
def quality_score(text: str) -> float:
score = 0.0
# Penalize very short texts
if len(text) < 100:
score -= 2.0
# Penalize high punctuation density
punct_ratio = text.count(".") / max(len(text.split()), 1)
if punct_ratio > 0.3:
score -= 1.0
# Reward proper sentence endings
if text.strip().endswith(('.', '!', '?')):
score += 0.5
# Penalize boilerplate patterns
boilerplate = ["click here", "subscribe now", "cookie policy"]
for phrase in boilerplate:
if phrase in text.lower():
score -= 0.5
# Reward vocabulary diversity
words = text.split()
if len(words) > 0:
vocab_richness = len(set(words)) / len(words)
score += vocab_richness
return score
Model-based quality scoring uses a small classifier trained on high-quality seed data (Wikipedia, books) to score each document.
3. Deduplication
Duplicates in training data cause models to memorize rather than learn, and inflate evaluation scores.
# Exact deduplication with hashing
import hashlib
def deduplicate_documents(documents: list[str]) -> list[str]:
seen = set()
unique = []
for doc in documents:
doc_hash = hashlib.sha256(doc.encode()).hexdigest()
if doc_hash not in seen:
seen.add(doc_hash)
unique.append(doc)
return unique
# Fuzzy deduplication with MinHash + LSH (for near-duplicates)
from datasketch import MinHash, MinHashLSH
# More scalable: Bloom filters + n-gram hashing
# Used by: RefinedWeb, FineWeb, DCLM
4. PII and Sensitive Data Removal
import re
def remove_pii(text: str) -> str:
# Email addresses
text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', '[EMAIL]', text)
# Phone numbers (US format)
text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
# SSNs
text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
# IP addresses
text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '[IP]', text)
return text
Training Data Composition
Typical composition for a strong general-purpose model:
| Category | Percentage | Purpose |
|---|---|---|
| Web text | 50–70% | General knowledge, language patterns |
| Code | 5–15% | Logical reasoning, structured thinking |
| Books | 5–10% | Long-form coherence, narrative |
| Academic/Scientific | 5–10% | Technical knowledge, factual accuracy |
| Conversational | 3–8% | Dialogue ability, instruction following |
| Multilingual | 5–15% | Cross-lingual ability |
The Scaling Law for Data
Research has established scaling laws that relate model performance to data size:
Loss(N, D) = E + A/N^α + B/D^β
Where:
N = model parameters
D = training tokens
E, A, B, α, β = fitted constants
Key insight from Chinchilla (2022): model size and training data should scale proportionally. A 70B model needs ~1.4T tokens for optimal training, not the ~300B used for GPT-3 (175B).
Modern Training Recommendations
| Model Size | Recommended Training Tokens |
|---|---|
| 1B | 20B |
| 7B | 140B |
| 13B | 260B |
| 70B | 1.4T |
| 400B+ | 8T+ |
Data Contamination and Benchmark Leakage
When training data contains examples of benchmark test questions, evaluation scores become inflated. Modern datasets address this with:
- N-gram overlap filtering against known benchmarks
- Deduplication at the document and near-duplicate level
- Benchmark-aware evaluation reporting both contaminated and decontaminated scores
Open Training Datasets
For researchers and teams who want transparency:
| Dataset | Size | License | Notes |
|---|---|---|---|
| Dolma | 3T | ODC-BY | AI2's open corpus |
| FineWeb | 15T | MIT-like | Hugging Face curated |
| FineWeb-Edu | 1.3T | MIT-like | Educational subset |
| RefinedWeb | 600B | Open | Falcon's data |
| DCLM | 2.8T | Various | Domain-classified |
| OLMo Corpus | 3T | Apache 2.0 | Fully open training data |
Key Takeaways
- Training data quality matters as much as model architecture
- Deduplication and filtering are critical for preventing memorization
- The Chinchilla scaling law recommends proportional scaling of parameters and data
- Benchmark contamination is a real problem addressed through careful filtering
- Open datasets (Dolma, FineWeb, OLMo corpus) enable reproducible research
Related Documentation
- Model Training and Pre-training — The training process itself
- Tokenization and Embeddings — How data is converted to model input
- Evaluation Metrics — How we measure whether data quality matters