
Training Data and Curation

How LLMs are trained on massive datasets — data sources, cleaning pipelines, deduplication, and the evolution of training corpora

Published: 2026-04-04 · Last updated: 2026-04-13


The performance of an LLM is determined not just by its architecture and parameter count, but fundamentally by the quality, diversity, and scale of its training data. This guide explores how training datasets are assembled, cleaned, and curated.

The Data Pipeline

Raw Sources
    │
    ├── Web Crawls (Common Crawl, etc.)
    ├── Books (Project Gutenberg, licensed)
    ├── Code (GitHub, StackExchange)
    ├── Academic Papers (arXiv, PubMed)
    └── Curated Content (Wikipedia, etc.)
         │
         ▼
Filtering & Cleaning
    │
    ├── Language Identification
    ├── Quality Scoring
    ├── PII Removal
    ├── Deduplication
    └── Toxicity Filtering
         │
         ▼
Tokenization
    │
    ▼
Training
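
Conceptually, the pipeline is a composition of per-document filters and transforms. A minimal sketch using the functions defined in the sections below (the 0.0 quality cutoff is an arbitrary illustrative threshold, and toxicity filtering is omitted):

def process(documents: list[str]) -> list[str]:
    # Each stage narrows or rewrites the document set
    docs = [d for d in documents if filter_by_language(d)]   # language ID
    docs = [d for d in docs if quality_score(d) > 0.0]       # quality filter
    docs = [remove_pii(d) for d in docs]                     # PII scrubbing
    return deduplicate_documents(docs)                       # exact dedup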

Major Data Sources

Web Crawls

Dataset       Size          Description
Common Crawl  250B+ tokens  Monthly web crawls; raw and noisy
C4            156B tokens   Colossal Clean Crawled Corpus; cleaned Common Crawl used to train T5
RefinedWeb    600B+ tokens  High-quality filtered Common Crawl (used for Falcon)
FineWeb       15T+ tokens   Aggregated and deduplicated web corpus
DCLM          2.8T tokens   DataComp-LM baseline; classifier-filtered Common Crawl

Code

Dataset        Languages  Tokens
The Stack      350+       3.1T
StarCoderData  100+       0.5T
CodeParrot     Python     180B
SlimPajama     Multi      627B total (includes a code subset)

Books and Long-form

Dataset            Books  Notes
Books3             ~200K  Controversial provenance
Project Gutenberg  70K+   Public domain
OpenWebText2       n/a    Long-form web text from Reddit-upvoted links

Multilingual

Dataset          Languages  Notes
mC4              101        Multilingual C4
Wikipedia dumps  300+       Quality varies widely by language
CulturaX         167        Cleaned multilingual corpus

Data Cleaning Pipeline

1. Language Identification

from fasttext import load_model

# fastText's pretrained 176-language identification model
# (lid.176.bin, downloadable from fasttext.cc)
model = load_model("lid.176.bin")

def filter_by_language(text, target_lang="en", threshold=0.8):
    # fastText expects single-line input, so strip newlines first
    labels, confidences = model.predict(text.replace("\n", " "), k=1)
    lang, confidence = labels[0], confidences[0]
    return lang == f"__label__{target_lang}" and confidence > threshold

2. Quality Filtering

Heuristics-based quality scoring:

def quality_score(text: str) -> float:
    score = 0.0
    
    # Penalize very short texts
    if len(text) < 100:
        score -= 2.0
    
    # Penalize a high period-to-word ratio (fragmented, listy text)
    punct_ratio = text.count(".") / max(len(text.split()), 1)
    if punct_ratio > 0.3:
        score -= 1.0
    
    # Reward proper sentence endings
    if text.strip().endswith(('.', '!', '?')):
        score += 0.5
    
    # Penalize boilerplate patterns
    boilerplate = ["click here", "subscribe now", "cookie policy"]
    for phrase in boilerplate:
        if phrase in text.lower():
            score -= 0.5
    
    # Reward vocabulary diversity
    words = text.split()
    if len(words) > 0:
        vocab_richness = len(set(words)) / len(words)
        score += vocab_richness
    
    return score

Model-based quality scoring uses a small classifier trained on high-quality seed data (Wikipedia, books) to score each document.
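
A minimal sketch of this approach with fastText, assuming a training file `quality_train.txt` whose lines look like `__label__hq <Wikipedia text>` and `__label__lq <raw web text>` (the file name and labels are illustrative):

import fasttext

# Train a binary classifier: high-quality seed data vs. raw web text
clf = fasttext.train_supervised(input="quality_train.txt")

def model_quality_score(text: str) -> float:
    # Probability that the document resembles the high-quality seed data
    labels, probs = clf.predict(text.replace("\n", " "), k=1)
    return probs[0] if labels[0] == "__label__hq" else 1.0 - probs[0]

DCLM's baseline quality filter is a fastText classifier built in much this way, while FineWeb-Edu trains its scorer on LLM-generated quality annotations.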

3. Deduplication

Duplicated training data pushes models toward memorization rather than generalization, and near-duplicates of benchmark items inflate evaluation scores.

# Exact deduplication with hashing
import hashlib

def deduplicate_documents(documents: list[str]) -> list[str]:
    seen = set()
    unique = []
    for doc in documents:
        doc_hash = hashlib.sha256(doc.encode()).hexdigest()
        if doc_hash not in seen:
            seen.add(doc_hash)
            unique.append(doc)
    return unique

# Fuzzy deduplication with MinHash + LSH (for near-duplicates)
from datasketch import MinHash, MinHashLSH

def minhash_signature(doc: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in doc.split():          # hash each token into the signature
        m.update(token.encode("utf8"))
    return m

# More scalable: Bloom filters + n-gram hashing (used by RefinedWeb, FineWeb, DCLM)
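
A near-duplicate filter can then be built on those signatures; a minimal sketch using the `minhash_signature` helper above (the 0.8 Jaccard threshold is an illustrative choice, not a fixed standard):

def near_deduplicate(documents: list[str], threshold: float = 0.8) -> list[str]:
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    unique = []
    for i, doc in enumerate(documents):
        sig = minhash_signature(doc)
        if not lsh.query(sig):         # no previously kept document is this similar
            lsh.insert(str(i), sig)
            unique.append(doc)
    return unique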

4. PII and Sensitive Data Removal

import re

def remove_pii(text: str) -> str:
    # Email addresses
    text = re.sub(r'[\w.+-]+@[\w-]+\.[\w.-]+', '[EMAIL]', text)
    # Phone numbers (US format)
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # SSNs
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    # IP addresses
    text = re.sub(r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b', '[IP]', text)
    return text
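
Example behavior on a synthetic string:

print(remove_pii("Contact jane.doe@example.com or 555-123-4567."))
# Contact [EMAIL] or [PHONE].

Regex rules like these are a coarse first pass; production pipelines typically pair them with learned PII taggers.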

Training Data Composition

Typical composition for a strong general-purpose model:

Category             Percentage  Purpose
Web text             50–70%      General knowledge, language patterns
Code                 5–15%       Logical reasoning, structured thinking
Books                5–10%       Long-form coherence, narrative
Academic/Scientific  5–10%       Technical knowledge, factual accuracy
Conversational       3–8%        Dialogue ability, instruction following
Multilingual         5–15%       Cross-lingual ability
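
In training, these percentages become sampling weights over the source shards. A minimal sketch of weighted source sampling (the exact weights are illustrative values within the ranges above):

import random

# Illustrative weights within the ranges above; they sum to 1.0
MIXTURE = {
    "web": 0.60, "code": 0.10, "books": 0.08,
    "academic": 0.08, "conversational": 0.05, "multilingual": 0.09,
}

def sample_source(rng: random.Random) -> str:
    # Pick the source of the next training document in proportion to its weight
    return rng.choices(list(MIXTURE), weights=list(MIXTURE.values()), k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1   # empirical counts ≈ mixture weights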

The Scaling Law for Data

Research has established scaling laws that relate model performance to data size:

Loss(N, D) = E + A/N^α + B/D^β

Where:
  N = model parameters
  D = training tokens
  E, A, B, α, β = fitted constants

Key insight from Chinchilla (2022): model size and training data should scale proportionally, at roughly 20 training tokens per parameter. A compute-optimal 70B model therefore needs ~1.4T tokens, not the ~300B used to train the 175B-parameter GPT-3.

Modern Training Recommendations

Model Size  Recommended Training Tokens
1B          20B
7B          140B
13B         260B
70B         1.4T
400B+       8T+
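
Every row follows the ~20 tokens-per-parameter heuristic, which is easy to verify in code (the 20:1 ratio is the common rule-of-thumb reading of Chinchilla; production models often train well beyond it):

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    # Compute-optimal token budget under the ~20 tokens/parameter rule of thumb
    return n_params * tokens_per_param

print(f"{chinchilla_tokens(70e9) / 1e12:.1f}T")  # -> 1.4T for a 70B model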

Data Contamination and Benchmark Leakage

When training data contains examples of benchmark test questions, evaluation scores become inflated. Modern datasets address this with:

  1. N-gram overlap filtering against known benchmarks (sketched below)
  2. Deduplication at the document and near-duplicate level
  3. Benchmark-aware evaluation reporting both contaminated and decontaminated scores
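
A minimal sketch of the first item, assuming `benchmark_texts` holds the test-set strings to screen against (13-grams follow GPT-3's contamination check; the exact n varies by pipeline):

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Union of every benchmark 13-gram, built once up front
benchmark_ngrams = set().union(*(ngrams(t) for t in benchmark_texts))

def is_contaminated(doc: str) -> bool:
    # Flag any training document that shares a 13-gram with a benchmark
    return not ngrams(doc).isdisjoint(benchmark_ngrams)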

Open Training Datasets

For researchers and teams who want transparency:

Dataset      Size  License     Notes
Dolma        3T    ODC-BY      AI2's open corpus
FineWeb      15T   ODC-BY      Hugging Face's curated web corpus
FineWeb-Edu  1.3T  ODC-BY      Educational subset of FineWeb
RefinedWeb   600B  ODC-BY      Falcon's public data release
DCLM         2.8T  Various     Classifier-filtered web data
OLMo Corpus  3T    Apache 2.0  Fully open training data
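
All of these are available on the Hugging Face Hub; a sketch using the datasets library, streaming FineWeb rather than materializing it (the repo id is FineWeb's actual Hub id, but check the dataset card for configs and current splits):

from datasets import load_dataset

# Stream the corpus instead of downloading ~15T tokens to disk
fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for doc in fw.take(3):       # .take() requires a recent datasets release
    print(doc["text"][:80])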

Key Takeaways

  • Training data quality matters as much as model architecture
  • Deduplication and filtering are critical for preventing memorization
  • The Chinchilla scaling law recommends proportional scaling of parameters and data
  • Benchmark contamination is a real problem addressed through careful filtering
  • Open datasets (Dolma, FineWeb, OLMo corpus) enable reproducible research
