Training Lab
See how an LLM changes during training, fine-tuning, and inference
Explore a visual simulation of weight fields, optimizer state, activations, KV cache growth, and parameter scaling. The goal is to make the mechanics legible: what the network is storing, what changes during updates, and why bigger models physically cost more to train and serve.
Interactive stages
What an LLM is doing at each phase
This is a conceptual simulation, not a literal dump of a production model. It shows how weights, gradients, activations, and cache behavior change from pretraining to fine-tuning to inference.
Model state
Random-to-structured
The network begins with random weights. Each batch nudges billions of values until token prediction becomes statistically useful.
Primary runtime truth
Every optimizer step updates almost every trainable weight.
Optimizer state and gradients dominate memory.
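A back-of-envelope sketch of why optimizer state dominates. The byte counts are an assumption chosen to match this panel's numbers (FP16 weights and gradients plus FP32 Adam moments); real training setups vary.

```python
# Rough memory model for mixed-precision Adam training.
# Assumed per-parameter byte counts (illustrative, matching the panel):
#   weights   : FP16, 2 bytes
#   gradients : FP16, 2 bytes
#   Adam m, v : FP32, 4 + 4 bytes

def training_state_gb(params_billion: float) -> dict:
    p = params_billion * 1e9
    weights = 2 * p      # FP16 weights
    grads = 2 * p        # FP16 gradients
    optimizer = 8 * p    # Adam first and second moments in FP32
    gb = 1e9             # decimal GB, as the panel uses
    return {
        "weights_gb": weights / gb,
        "training_only_gb": (grads + optimizer) / gb,
    }

print(training_state_gb(7))
# For 7B params: weights 14 GB, training-only state 70 GB
```

Under these assumptions the training-only state is 5x the weight footprint, which is why the cards below show 70 GB of optimizer and gradient memory against 14 GB of weights.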
Weight field simulation
7B model · Pretraining
Loss
4.5
Lower is usually better.
Update strength
69%
How hard weights are moving.
Attention spread
37%
How broad the active context is.
Parameter scale
7B parameters
Moving this slider increases the weight count, memory residency, and compute pressure.
Sequence length
4,096 tokens
Longer contexts increase activation load and KV cache growth.
Training progress
28%
Earlier steps look noisy; later steps look more structured.
Weight precision
FP16
Precision changes how large the model feels in memory at inference time.
Estimated layers
48
A rough scaling proxy.
Hidden size
3,904
Wider models carry larger matrices.
Attention heads
31
More heads raise attention routing capacity.
Weights in memory
14 GB
FP16 residency only.
Training-only state
70 GB
Optimizer + gradients.
Current working set
87.1 GB
Weights + activations + cache + updates.
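The working-set figure is just the sum of the four components shown in the cards above. A minimal sketch using the panel's own numbers:

```python
# Working set = everything that must be resident at once during a
# training step. Component sizes are taken from the panel's 7B example.
weights_gb = 14.0          # FP16 weights (7B params x 2 bytes)
activations_gb = 2.4       # forward activations at this batch/sequence
kv_cache_gb = 0.7          # attention KV cache
training_state_gb = 70.0   # gradients + optimizer state

working_set_gb = weights_gb + activations_gb + kv_cache_gb + training_state_gb
print(round(working_set_gb, 1))  # 87.1
```

Note that at inference time the 70 GB training-only term disappears, which is why serving a model needs far less memory than training it.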
Parameter growth
How bigger models physically change
As parameter count rises, you are not just getting “more intelligence.” You are increasing weight memory, optimizer state, training throughput requirements, and often the minimum useful hardware envelope.
1B · 2 GB weights
3B · 6 GB weights
7B · 14 GB weights
13B · 26 GB weights
34B · 68 GB weights
70B · 140 GB weights
180B · 360 GB weights
405B · 810 GB weights
Live explanation
What changes between stages
Pretraining
At 7B parameters, pretraining usually means huge token volume, large optimizer state, and a network that is still learning generic statistical structure. Approximate data exposure here: 210B tokens.
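The 210B figure is a fixed tokens-per-parameter ratio. A sketch assuming the 30x ratio implied by this page (compute-optimal recipes are often quoted around 20 tokens per parameter):

```python
# Token exposure as a fixed multiple of parameter count.
# The 30x ratio is an assumption read off this page (7B -> 210B tokens).
def token_exposure_billion(params_billion: float, ratio: float = 30) -> float:
    return params_billion * ratio

print(token_exposure_billion(7))  # 210.0
```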
Fine-tuning
Fine-tuning does not rebuild the model from zero. It nudges the existing weights toward a narrower behavior. The model already has broad language priors, so the update field becomes more targeted.
Inference
Inference freezes the weights. The model size still matters because those weights must stay resident, but now the fast-growing object is the KV cache, which expands with context length and concurrency.
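A rough upper bound on KV cache size per sequence, assuming full multi-head attention and this page's illustrative 7B config (48 layers, hidden size 3,904, FP16). These are assumptions, not a real model's layout, and the live panel number is smaller because the cache only fills as tokens arrive:

```python
# KV cache per sequence: 2 tensors (K and V) x layers x hidden x seq_len x bytes.
# Config values are this page's illustrative 7B setup, not a real model.
def kv_cache_gb(layers: int = 48, hidden: int = 3904, seq_len: int = 4096,
                bytes_per_val: int = 2, batch: int = 1) -> float:
    return 2 * layers * hidden * seq_len * bytes_per_val * batch / 1e9

print(round(kv_cache_gb(), 2))  # full-context upper bound, ~3.07 GB
```

Because the formula is linear in both sequence length and batch size, doubling context or concurrency doubles the cache, which is exactly the fast-growing object described above.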
Parameters
A “parameter” is just a learned scalar in the network. When you move from 7B to 70B, you are scaling the number of learned values by 10x, which pushes training state and serving memory up by a similar factor.
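The 10x relationship can be made concrete with a weight-residency sketch. The bytes-per-parameter values are standard precision sizes, not anything specific to this simulation:

```python
# Weight residency scales linearly with parameter count and bytes per value.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weights_gb(params_billion: float, dtype: str = "fp16") -> float:
    return params_billion * BYTES_PER_PARAM[dtype]

print(weights_gb(7))           # 14.0  -> matches the 7B row above
print(weights_gb(70))          # 140.0 -> a 10x parameter jump is a 10x memory jump
print(weights_gb(70, "int4"))  # 35.0  -> quantization shrinks residency
```

This is why the precision control above matters: the same 70B weights that need 140 GB in FP16 fit in 35 GB at 4-bit, at some cost in fidelity.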
Current estimate
Token exposure
210B
Activation load
2.4 GB
KV cache
0.7 GB
Trainable update load
70 GB