Training Lab

See how an LLM changes during training, fine-tuning, and inference

Explore a visual simulation of weight fields, optimizer state, activations, KV cache growth, and parameter scaling. The goal is to make the mechanics legible: what the network is storing, what changes during updates, and why bigger models physically cost more to train and serve.

Interactive stages

What an LLM is doing at each phase

This is a conceptual simulation, not a literal dump of a production model. It shows how weights, gradients, activations, and cache behavior change from pretraining to fine-tuning to inference.

Model state

Random-to-structured

The network begins with random weights. Each batch nudges billions of values until token prediction becomes statistically useful.

Primary runtime truth

Backward pass updates almost every trainable weight.

Optimizer state and gradients dominate memory.

Weight field simulation

7B model · Pretraining

Step 28

Loss

4.5

Lower is usually better.

Update strength

69%

How hard weights are moving.

Attention spread

37%

How broad the active context is.

Parameter scale

7B parameters

Moving this slider increases the weight count, memory residency, and compute pressure.

Sequence length

4,096 tokens

Longer contexts increase activation load and KV cache growth.

Training progress

28%

Earlier steps look noisy; later steps look more structured.

Weight precision

FP16

Precision sets the bytes stored per weight, and therefore how much memory the model occupies at inference time.

Estimated layers

48

A rough scaling proxy.

Hidden size

3,904

Wider models carry larger matrices.

Attention heads

31

More heads raise attention routing capacity.

Weights in memory

14 GB

FP16 residency only.

Training-only state

70 GB

Optimizer + gradients.

Current working set

87.1 GB

Weights + activations + cache + updates.
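The three memory figures above fall out of simple per-parameter byte counts. A minimal sketch, assuming FP16 weights and gradients (2 bytes each) plus FP32 Adam moments (4 + 4 bytes); these byte counts are assumptions chosen to match the panel's 14 GB and 70 GB figures for a 7B model:

```python
# Rough training-memory accounting for a dense transformer.
# Assumption: FP16 weights and gradients, FP32 Adam first/second moments.
GB = 1e9  # decimal gigabytes, matching the panel's figures

def training_memory(params: float) -> dict:
    weights = params * 2 / GB      # FP16 weights: 2 bytes/param
    grads = params * 2 / GB        # FP16 gradients: 2 bytes/param
    adam = params * (4 + 4) / GB   # FP32 Adam m and v: 8 bytes/param
    return {
        "weights_gb": weights,
        "training_only_gb": grads + adam,  # optimizer + gradients
    }

print(training_memory(7e9))
# {'weights_gb': 14.0, 'training_only_gb': 70.0}
```

This is why the training-only state dwarfs the weights themselves: gradients plus Adam state cost 10 bytes per parameter against 2 bytes for the FP16 weights.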

Parameter growth

How bigger models physically change

As parameter count rises, you are not just getting “more intelligence.” You are increasing weight memory, optimizer state, training throughput requirements, and often the minimum useful hardware envelope.

1B · 2 GB weights · Inference footprint: 2.6 GB · Training footprint: 12 GB

3B · 6 GB weights · Inference footprint: 7.7 GB · Training footprint: 36 GB

7B · 14 GB weights · Inference footprint: 17.9 GB · Training footprint: 84 GB

13B · 26 GB weights · Inference footprint: 33.2 GB · Training footprint: 156 GB

34B · 68 GB weights · Inference footprint: 86.7 GB · Training footprint: 408 GB

70B · 140 GB weights · Inference footprint: 178.5 GB · Training footprint: 840 GB

180B · 360 GB weights · Inference footprint: 459 GB · Training footprint: 2,160 GB

405B · 810 GB weights · Inference footprint: 1,032.8 GB · Training footprint: 4,860 GB
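The weights and training columns above follow a fixed pattern. A quick sketch, assuming FP16 weights (2 GB per billion parameters) and a training footprint of 6× the weight bytes (FP16 weights + FP16 gradients + FP32 Adam state); the inference column adds a runtime overhead on top of weights that this sketch does not model:

```python
# Reproduce the weights and training columns of the scaling table.
def footprints(billions: int) -> tuple:
    weights_gb = 2 * billions      # FP16: 2 bytes/param
    training_gb = 6 * weights_gb   # weights + grads + Adam state
    return weights_gb, training_gb

for b in [1, 3, 7, 13, 34, 70, 180, 405]:
    w, t = footprints(b)
    print(f"{b}B: {w} GB weights, {t} GB training")
```

The 6× multiplier is the same 12-bytes-per-parameter accounting as above, expressed relative to the 2-byte weights.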

Live explanation

What changes between stages

Pretraining

At 7B parameters, pretraining usually means huge token volume, large optimizer state, and a network that is still learning generic statistical structure. Approximate data exposure here: 210B tokens.
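Token exposure also fixes a rough compute bill. A back-of-envelope sketch using the common C ≈ 6·N·D approximation (6 FLOPs per parameter per token; the constant is an assumption, not something the demo states):

```python
# Approximate pretraining compute: C ≈ 6 * params * tokens.
params = 7e9    # 7B parameters
tokens = 210e9  # 210B tokens of data exposure

flops = 6 * params * tokens
print(f"{flops:.2e} FLOPs")  # 8.82e+21 FLOPs
```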

Fine-tuning

Fine-tuning does not rebuild the model from zero. It nudges the existing weights toward a narrower behavior. The model already has broad language priors, so the update field becomes more targeted.

Inference

Inference freezes the weights. The model size still matters because those weights must stay resident, but now the fast-growing object is the KV cache, which expands with context length and concurrency.
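The cache's growth is easy to size. A sketch of the standard full-attention estimate, fed with this panel's numbers (48 layers, hidden size 3,904, 4,096 tokens, FP16) as assumed inputs; real models with grouped-query attention cache far fewer key/value channels, so treat this as an upper bound rather than the simulation's exact figure:

```python
# KV cache size: one key and one value vector per layer per token.
def kv_cache_gb(layers, hidden, seq_len, bytes_per_elem=2, batch=1):
    # 2 tensors (K and V) * layers * tokens * hidden * element size
    return 2 * layers * seq_len * hidden * bytes_per_elem * batch / 1e9

print(kv_cache_gb(layers=48, hidden=3904, seq_len=4096))  # ~3.07 GB
```

Note the linear terms: doubling context length or batch concurrency doubles the cache, which is exactly why it is the fast-growing object at serving time.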

Parameters

A “parameter” is just a learned scalar in the network. When you move from 7B to 70B, you are scaling the number of learned values by 10x, which typically pushes training state and serving memory up by a similar factor.
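You can also go the other way, from width and depth to a parameter count. A rough sketch using the common 12·L·d² approximation for a dense transformer's attention and MLP weights (embeddings excluded; the constant is an assumption, not taken from the demo):

```python
# params ≈ 12 * layers * hidden^2 for a dense transformer stack
# (4*d^2 for attention projections + 8*d^2 for a 4x-wide MLP).
def approx_params(layers: int, hidden: int) -> int:
    return 12 * layers * hidden ** 2

print(f"{approx_params(48, 3904) / 1e9:.1f}B")  # 8.8B, same order as 7B
```

The quadratic dependence on hidden size is why “wider models carry larger matrices”: width multiplies into the parameter count twice.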

Current estimate

Token exposure

210B

Activation load

2.4 GB

KV cache

0.7 GB

Trainable update load

70 GB
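The “Current working set” figure on the panel is simply these pieces plus the resident weights. A sketch that adds them up, using the panel's numbers as assumed inputs:

```python
# Working set = weights + activations + KV cache + update state
# (the panel's own breakdown: "Weights + activations + cache + updates").
weights_gb = 14.0     # FP16 weights in memory
activations_gb = 2.4  # activation load
kv_cache_gb = 0.7     # KV cache
update_gb = 70.0      # trainable update load (optimizer + gradients)

working_set = weights_gb + activations_gb + kv_cache_gb + update_gb
print(f"{working_set:.1f} GB")  # 87.1 GB
```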