NVIDIA RTX 4090 24GB

Practical deployment guidance for this machine, using conservative fit estimates instead of marketing-style claims.

Memory: 24 GB VRAM
System RAM: 64 GB
Bandwidth: 1008 GB/s
Preferred backend: Ollama

Hardware notes

  • A strong local tier for 14B through 32B quantized inference.
  • Long context and concurrency still change the fit story dramatically.
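The second note above can be made concrete: KV-cache memory grows linearly with both context length and the number of concurrent sequences. A minimal sketch, assuming a Llama-style model shape (32 layers, 8 KV heads, head dim 128, fp16 cache); these constants are illustrative assumptions, not this page's exact formula:

```python
GIB = 1024 ** 3

def kv_cache_gib(context_len, concurrency=1, n_layers=32, n_kv_heads=8,
                 head_dim=128, kv_bytes=2):
    """Approximate KV-cache size in GiB: K and V tensors per layer,
    per concurrent sequence, at kv_bytes per element (2 = fp16)."""
    return (2 * n_layers * n_kv_heads * head_dim * context_len
            * kv_bytes * concurrency) / GIB

# Context length and concurrency multiply: 8k at 1 stream is ~1 GiB,
# but 128k at 4 streams is ~64 GiB, far past this card's 24 GB.
for ctx in (8192, 32768, 131072):
    print(ctx, kv_cache_gib(ctx), kv_cache_gib(ctx, concurrency=4))
```

This is why a model that fits comfortably at 8k can stop fitting at long context or under concurrency, even though the weights themselves never change size.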

Comfortable fits

Llama 3.1 8B Instruct (comfortable)

Q4 at 8k context is estimated at 6.69 GB total and 90-150 tok/s.

Phi-4 (comfortable)

Q4 at 8k context is estimated at 10.92 GB total and 61-103 tok/s.

Qwen2.5 Coder 14B Instruct (comfortable)

Q4 at 8k context is estimated at 11.33 GB total and 58-98 tok/s.

Borderline fits

Gemma 3 27B IT (borderline)

This should fit, but headroom is limited. Keep background concurrency low, and re-check the estimate before stretching the context further.
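The per-model figures above can be approximated as quantized weights plus KV cache plus a runtime overhead buffer. A minimal sketch, assuming Q4_K_M at roughly 4.5 bits per weight and Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128); the overhead constant is a guess, so the result lands in the same ballpark as the listed 6.69 GB rather than matching it exactly:

```python
GIB = 1024 ** 3

def estimate_vram_gib(params_b, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, context_len, kv_bytes=2, overhead_gib=0.6):
    """Rough total VRAM in GiB for a quantized model at a given context."""
    weights = params_b * 1e9 * bits_per_weight / 8              # quantized weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / GIB + overhead_gib

# Llama 3.1 8B at ~4.5 bits/weight, 8k context, fp16 KV cache
est = estimate_vram_gib(8.0, 4.5, n_layers=32, n_kv_heads=8,
                        head_dim=128, context_len=8192)
print(f"{est:.2f} GiB")
```

Doubling the context only grows the KV-cache term, which is why the comfortable fits stay comfortable at 8k but need re-checking at longer contexts.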

Backend support

These are the runtimes currently associated with this hardware profile.

Ollama, llama.cpp, LM Studio, ExLlama, vLLM

Use the live calculator

The calculator lets you change context length, runtime, quantization, and concurrency instead of relying on a fixed profile.
