NVIDIA
RTX 4090 24GB
Practical deployment guidance for this machine, using conservative fit estimates instead of marketing-style claims.
- Memory: 24 GB VRAM
- System RAM: 64 GB
- Bandwidth: 1008 GB/s
- Preferred backend: Ollama
Hardware notes
- A strong local tier for quantized inference of 14B through 32B models.
- Long context windows and concurrent requests can still change what fits dramatically.
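The fit estimates below can be approximated from first principles: quantized weight size plus an FP16 KV cache that grows linearly with context. A minimal sketch, assuming a Q4-class GGUF averages roughly 0.56 bytes per parameter and using an illustrative 8B-style model shape (32 layers, 8 KV heads, head dim 128); the profile's published totals also include runtime overhead, so they run higher:

```python
def estimate_vram_gb(params_b, n_layers, n_kv_heads, head_dim,
                     context_len, bytes_per_weight=0.56, kv_bytes=2):
    """Rough VRAM estimate: quantized weights plus FP16 KV cache.

    bytes_per_weight ~0.56 approximates a Q4-class GGUF (an assumption;
    real files vary by quant recipe). kv_bytes=2 assumes FP16 KV entries.
    """
    weights_gb = params_b * 1e9 * bytes_per_weight / 1024**3
    # KV cache: 2 tensors (K and V) per layer, per token.
    kv_gb = (2 * n_layers * n_kv_heads * head_dim
             * context_len * kv_bytes) / 1024**3
    return weights_gb + kv_gb

# Llama-3.1-8B-style shape (assumed: 32 layers, 8 KV heads, head_dim 128)
print(round(estimate_vram_gb(8.0, 32, 8, 128, 8192), 2))  # ≈ 5.17
```

Doubling the context to 16k doubles only the KV term, which is why long context changes the fit story faster than intuition suggests.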
Comfortable fits
- Llama 3.1 8B Instruct: Q4 at 8k context, estimated at 6.69 GB total and 90-150 tok/s.
- Phi-4: Q4 at 8k context, estimated at 10.92 GB total and 61-103 tok/s.
- Qwen2.5 Coder 14B Instruct: Q4 at 8k context, estimated at 11.33 GB total and 58-98 tok/s.
Borderline fits
- Gemma 3 27B IT: expected to fit, but headroom is limited. Keep background concurrency low and avoid stretching the context further without re-checking the estimate.
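A borderline fit comes down to headroom: VRAM left after the model, minus a reserve for the runtime and driver. A minimal sketch, where the 1.5 GB reserve and the 20 GB figure for a 27B Q4 model at 8k context are both illustrative assumptions, not measured values:

```python
VRAM_GB = 24.0  # RTX 4090

def headroom_gb(model_gb, reserve_gb=1.5):
    """Free VRAM after loading a model, minus a reserve for the
    runtime/driver (reserve_gb is an assumption, not a measured value)."""
    return VRAM_GB - model_gb - reserve_gb

# A 27B-class model at Q4 with an 8k context might land near 20 GB total
# (illustrative figure only).
print(f"{headroom_gb(20.0):.1f} GB headroom")  # prints "2.5 GB headroom"
```

With only a couple of gigabytes free, one extra concurrent request or a longer context can tip the model into host-memory spillover, which is why the profile recommends re-checking the estimate before stretching either.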
Backend support
These are the runtimes currently associated with this hardware profile.
- Ollama
- llama.cpp
- LM Studio
- ExLlama
- vLLM
Use the live calculator
The calculator lets you change context length, runtime, quantization, and concurrency instead of relying on a fixed profile.
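The calculator's core trade-offs can be sketched in a few lines: weights are fixed once a quantization is chosen, while the KV cache scales with both context length and concurrency. A simplified sketch, where the per-1k-token KV cost is an assumed figure for an 8B-class model at FP16, not a value taken from the calculator:

```python
KV_GB_PER_1K_TOKENS = 0.125  # assumed FP16 KV cost for an 8B-class model

def total_gb(weights_gb, context_k, concurrency=1):
    """Weights plus KV cache, scaled by context (in thousands of
    tokens) and the number of parallel requests."""
    return weights_gb + KV_GB_PER_1K_TOKENS * context_k * concurrency

# How an assumed 4.2 GB Q4 model grows with context at concurrency 2:
for ctx_k in (8, 16, 32):
    print(ctx_k, round(total_gb(4.2, ctx_k, concurrency=2), 2))
```

The same sweep run at higher concurrency or a larger quantization shows why a model that fits comfortably in a fixed profile can stop fitting once the workload changes.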