NVIDIA RTX 4090 24GB

Practical deployment guidance for this machine, using conservative fit estimates instead of marketing-style claims.

Memory: 24 GB VRAM
System RAM: 64 GB
Bandwidth: 1008 GB/s
Preferred backend: Ollama

Hardware notes

  • A strong local tier for 14B through 32B quantized inference.
  • Long context and concurrency still change the fit story dramatically.
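The second note above can be made concrete: KV-cache memory grows linearly with both context length and the number of concurrent sequences. A minimal sketch, assuming a Llama-style model shape (32 layers, 8 KV heads, head dim 128, fp16 cache); these constants are illustrative assumptions, not this page's exact formula:

```python
GIB = 1024 ** 3

def kv_cache_gib(context_len, concurrency=1, n_layers=32, n_kv_heads=8,
                 head_dim=128, kv_bytes=2):
    """Approximate KV-cache size in GiB: K and V tensors per layer,
    per concurrent sequence, at kv_bytes per element (2 = fp16)."""
    return (2 * n_layers * n_kv_heads * head_dim * context_len
            * kv_bytes * concurrency) / GIB

# Context length and concurrency multiply: 8k at 1 stream is ~1 GiB,
# but 128k at 4 streams is ~64 GiB, far past this card's 24 GB.
for ctx in (8192, 32768, 131072):
    print(ctx, kv_cache_gib(ctx), kv_cache_gib(ctx, concurrency=4))
```

This is why a model that fits comfortably at 8k can stop fitting at long context or under concurrency, even though the weights themselves never change size.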

Comfortable fits

Llama 3.1 8B Instruct (comfortable)

Q4 at 8k context is estimated at 6.69 GB total and 90-150 tok/s.

Phi-4 (comfortable)

Q4 at 8k context is estimated at 10.92 GB total and 61-103 tok/s.

Qwen2.5 Coder 14B Instruct (comfortable)

Q4 at 8k context is estimated at 11.33 GB total and 58-98 tok/s.

Borderline fits

Gemma 3 27B IT (borderline)

This should fit, but headroom is limited. Keep background concurrency low, and re-check the estimate before stretching the context further.
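The per-model figures above can be approximated as quantized weights plus KV cache plus a runtime overhead buffer. A minimal sketch, assuming Q4_K_M at roughly 4.5 bits per weight and Llama 3.1 8B's published shape (32 layers, 8 KV heads, head dim 128); the overhead constant is a guess, so the result lands in the same ballpark as the listed 6.69 GB rather than matching it exactly:

```python
GIB = 1024 ** 3

def estimate_vram_gib(params_b, bits_per_weight, n_layers, n_kv_heads,
                      head_dim, context_len, kv_bytes=2, overhead_gib=0.6):
    """Rough total VRAM in GiB for a quantized model at a given context."""
    weights = params_b * 1e9 * bits_per_weight / 8              # quantized weights
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes
    return (weights + kv_cache) / GIB + overhead_gib

# Llama 3.1 8B at ~4.5 bits/weight, 8k context, fp16 KV cache
est = estimate_vram_gib(8.0, 4.5, n_layers=32, n_kv_heads=8,
                        head_dim=128, context_len=8192)
print(f"{est:.2f} GiB")
```

Doubling the context only grows the KV-cache term, which is why the comfortable fits stay comfortable at 8k but need re-checking at longer contexts.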

Backend support

These are the runtimes currently associated with this hardware profile.

Ollama, llama.cpp, LM Studio, ExLlama, vLLM

Use the live calculator

The calculator lets you change context length, runtime, quantization, and concurrency instead of relying on a fixed profile.
