NVIDIA A100 80GB

Practical deployment guidance for this machine, using conservative fit estimates instead of marketing-style claims.

Memory: 80 GB VRAM
System RAM: 256 GB
Bandwidth: 1935 GB/s
Preferred backend: vLLM
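Because autoregressive decode is memory-bandwidth bound, the bandwidth figure above gives a rough ceiling on single-stream generation speed: every generated token has to stream the full weight set out of VRAM. A minimal sketch of that rule of thumb (the 4.5 GB weight size for an 8B-class Q4 model is an assumed illustration, not a profiled value):

```python
def decode_ceiling_toks(bandwidth_gbs: float, weight_gb: float) -> float:
    """Upper bound on single-stream decode speed: each generated token
    must read all weights from VRAM once, so tok/s <= bandwidth / size."""
    return bandwidth_gbs / weight_gb

# A100 80GB profile bandwidth; ~4.5 GB assumed for an 8B model at Q4
ceiling = decode_ceiling_toks(1935, 4.5)
print(f"{ceiling:.0f} tok/s ceiling")  # 430 tok/s ceiling
```

Real throughput lands well below this ceiling (the profile estimates 90-150 tok/s) because KV-cache reads, kernel launch overhead, and compute-bound prefill all eat into the budget; the ceiling is useful mainly for sanity-checking whether a measured number is plausible.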

Hardware notes

  • Server-class accelerator with enough headroom for serious 70B deployments.
  • Designed for higher concurrency and more stable long-context serving.

Comfortable fits

  • Llama 3.1 8B Instruct (comfortable): Q4 at 8k context, estimated 6.44 GB total, 90-150 tok/s.
  • Phi-4 (comfortable): Q4 at 8k context, estimated 10.86 GB total, 90-150 tok/s.
  • Qwen2.5 Coder 14B Instruct (comfortable): Q4 at 8k context, estimated 11.3 GB total, 90-150 tok/s.
  • Gemma 3 27B IT (comfortable): Q4 at 8k context, estimated 20.6 GB total, 73-123 tok/s.
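Estimates like these decompose into two main terms: quantized weights plus the KV cache for the chosen context length (the calculator adds runtime overhead on top). A simplified sketch of that decomposition, using assumed Llama-3.1-8B-style config values (32 layers, 8 KV heads, head dim 128, ~4.5 effective bits per weight for Q4):

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """fp16 K and V tensors, one pair per layer, per token of context."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Quantized weight footprint; Q4 variants land near 4.5 bits/weight."""
    return n_params * bits_per_weight / 8

# Assumed config for an 8B-class model at 8k context
w = weight_bytes(8.03e9, 4.5)
kv = kv_cache_bytes(32, 8, 128, 8192)
print(f"weights ≈ {w / 2**30:.2f} GiB, KV ≈ {kv / 2**30:.2f} GiB, "
      f"total ≈ {(w + kv) / 2**30:.2f} GiB before runtime overhead")
```

The sum here is smaller than the profile's 6.44 GB figure because allocator overhead, activation buffers, and CUDA context are excluded; those terms are what the calculator folds in.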

Borderline fits

No profiled model lands in the borderline range on this card: each one either fits comfortably or does not fit at all.

Backend support

These are the runtimes currently associated with this hardware profile.

  • vLLM
  • llama.cpp
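As a sketch, serving one of the comfortable fits with the preferred backend might look like the following (the model ID and flag values are assumptions to adapt, not settings validated on this machine):

```shell
# Launch an OpenAI-compatible vLLM server for an 8k-context deployment.
# Model ID and flag values are illustrative; adjust to your environment.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 64
```

llama.cpp remains the simpler choice for single-user workloads; vLLM's continuous batching is what makes the higher-concurrency serving this card is suited for worthwhile.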

Use the live calculator

The calculator lets you change context length, runtime, quantization, and concurrency instead of relying on a fixed profile.
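For the concurrency knob specifically, the binding constraint is usually the KV-cache space left over after weights and runtime overhead. A rough sketch of that calculation, using assumed figures (not calculator output) for an 8B-class Q4 model with fp16 KV cache at roughly 1 GiB per full 8k-token sequence:

```python
def max_concurrency(vram_gib: float, weights_gib: float,
                    overhead_gib: float, kv_per_seq_gib: float) -> int:
    """How many full-context sequences fit in the remaining KV budget."""
    budget = vram_gib - weights_gib - overhead_gib
    return int(budget // kv_per_seq_gib)

# 80 GiB card, ~4.2 GiB of Q4 8B weights, ~4 GiB runtime overhead (assumed),
# ~1 GiB of fp16 KV cache per 8k-token sequence
print(max_concurrency(80, 4.2, 4, 1.0))  # 71
```

In practice schedulers rarely hold every sequence at full context simultaneously, so effective concurrency can be higher; the calculator lets you explore that trade-off directly.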
