NVIDIA RTX 3060 12GB
Practical deployment guidance for this machine, using conservative fit estimates instead of marketing-style claims.
- Memory: 12 GB VRAM
- System RAM: 32 GB
- Bandwidth: 360 GB/s
- Preferred backend: Ollama
Hardware notes
- Good entry point for 7B and 8B local inference.
- 14B can work with Q4 and short context, but headroom is limited.
Comfortable fits
Llama 3.1 8B Instruct
Comfortable fit: Q4 at 8k context is estimated at 6.69 GB total and 38-65 tok/s.
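A decode-speed range like 38-65 tok/s follows from the bandwidth figure above: during generation, every token must stream the full set of weights from VRAM once, so memory bandwidth sets a hard ceiling. A minimal sketch (the 4.5 GB weight size for an 8B model at Q4 is an assumption, not a value from this profile):

```python
def decode_tok_s_ceiling(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed for a memory-bandwidth-bound model.

    Each generated token reads all weights from VRAM once, so the
    theoretical ceiling is bandwidth divided by the weight footprint.
    """
    return bandwidth_gb_s / weights_gb

# RTX 3060: 360 GB/s; an 8B model at Q4 is roughly 4.5 GB of weights (assumed).
ceiling = decode_tok_s_ceiling(360, 4.5)  # 80.0 tok/s theoretical ceiling
```

The observed 38-65 tok/s sits below that 80 tok/s ceiling, which is expected: kernel launch overhead, the KV cache, and dequantization cost all eat into the theoretical maximum.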
Borderline fits
Phi-4
Barely fits: expect roughly 0.12 GB of CPU offload or memory pressure; consider a lighter quant or shorter context.
Qwen2.5 Coder 14B Instruct
Barely fits: expect roughly 0.53 GB of CPU offload or memory pressure; consider a lighter quant or shorter context.
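Fit totals like the ones above come from summing quantized weights, the KV cache, and runtime overhead. A back-of-the-envelope sketch, assuming an fp16 KV cache and a flat 0.8 GB of runtime overhead (both assumptions, not values taken from this profile); the Llama 3.1 8B shape (32 layers, 8 KV heads via GQA, head dim 128) is from the public model architecture:

```python
def estimate_vram_gb(params_b: float, bits_per_weight: float, context: int,
                     n_layers: int, n_kv_heads: int, head_dim: int,
                     overhead_gb: float = 0.8) -> float:
    """Rough total VRAM: quantized weights + fp16 KV cache + fixed overhead."""
    weights_gb = params_b * bits_per_weight / 8  # billions of params -> GB
    # KV cache: 2 tensors (K and V) * 2 bytes (fp16), per layer, per position.
    kv_gb = 2 * 2 * n_layers * n_kv_heads * head_dim * context / 1e9
    return weights_gb + kv_gb + overhead_gb

# Llama 3.1 8B at ~4.5 bits/weight effective, 8k context:
est = estimate_vram_gb(8.0, 4.5, 8192, n_layers=32, n_kv_heads=8, head_dim=128)
# ≈ 6.4 GB, comfortably inside a 12 GB budget
```

The same formula explains why the 14B entries are borderline: the weight term alone grows to roughly 8 GB at Q4, leaving little headroom for KV cache and overhead on a 12 GB card.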
Backend support
These are the runtimes currently associated with this hardware profile.
- Ollama
- llama.cpp
- LM Studio
- ExLlama
Use the live calculator
The calculator lets you change context length, runtime, quantization, and concurrency instead of relying on a fixed profile.