Calculator
Can this model run on your machine?
Estimate model weight memory, KV cache growth, runtime overhead, CPU offload, and expected fit before you spend time on a deployment that was never going to work.
Fit verdict
Llama 3.1 8B Instruct on RTX 3060 12GB
This setup should run comfortably with Q4 at roughly 8,192 tokens of context for chat workloads.
| Weights | 4.69 GB |
| KV cache | 1 GB |
| Runtime overhead | 1 GB |
| Total required | 6.69 GB |
| Usable accelerator budget | 10.8 GB |
| System RAM budget | 24.96 GB |
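The verdict reduces to simple arithmetic, so it can be sanity-checked away from the page. A minimal Python sketch, assuming the 10% VRAM reserve implied by the 12 GB card yielding a 10.8 GB usable budget (the reserve fraction is an inference, not a published constant):

```python
# Fit check mirroring the breakdown above.
WEIGHTS_GB = 4.69        # Llama 3.1 8B at Q4 (from the table above)
KV_CACHE_GB = 1.0        # 8,192-token context, single sequence
OVERHEAD_GB = 1.0        # runtime/allocator overhead estimate
VRAM_GB = 12.0           # RTX 3060
RESERVE_FRACTION = 0.10  # assumed headroom kept for the display/driver

total_gb = WEIGHTS_GB + KV_CACHE_GB + OVERHEAD_GB  # 6.69 GB
usable_gb = VRAM_GB * (1 - RESERVE_FRACTION)       # 10.8 GB

print(f"required {total_gb:.2f} GB vs usable {usable_gb:.1f} GB")
print("fits" if total_gb <= usable_gb else "needs offload or a smaller quant")
```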
Operational breakdown
These numbers are estimates, not promises. They assume a single loaded model at the selected context length and concurrency.
| Selected backend | Ollama |
| Estimated throughput | 38-65 tok/s |
| Performance tier | high |
| CPU offload | Not expected |
| Backend note | Good default for simple local deployments and GGUF workflows. Easy to use, but not the leanest serving stack. |
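The offload row can be reasoned about the same way. A hedged sketch, assuming the weight file splits evenly across the model's 32 transformer layers (an assumption for illustration; real layer sizes vary and embedding matrices are ignored):

```python
import math

def layers_to_offload(total_gb: float, usable_gb: float,
                      weights_gb: float, n_layers: int = 32) -> int:
    """Estimate how many layers would spill to system RAM."""
    deficit = total_gb - usable_gb
    if deficit <= 0:
        return 0  # everything fits on the accelerator
    per_layer = weights_gb / n_layers  # crude even split
    return math.ceil(deficit / per_layer)

print(layers_to_offload(6.69, 10.8, 4.69))  # 0 -> "Not expected"
```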
Assumptions
- Weight memory is estimated from parameter count and quantization-specific overhead.
- KV cache assumes the selected context length and concurrency on one active model replica (see the sketch after this list).
- A headroom margin is baked into the fit tier instead of promising exact allocator behavior.
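The KV cache assumption is easy to verify by hand. A sketch using Llama 3.1 8B's published architecture (32 layers, 8 KV heads under grouped-query attention, head dimension 128) with an fp16 cache, both assumed rather than taken from the page:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2,
                   concurrency: int = 1) -> int:
    # Per token: 2 tensors (K and V) * layers * kv_heads * head_dim * bytes.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 128 KiB here
    return per_token * tokens * concurrency

gib = kv_cache_bytes(8192) / 2**30
print(f"{gib:.2f} GiB")  # 1.00 GiB, matching the breakdown above
```

Note that the cache grows linearly in both context length and concurrency, so doubling either doubles this line of the budget.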
Weight-only memory by quantization
This table excludes KV cache and runtime overhead, so it is only the starting point for fit planning.
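The same starting point can be computed directly. A sketch with rough effective bits-per-parameter values; these fold in quantization block overhead and are assumptions for illustration, not exact GGUF file sizes:

```python
PARAMS_B = 8.03  # Llama 3.1 8B parameter count, in billions

BITS_PER_PARAM = {  # approximate effective bits, including block overhead
    "Q4": 4.7,
    "Q5": 5.5,
    "Q8": 8.5,
    "FP16": 16.0,
}

for quant, bits in BITS_PER_PARAM.items():
    gb = PARAMS_B * 1e9 * bits / 8 / 1e9  # params * bytes per param
    print(f"{quant:>4}: {gb:5.2f} GB")    # Q4 lands near the 4.69 GB above
```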
Better fits on this machine
If the current setup is too tight, these are the nearest models that score better on the selected hardware and runtime.
Phi-4
Microsoft · barely fits
10.92 GB total, 21-37 tok/s
Qwen2.5 Coder 14B Instruct
Alibaba Cloud · barely fits
11.33 GB total, 20-35 tok/s