Calculator

Can this model run on your machine?

Estimate model weight memory, KV cache growth, runtime overhead, CPU offload, and expected fit before you spend time on a deployment that was never going to work.

Fit verdict

Llama 3.1 8B Instruct on RTX 3060 12GB

comfortable

This setup should run comfortably at Q4 with an 8,192-token context for chat workloads.

Weights

4.69 GB

KV cache

1 GB

Runtime overhead

1 GB

Total required

6.69 GB

Usable accelerator budget

10.8 GB

System RAM budget

24.96 GB
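
A minimal sketch of how the verdict is assembled from the component estimates above. The tier boundaries here are assumptions inferred from the verdicts shown on this page (comfortable inside the headroom-adjusted budget, barely when only the raw card capacity is left), not the calculator's exact cutoffs.

```python
def fit_tier(total_gb: float, usable_budget_gb: float, vram_gb: float) -> str:
    """Bucket the total requirement against the budget and raw capacity."""
    if total_gb <= usable_budget_gb:
        return "comfortable"  # fits with the headroom margin intact
    if total_gb <= vram_gb:
        return "barely"       # fits only by eating into the headroom margin
    return "no fit"           # exceeds the card entirely

# Component estimates from the verdict above: weights + KV cache + overhead.
total_gb = 4.69 + 1.0 + 1.0
print(f"{total_gb:.2f} GB total -> {fit_tier(total_gb, 10.8, 12.0)}")
# 6.69 GB total -> comfortable
```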

Operational breakdown

These numbers are estimates, not promises. They assume a single loaded model at the selected context length and concurrency.

Selected backend: Ollama
Estimated throughput: 38-65 tok/s
Performance tier: high
CPU offload: Not expected
Backend note: Good default for simple local deployments and GGUF workflows. Easy to use, but not the leanest serving stack.
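
The CPU offload line falls out of the same numbers. A rough sketch, assuming anything that does not fit on the card spills to system RAM; real backends offload at layer granularity, so the actual split is coarser than this.

```python
def offload_estimate(total_gb: float, vram_gb: float, ram_budget_gb: float) -> str:
    """Report how much, if anything, would have to live in system RAM."""
    overflow = max(0.0, total_gb - vram_gb)
    if overflow == 0.0:
        return "Not expected"
    if overflow <= ram_budget_gb:
        return f"~{overflow:.2f} GB held in system RAM (expect a throughput hit)"
    return "Does not fit even with offload"

print(offload_estimate(6.69, vram_gb=12.0, ram_budget_gb=24.96))  # Not expected
```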

Assumptions

  • Weight memory is estimated from parameter count and quantization-specific overhead.
  • KV cache assumes the selected context length and concurrency on one active model replica.
  • A headroom margin is baked into the fit tier instead of promising exact allocator behavior.
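
The 1 GB KV-cache figure can be reproduced from the model's published shape. Llama 3.1 8B uses 32 layers and grouped-query attention with 8 KV heads of dimension 128; the fp16 cache dtype and a concurrency of 1 are assumptions in this sketch.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2,
                 concurrency: int = 1) -> float:
    """Keys and values are each cached per layer, per KV head, per position."""
    elems = 2 * n_layers * n_kv_heads * head_dim * context_len * concurrency
    return elems * bytes_per_elem / 1024**3

print(kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, context_len=8192))
# 1.0 -- matches the 1 GB KV cache line in the verdict
```

Note that the cache scales linearly with context and concurrency: doubling the context to 16,384 tokens, or serving two concurrent requests, doubles it to 2 GB.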

Weight-only memory by quantization

This table excludes KV cache and runtime overhead, so it is only the starting point for fit planning.

FP16: 14.9 GB
BF16: 14.9 GB
INT8: 7.9 GB
Q8: 8.34 GB
Q6: 6.59 GB
Q5: 5.68 GB
Q4: 4.69 GB
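
The rows follow from parameter count times an effective bits-per-weight figure, per the first assumption above. In the sketch below, the effective bit widths and the overhead fraction are assumptions (GGUF "Q" formats carry scale metadata, so effective bits exceed the nominal width); the exact figures in the table come from the calculator's own per-quantization overhead model.

```python
PARAMS = 8.03e9  # Llama 3.1 8B parameter count

# Assumed effective bits per weight, including quantization metadata.
EFFECTIVE_BITS = {"FP16": 16.0, "BF16": 16.0, "INT8": 8.0,
                  "Q8": 8.5, "Q6": 6.6, "Q5": 5.7, "Q4": 4.9}

def weight_gib(params: float, bits: float, overhead_frac: float = 0.02) -> float:
    """Raw weight bytes plus a small format-overhead fraction, in GiB."""
    return params * bits / 8 * (1 + overhead_frac) / 1024**3

for fmt, bits in EFFECTIVE_BITS.items():
    print(f"{fmt}: {weight_gib(PARAMS, bits):.2f} GB")
```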

Better fits on this machine

If the current setup is too tight, these are the nearest models that score better on the selected hardware and runtime.

Phi-4

barely

Microsoft

10.92 GB total, 21-37 tok/s

Qwen2.5 Coder 14B Instruct

barely

Alibaba Cloud

11.33 GB total, 20-35 tok/s