Quick answer
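Start with model size, then add headroom. As a rough weights-only estimate, budget about 0.55GB per billion parameters at 4-bit and about 2.05GB per billion at FP16. Both figures sit slightly above the raw 0.5 and 2 bytes per parameter, so they already include a small margin.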
As a simple buying rule, 24GB VRAM is a strong tier for local experimentation, and 48GB+ is where larger open models become much more comfortable.
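As a minimal sketch of the rule of thumb above, the per-billion figures can be wrapped in a small estimator. The function name and the precision labels are illustrative and not taken from any library; the constants are the rough figures quoted in this section, not measurements.

```python
# Rough weight-only memory estimate from the per-billion figures above.
# The constants are this article's rules of thumb, not measured values.
GB_PER_BILLION = {"4bit": 0.55, "fp16": 2.05}


def estimate_weight_gb(params_billion: float, precision: str = "4bit") -> float:
    """Return approximate weight memory in GB (weights only, no KV cache)."""
    if precision not in GB_PER_BILLION:
        raise ValueError(f"unknown precision: {precision!r}")
    return params_billion * GB_PER_BILLION[precision]


for size in (8, 32, 70):
    q4 = estimate_weight_gb(size, "4bit")
    fp16 = estimate_weight_gb(size, "fp16")
    print(f"{size}B: ~{q4:.1f}GB at 4-bit, ~{fp16:.1f}GB at FP16")
```

By this estimate a 32B model needs about 17.6GB at 4-bit for weights alone, which is consistent with it landing in the 24GB tier rather than the 16GB tier.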
Hardware tiers
| Tier | Practical range | What to watch |
|---|---|---|
| 8GB VRAM | Small 3B-8B quantized models | Good for experiments, chat, and light coding. Keep context modest. |
| 12GB VRAM | Many 7B-12B quantized models | More comfortable for longer prompts and better quantization choices. |
| 16GB VRAM | 7B-14B quantized models, some 20B-class with tradeoffs | A useful baseline for local development. |
| 24GB VRAM | 14B-32B quantized models | Strong single-GPU tier for quality local testing. |
| 48GB VRAM | 32B-70B quantized models | Better for larger coding/reasoning models and longer contexts. |
| 80GB VRAM | Large dense models or bigger MoE checkpoints | Serious workstation/server tier; still watch KV cache. |
| Mac unified memory | Depends on free unified memory and backend support | Do not map 1:1 to GPU VRAM; leave headroom for the OS. |
Parameters, active parameters, and context
Parameter count is the fastest rough proxy for memory. At the same precision, weight memory scales roughly linearly with parameter count, so a dense 32B model needs about four times the weight memory of a dense 8B model. MoE models add a wrinkle: total parameters describe the full checkpoint, while active parameters describe the portion used per token.
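The distinction between total and active parameters can be sketched as follows. The `ModelSpec` class and both example models are hypothetical and chosen only for illustration; the 0.55GB-per-billion constant is the rough 4-bit figure from this article.

```python
from dataclasses import dataclass

GB_PER_BILLION_4BIT = 0.55  # rough weight-only figure used in this article


@dataclass
class ModelSpec:
    name: str
    total_params_b: float   # full checkpoint, drives storage and loading
    active_params_b: float  # used per token, drives compute


def weight_memory_gb(spec: ModelSpec) -> float:
    """Memory follows total parameters: the full checkpoint must be stored."""
    return spec.total_params_b * GB_PER_BILLION_4BIT


def active_fraction(spec: ModelSpec) -> float:
    """Share of parameters used per token; a throughput clue only."""
    return spec.active_params_b / spec.total_params_b


# Hypothetical models for illustration, not catalog entries.
dense = ModelSpec("dense-32B", 32.0, 32.0)
moe = ModelSpec("moe-30B-A3B", 30.0, 3.0)

for spec in (dense, moe):
    print(f"{spec.name}: ~{weight_memory_gb(spec):.1f}GB weights, "
          f"{active_fraction(spec):.0%} of parameters active per token")
```

The hypothetical MoE model uses only a tenth of its parameters per token, yet its weight estimate is nearly the same as the dense model's, because memory is sized from the total.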
Context window is the other major factor. KV-cache memory and latency both grow with the number of tokens held in context, so a model can fit at short context and become impractical at very long context.
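To make the context effect concrete, here is a hedged sketch of the usual KV-cache size estimate for a standard transformer decoder: two tensors (keys and values) per layer, each holding one vector per KV head per token. The layer count, head count, and head dimension below are hypothetical values, not the configuration of any catalog model, and the sketch assumes an unquantized FP16 cache and reports binary gigabytes (GiB).

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_value: int = 2,
                 batch_size: int = 1) -> float:
    """Approximate KV-cache size in GiB for a standard transformer decoder."""
    # Factor of 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value * batch_size)
    return total_bytes / 1024**3


# Hypothetical configuration: 32 layers, 8 KV heads, head dimension 128.
for tokens in (8_192, 32_768, 131_072):
    size = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                        context_tokens=tokens)
    print(f"{tokens:>7} tokens: ~{size:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly with context, from about 1 GiB at 8,192 tokens to about 16 GiB at 131,072 tokens, which can exceed the weight memory of a small quantized model.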
Catalog examples with rough memory
| Model | Params | Context (tokens) | 4-bit weights (approx.) | FP16 weights (approx.) | Released |
|---|---|---|---|---|---|
| LFM2 1.2B | 1.17B | 33K | 1GB | 3GB | Nov 28, 2025 |
| SmolLM2 1.7B | 1.7B | — | 1GB | 4GB | Nov 4, 2024 |
| Stable LM 2 1.6B | 1.6B | — | 1GB | 4GB | Jan 19, 2024 |
| TinyLlama 1.1B Chat | 1.1B | — | 1GB | 3GB | Jan 1, 2024 |
| Phi-1 | 1.3B | — | 1GB | 3GB | Jun 21, 2023 |
| GPT-2 | 1.5B | 1K | 1GB | 4GB | Nov 5, 2019 |
| BERT | 0.34B | 512 | 1GB | 1GB | Oct 11, 2018 |
| SmolLM3 3B | 3B | 128K | 2GB | 7GB | Jul 8, 2025 |
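These estimates are weight-only and appear to be rounded up to the next whole gigabyte, which is why BERT shows 1GB in both columns. Real deployments need extra memory for runtime overhead, KV cache, batching, and the operating system.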
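The memory columns in the table are consistent with multiplying the parameter count by the per-billion figures (0.55GB at 4-bit, 2.05GB at FP16) and rounding up to a whole gigabyte. That rounding rule is inferred from the numbers rather than stated by the catalog, so treat this as a sketch of how the values could be reproduced.

```python
import math

# Rough weight-only figures from this article (GB per billion parameters).
GB_PER_BILLION = {"4bit": 0.55, "fp16": 2.05}

# (model, parameters in billions), copied from the table above.
CATALOG = [
    ("LFM2 1.2B", 1.17),
    ("SmolLM2 1.7B", 1.7),
    ("Stable LM 2 1.6B", 1.6),
    ("TinyLlama 1.1B Chat", 1.1),
    ("Phi-1", 1.3),
    ("GPT-2", 1.5),
    ("BERT", 0.34),
    ("SmolLM3 3B", 3.0),
]


def rounded_weight_gb(params_billion: float, precision: str) -> int:
    """Weight-only estimate, rounded up to a whole GB (assumed rounding rule)."""
    return math.ceil(params_billion * GB_PER_BILLION[precision])


for name, params in CATALOG:
    q4 = rounded_weight_gb(params, "4bit")
    fp16 = rounded_weight_gb(params, "fp16")
    print(f"{name:<20} {params:>5}B  4-bit ~{q4}GB  FP16 ~{fp16}GB")
```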
Frequently asked questions
How much VRAM do I need to run a local LLM?
It depends on model size, quantization, context length, and serving overhead. As a rough starting point, 8GB suits small 3B-8B quantized models, 16GB opens up many 7B-14B models, 24GB is a strong hobbyist tier, and 48GB+ is better for larger models.
Do active parameters matter for MoE models?
Yes for compute per token, but total parameters still matter for storing and loading the full checkpoint. Treat active parameters as a throughput clue, not a complete memory estimate.
Why can a model fit but still run poorly?
Weight memory is only part of the requirement. Long contexts, KV cache, batching, CPU offload, runtime overhead, and slow memory bandwidth can make a technically fitting model impractical.
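A simple budget check illustrates the point. This is a sketch under stated assumptions: the function name is hypothetical, the 1.5GB default for runtime overhead is a placeholder rather than a measured value, and the weight and KV-cache inputs are example numbers you would replace with your own estimates. It covers memory only and says nothing about bandwidth or offload speed.

```python
def memory_budget(vram_gb: float, weight_gb: float, kv_cache_gb: float,
                  overhead_gb: float = 1.5) -> tuple[bool, float]:
    """Return (fits, headroom_gb) for a rough single-GPU memory budget.

    overhead_gb is an assumed placeholder for runtime buffers, not a
    measured figure.
    """
    required = weight_gb + kv_cache_gb + overhead_gb
    headroom = vram_gb - required
    return headroom >= 0, headroom


# Example inputs: ~17.6GB of 4-bit weights (a 32B model by the rough
# estimate) on a 24GB card, with a small and a large KV cache.
for label, kv in (("short context", 1.0), ("long context", 16.0)):
    fits, headroom = memory_budget(vram_gb=24, weight_gb=17.6, kv_cache_gb=kv)
    status = "fits" if fits else "does not fit"
    print(f"{label}: {status}, headroom {headroom:+.1f}GB")
```

With these example numbers the same weights fit comfortably at short context and overshoot the card by roughly 11GB at long context.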
Where to go next
Use the local LLM shortlist for model-by-model rough estimates, browse the broader local LLM catalog, or compare coding-focused downloadable models in open coding models.