How to size local LLM hardware

Quick answer

Start with model size, then add headroom. A rough 4-bit estimate is about 0.55GB per billion parameters for weights only; FP16 is about 2.05GB per billion.

If you want a simple buying rule: 24GB VRAM is a strong local experimentation tier; 48GB+ is where larger open models become much more comfortable.

Hardware tiers

Tier	Practical range	What to watch
8GB VRAM	Small 3B-8B quantized models	Good for experiments, chat, and light coding. Keep context modest.
12GB VRAM	Many 7B-12B quantized models	More comfortable for longer prompts and better quantization choices.
16GB VRAM	7B-14B quantized models, some 20B-class with tradeoffs	A useful baseline for local development.
24GB VRAM	14B-32B quantized models	Strong single-GPU tier for quality local testing.
48GB VRAM	32B-70B quantized models	Better for larger coding/reasoning models and longer contexts.
80GB VRAM	Large dense models or bigger MoE checkpoints	Serious workstation/server tier; still watch KV cache.
Mac unified memory	Depends on free unified memory and backend support	Do not map 1:1 to GPU VRAM; leave headroom for the OS.

Parameters, active parameters, and context

Parameter count is the fastest rough proxy for memory. A dense 32B model usually needs more memory than a dense 8B model. MoE models add a wrinkle: total parameters describe the full checkpoint, while active parameters describe the portion used per token.

Context window is the other major signal. Longer contexts increase KV-cache memory and latency. A model can fit at short context and become impractical at very long context.

Catalog examples with rough memory

Model	Params	Context	4-bit rough	FP16 rough	Released
LFM2 1.2B	1.17B	33K	1GB	3GB	Nov 28, 2025
SmolLM2 1.7B	1.7B	—	1GB	4GB	Nov 4, 2024
Stable LM 2 1.6B	1.6B	—	1GB	4GB	Jan 19, 2024
TinyLlama 1.1B Chat	1.1B	—	1GB	3GB	Jan 1, 2024
Phi-1	1.3B	—	1GB	3GB	Jun 21, 2023
GPT-2	1.5B	1K	1GB	4GB	Nov 5, 2019
BERT	0.34B	512	1GB	1GB	Oct 11, 2018
SmolLM3 3B	3B	128K	2GB	7GB	Jul 8, 2025

These estimates are weight-only. Real deployments need extra memory for runtime overhead, KV cache, batching, and the operating system.

Frequently asked questions

How much VRAM do I need to run a local LLM?

It depends on model size, quantization, context length, and serving overhead. As a rough starting point, 8GB is for small models, 16GB opens up many 7B-14B models, 24GB is a strong hobbyist tier, and 48GB+ is better for larger models.

Do active parameters matter for MoE models?

Yes for compute per token, but total parameters still matter for storing and loading the full checkpoint. Treat active parameters as a throughput clue, not a complete memory estimate.

Why can a model fit but still run poorly?

Weight memory is only part of the requirement. Long contexts, KV cache, batching, CPU offload, runtime overhead, and slow memory bandwidth can make a technically fitting model impractical.

Where to go next

Use the local LLM shortlist for model-by-model rough estimates, browse the broader local LLM catalog, or compare coding-focused downloadable models in open coding models.