VRAM Reference

LLM VRAM Requirements

Memory footprint by quantisation level — Q1 through FP32

Colour scale: Under 20 GB — consumer GPU 20–99 GB — workstation / server 100–499 GB — multi-GPU server 500 GB+ — cluster territory
GB
Showing 65 / 65 models
DeepSeek Llama Mistral Phi Qwen Gemma GLM
Quantisation (INT) Full Precision
Model Params Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 FP16 FP32
1.6 bit 2.5 bit 3.5 bit 4.5 bit 5.5 bit 6.5 bit 7.5 bit 8.5 bit 16 bit 32 bit
How It's Calculated

VRAM Estimation Formula

VRAM Calculator

Enter a parameter count to estimate VRAM across all quant levels
%
Enter a parameter count above to calculate VRAM requirements

The Formula

VRAM = params × bits_per_weight ÷ 8 × 1.18

Where:
params = total parameter count
bits_per_weight = bits per weight (see table below)
1.18 = +18% overhead (KV cache + activations)

Each weight is stored as a fixed number of bits depending on quantisation level. Dividing by 8 converts bits → bytes, then dividing by 1.073×10⁹ converts to GB. A flat 18% overhead is added to account for the KV cache, activation buffers, and CUDA runtime costs at a typical inference context length.

Worked Example — Llama-3.1-8B at Q4

params = 8,000,000,000
bits_per_weight = 4.5 (Q4)

raw bytes = 8.0×10⁹ × 4.5 ÷ 8
= 4,500,000,000 bytes

+ 18% overhead × 1.18
= 5,310,000,000 bytes

÷ 1,073,741,824 (bytes per GB)
≈ 4.95 GB → table shows 4.2 GB (Q4)
MoE models (Mixtral, DeepSeek-V3, Llama 4): all expert weights must be loaded into VRAM even though only a subset activates per token. VRAM is estimated using the total parameter count, not the active count.