How Many Parameters Does BERT-base Have?
BERT-base has approximately 110 million parameters (109,482,240, often rounded to ~109.5M). It uses 12 transformer layers, a hidden size of 768, 12 attention heads, and a WordPiece vocabulary of 30,522 tokens.
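A quick way to verify this figure, assuming the Hugging Face transformers library (with PyTorch) is installed:

```python
from transformers import BertModel

# Load the pretrained BERT-base checkpoint (downloads weights on first run)
model = BertModel.from_pretrained("bert-base-uncased")

# Sum the element count of every parameter tensor
total = sum(p.numel() for p in model.parameters())
print(f"{total:,}")  # 109,482,240
```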
BERT-base Configuration
Hidden size (H): 768
Intermediate size: 3072 (4 * H)
Attention heads: 12
Layers: 12
Vocabulary size: 30,522
Max position embeddings: 512
Segment (token type) vocabulary: 2
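These values match the defaults of transformers.BertConfig, so the configuration can be inspected programmatically:

```python
from transformers import BertConfig

# BertConfig() defaults correspond to BERT-base
config = BertConfig()
print(config.hidden_size)              # 768
print(config.intermediate_size)        # 3072
print(config.num_attention_heads)      # 12
print(config.num_hidden_layers)        # 12
print(config.vocab_size)               # 30522
print(config.max_position_embeddings)  # 512
print(config.type_vocab_size)          # 2
```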
Parameter Breakdown
| Component                   |  Parameters | Formula                                 |
|-----------------------------|------------:|-----------------------------------------|
| Word embeddings             |  23,440,896 | 30,522 × 768                            |
| Position embeddings         |     393,216 | 512 × 768                               |
| Segment embeddings          |       1,536 | 2 × 768                                 |
| Embedding LayerNorm         |       1,536 | 2 × 768 (weight + bias)                 |
| *Per transformer layer:*    |             |                                         |
| Self-attention (Q, K, V, O) |   2,362,368 | 4 × (768 × 768 + 768)                   |
| Attention LayerNorm         |       1,536 | 2 × 768                                 |
| Feed-forward (up + down)    |   4,722,432 | 768 × 3,072 + 3,072 + 3,072 × 768 + 768 |
| FFN LayerNorm               |       1,536 | 2 × 768                                 |
| Per-layer total             |   7,087,872 |                                         |
| 12 transformer layers       |  85,054,464 | 12 × 7,087,872                          |
| Pooler (768 → 768)          |     590,592 | 768 × 768 + 768                         |
| **Total**                   | 109,482,240 |                                         |
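The breakdown is easy to reproduce in code. Below is a minimal sketch; count_bert_params is a hypothetical helper written for this article, not a library function:

```python
def count_bert_params(hidden=768, layers=12, vocab=30522,
                      max_pos=512, segments=2, intermediate=None):
    """Count encoder parameters for a BERT-style model, including the pooler."""
    if intermediate is None:
        intermediate = 4 * hidden  # BERT uses a 4x FFN expansion

    # Embeddings: word + position + segment, plus one LayerNorm (weight + bias)
    embeddings = (vocab + max_pos + segments) * hidden + 2 * hidden

    # Self-attention: Q, K, V, O projections, each hidden x hidden plus bias
    attention = 4 * (hidden * hidden + hidden)

    # Feed-forward: up-projection and down-projection, each with bias
    ffn = hidden * intermediate + intermediate + intermediate * hidden + hidden

    # Two LayerNorms per layer (after attention and after the FFN)
    per_layer = attention + ffn + 2 * (2 * hidden)

    # Pooler: one hidden x hidden dense layer with bias
    pooler = hidden * hidden + hidden

    return embeddings + layers * per_layer + pooler

print(f"{count_bert_params():,}")  # 109,482,240
```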
BERT Model Family
BERT-tiny: 4.4M params (2 layers, 128 hidden)
BERT-mini: 11.2M params (4 layers, 256 hidden)
BERT-small: 28.8M params (4 layers, 512 hidden)
BERT-medium: 41.4M params (8 layers, 512 hidden)
BERT-base: 109.5M params (12 layers, 768 hidden)
BERT-large: 335.1M params (24 layers, 1024 hidden)
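All of these figures fall out of the same formula; reusing the count_bert_params sketch from the previous section, only the layer count and hidden size change:

```python
family = {
    "BERT-tiny":   (2, 128),
    "BERT-mini":   (4, 256),
    "BERT-small":  (4, 512),
    "BERT-medium": (8, 512),
    "BERT-base":   (12, 768),
    "BERT-large":  (24, 1024),
}
for name, (layers, hidden) in family.items():
    params = count_bert_params(hidden=hidden, layers=layers)
    print(f"{name}: {params / 1e6:.1f}M")  # 4.4M, 11.2M, 28.8M, 41.4M, 109.5M, 335.1M
```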
Memory Requirements
FP32 inference: ~418 MiB (params only: 109,482,240 × 4 bytes)
FP16 inference: ~209 MiB (2 bytes per param)
Training (Adam, FP32): ~1.63 GiB for params + grads + 2 optimizer moment states (4 × the FP32 weight size); activations add more and depend on batch size and sequence length.
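These estimates follow directly from the parameter count (bytes per parameter × number of parameters); a small sketch of the arithmetic, deliberately ignoring activation memory:

```python
PARAMS = 109_482_240
MIB = 1024 ** 2

print(f"FP32 weights: {PARAMS * 4 / MIB:,.0f} MiB")  # ~418 MiB
print(f"FP16 weights: {PARAMS * 2 / MIB:,.0f} MiB")  # ~209 MiB

# Vanilla FP32 Adam keeps 4 copies: weights, gradients, exp_avg, exp_avg_sq
adam_bytes = PARAMS * 4 * 4
print(f"Adam training state: {adam_bytes / 1024**3:.2f} GiB")  # ~1.63 GiB
```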