How Many Parameters Does BERT-base Have?

BERT-base has approximately 110 million parameters (109,482,240, often rounded to ~109.5M): 12 transformer layers, hidden size 768, 12 attention heads, and a 30,522-token WordPiece vocabulary.

BERT-base Configuration

Hidden size (H):        768
Intermediate size:      3072  (4 * H)
Attention heads:        12
Layers:                 12
Vocabulary size:        30,522
Max position:           512

Parameter Breakdown

Component                    | Parameters
-----------------------------|-------------
Word Embeddings              | 23,440,896   # 30522 * 768
Position Embeddings          |    393,216   # 512 * 768
Segment Embeddings           |      1,536   # 2 * 768
Embedding LayerNorm          |      1,536   # 2 * 768

Per Transformer Layer:
  Self-Attention (Q,K,V,O)   |  2,362,368   # 4 * (768*768 + 768)
  Attention LayerNorm        |      1,536   # 2 * 768
  Feed-Forward (up + down)   |  4,722,432   # 768*3072 + 3072 + 3072*768 + 768
  FFN LayerNorm              |      1,536   # 2 * 768
  Per-layer total            |  7,087,872

12 Transformer Layers        | 85,054,464   # 12 * 7,087,872

Pooler (768 → 768)           |    590,592   # 768*768 + 768
-----------------------------|-------------
Total                        | ~109,482,240
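The breakdown above can be reproduced with plain arithmetic. A minimal sketch in Python (no libraries; variable names are illustrative):

```python
H, L, V = 768, 12, 30522           # hidden size, layers, vocabulary
POS, SEG, FFN = 512, 2, 4 * H      # max positions, segment types, intermediate size

embeddings = V * H + POS * H + SEG * H + 2 * H   # word + position + segment + LayerNorm
attention  = 4 * (H * H + H)                     # Q, K, V, O projections, each with bias
ffn        = H * FFN + FFN + FFN * H + H         # up-projection + down-projection, with biases
per_layer  = attention + ffn + 2 * (2 * H)       # plus attention and FFN LayerNorms
pooler     = H * H + H

total = embeddings + L * per_layer + pooler
print(total)  # 109482240
```

Note that the per-layer total (7,087,872) matches the table, and the grand total lands exactly on 109,482,240.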

BERT Model Family

BERT-tiny:    4.4M params   (2 layers,  128 hidden)
BERT-mini:    11.2M params  (4 layers,  256 hidden)
BERT-small:   28.8M params  (4 layers,  512 hidden)
BERT-medium:  41.4M params  (8 layers,  512 hidden)
BERT-base:   109.5M params  (12 layers, 768 hidden)
BERT-large:  335.1M params  (24 layers, 1024 hidden)
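Because every size in the family shares the same vocabulary (30,522), max positions (512), segment count (2), and 4H intermediate size, the count reduces to a closed form in layers and hidden size: the per-layer term simplifies to 12H² + 13H. A sketch (function name is illustrative):

```python
def bert_params(layers: int, hidden: int, vocab: int = 30522) -> int:
    """Total parameters for a BERT variant with 4*hidden FFN and 512 positions."""
    embeddings = hidden * (vocab + 512 + 2 + 2)   # word + pos + segment + LayerNorm
    per_layer  = 12 * hidden**2 + 13 * hidden     # attention + FFN + 2 LayerNorms
    pooler     = hidden * (hidden + 1)
    return embeddings + layers * per_layer + pooler

for name, L, H in [("tiny", 2, 128), ("base", 12, 768), ("large", 24, 1024)]:
    print(f"BERT-{name}: {bert_params(L, H) / 1e6:.1f}M")
# BERT-tiny: 4.4M, BERT-base: 109.5M, BERT-large: 335.1M
```

Plugging in the other rows (4 layers/256, 4 layers/512, 8 layers/512) recovers the 11.2M, 28.8M, and 41.4M figures in the table.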

Memory Requirements

FP32 inference:  ~418 MB  (parameters only)
FP16 inference:  ~209 MB
Training (Adam): ~1.67 GB (params + grads + 2 optimizer moments, all FP32; activations extra)
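These figures follow directly from bytes per parameter: 4 (FP32) or 2 (FP16) for inference, and roughly 16 for Adam training (FP32 weights, gradients, and two moment estimates), not counting activations. A quick check:

```python
params = 109_482_240
MiB = 2**20
fp32_mb = params * 4 / MiB    # 4 bytes per parameter
fp16_mb = params * 2 / MiB    # 2 bytes per parameter
adam_mb = params * 16 / MiB   # FP32 weights + grads + Adam m and v
print(f"{fp32_mb:.0f} MB, {fp16_mb:.0f} MB, {adam_mb:.0f} MB")
# 418 MB, 209 MB, 1671 MB
```

The training estimate (~1671 MB ≈ 1.67 GB) is a floor, not a budget: activation memory scales with batch size and sequence length and usually dominates in practice.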
