MultiheadAttention Shape Calculator
Calculate the output shape of PyTorch MultiheadAttention. Enter embed_dim, num_heads, and sequence length to verify your transformer layer configuration.
Built by Michael Lip
Frequently Asked Questions
What is the output shape of MultiheadAttention?
For input [seq_len, batch, embed_dim] (PyTorch default) or [batch, seq_len, embed_dim] (with batch_first=True), the attention output has the same shape as the input: the embed_dim dimension is preserved. The attention weights have shape [batch, seq_len, seq_len] by default, because PyTorch averages them over heads; pass average_attn_weights=False to get the per-head shape [batch, num_heads, seq_len, seq_len].
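A quick shape check (the values embed_dim=512, num_heads=8, batch=4, seq_len=10 are illustrative):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
x = torch.randn(4, 10, 512)  # [batch, seq_len, embed_dim]

# Self-attention: query, key, and value are all x.
out, weights = mha(x, x, x, average_attn_weights=False)

print(out.shape)      # torch.Size([4, 10, 512])   -- embed_dim preserved
print(weights.shape)  # torch.Size([4, 8, 10, 10]) -- [batch, num_heads, seq_len, seq_len]
```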
Why must embed_dim be divisible by num_heads?
Each head operates on embed_dim / num_heads dimensions (the head dimension). If embed_dim=512 and num_heads=8, each head processes 64 dimensions. If this division isn't even, PyTorch raises an error when the layer is constructed.
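A small demonstration of the divisibility rule (512 / 8 = 64 works; 512 / 7 does not):

```python
import torch.nn as nn

# 512 is divisible by 8, so construction succeeds (head_dim = 64).
ok = nn.MultiheadAttention(embed_dim=512, num_heads=8)

# 512 is not divisible by 7, so construction fails.
try:
    nn.MultiheadAttention(embed_dim=512, num_heads=7)
except AssertionError as e:
    print(e)  # message explains embed_dim must be divisible by num_heads
```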
How many parameters does MultiheadAttention have?
With the default packed in_proj, it has 3 * embed_dim * embed_dim weights for the Q, K, V projections, embed_dim * embed_dim for the output projection, plus 4 * embed_dim bias terms. For embed_dim=512, that's 1,050,624 parameters (about 1.05M), independent of num_heads.
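You can verify the count directly by summing the layer's parameters (embed_dim=512, num_heads=8 here are illustrative; the count does not depend on num_heads):

```python
import torch.nn as nn

embed_dim = 512
mha = nn.MultiheadAttention(embed_dim=embed_dim, num_heads=8)

total = sum(p.numel() for p in mha.parameters())

# 3*E^2 (packed Q/K/V weight) + 3*E (its bias)
# + E^2 (output projection weight) + E (its bias)
expected = 3 * embed_dim**2 + 3 * embed_dim + embed_dim**2 + embed_dim

print(total, expected)  # 1050624 1050624
```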
About This Tool
This tool is part of HeyTensor, a free suite of PyTorch and deep learning utilities. All calculations run entirely in your browser — no data is sent to any server. The source code is open on GitHub.
Contact
HeyTensor is built and maintained by Michael Lip. For questions or feedback, email [email protected].