How Many Parameters Does Linear(4096, 4096) Have?
Linear(4096, 4096) has 16,781,312 trainable parameters. This includes 16,777,216 weights and 4096 bias terms.
Formula Breakdown
For a Linear layer, the parameter count is:
parameters = in_features * out_features + out_features (bias)
parameters = 4096 * 4096 + 4096
parameters = 16,777,216 + 4096
parameters = 16,781,312
The weight matrix W has shape (4096, 4096) = 16,777,216 values. The bias vector b has 4096 values. Together: 16,781,312 trainable parameters.
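The formula generalizes to any layer size. Here is a minimal pure-Python sketch (the helper name is ours, not part of any library):

```python
def linear_param_count(in_features, out_features, bias=True):
    """Parameter count for a fully connected layer: W has shape (out, in), b has shape (out,)."""
    return in_features * out_features + (out_features if bias else 0)

print(linear_param_count(4096, 4096))              # 16781312
print(linear_param_count(4096, 4096, bias=False))  # 16777216
```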
Memory Usage
In float32 (4 bytes per parameter), this layer's parameters (weights plus bias) occupy about 64.02 MB. During training with the Adam optimizer, multiply by roughly 3 (weights plus first- and second-moment buffers) for about 192.05 MB, and gradients add one more copy on top during backpropagation.
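The memory arithmetic can be checked directly. A small sketch (the helper name is ours; MB here means MiB, i.e. 1024² bytes):

```python
def layer_memory_mb(num_params, bytes_per_param=4):
    """Memory in MiB for a given parameter count (float32 = 4 bytes per parameter)."""
    return num_params * bytes_per_param / (1024 ** 2)

params = 4096 * 4096 + 4096                       # weights + bias
print(f"Parameters: {layer_memory_mb(params):.2f} MB")       # 64.02 MB
print(f"With Adam state: {layer_memory_mb(params) * 3:.2f} MB")  # 192.05 MB
```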
Architecture Context
This exact configuration appears in VGG-16 and VGG-19 as the second fully connected layer (fc7); fc6 maps the flattened 25088-dimensional feature map down to 4096, so only fc7 is Linear(4096, 4096). Understanding parameter counts helps you estimate model size, memory requirements, and the risk of overfitting: layers with more parameters need more training data and compute to train effectively.
Linear layers are often the most parameter-heavy part of a network. For example, VGG-16 has ~124M parameters in its three fully connected layers versus only ~14M in all its convolutional layers. Modern architectures minimize linear layers by using global average pooling.
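The trade-off can be sketched by comparing the two head designs directly. The VGG-style head below mirrors VGG-16's classifier shapes; the GAP head is a generic modern pattern, not taken from any specific model:

```python
import torch.nn as nn

# VGG-style classifier head: flatten a 7x7x512 feature map into large linear layers.
vgg_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 4096),  # fc6
    nn.ReLU(),
    nn.Linear(4096, 4096),         # fc7
    nn.ReLU(),
    nn.Linear(4096, 1000),         # fc8
)

# Modern head: global average pooling collapses spatial dims before one small linear layer.
gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 1000),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"VGG-style head: {count(vgg_head):,}")  # 123,642,856
print(f"GAP head: {count(gap_head):,}")        # 513,000
```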
PyTorch Code to Verify
```python
import torch.nn as nn

layer = nn.Linear(4096, 4096)

# Count all trainable parameters
total = sum(p.numel() for p in layer.parameters())
print(f"Total parameters: {total}")              # 16781312

# Break it down
print(f"Weight shape: {layer.weight.shape}")     # torch.Size([4096, 4096])
print(f"Weight params: {layer.weight.numel()}")  # 16777216
print(f"Bias shape: {layer.bias.shape}")         # torch.Size([4096])
print(f"Bias params: {layer.bias.numel()}")      # 4096

# Without bias
layer_no_bias = nn.Linear(4096, 4096, bias=False)
print(f"Without bias: {sum(p.numel() for p in layer_no_bias.parameters())}")  # 16777216
```
Comparison: With vs. Without Bias
| Configuration | Parameters |
|---|---|
| Linear(4096, 4096) (with bias) | 16,781,312 |
| Linear(4096, 4096, bias=False) | 16,777,216 |
When BatchNorm follows a linear or convolutional layer, that layer's bias is redundant because BatchNorm applies its own learnable shift (beta). Setting bias=False saves 4096 parameters per layer here.
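A quick sketch of the pattern (a hypothetical block, not from any particular model):

```python
import torch.nn as nn

# When a normalization layer follows, its learnable shift (beta) subsumes the bias,
# so the preceding layer can safely drop its own bias term.
block = nn.Sequential(
    nn.Linear(4096, 4096, bias=False),  # bias omitted: BatchNorm's beta plays that role
    nn.BatchNorm1d(4096),               # adds its own weight (gamma) and bias (beta), 4096 each
    nn.ReLU(),
)
total = sum(p.numel() for p in block.parameters())
print(f"Block parameters: {total}")  # 16777216 + 2 * 4096 = 16785408
```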