The 20 Most Common PyTorch Errors
Ranked by frequency from Stack Overflow data analysis. Each error includes the exact message, why it happens, how to fix it, and how to prevent it. Stop guessing and fix errors in seconds.
By Michael Lip · April 7, 2026 · Based on analysis of 300+ Stack Overflow questions
Jump to Error
- #1 mat1 and mat2 shapes cannot be multiplied
- #2 CUDA out of memory
- #3 Expected all tensors on same device
- #4 Expected 4-dimensional input
- #5 view size not compatible
- #6 expected scalar type Long but found Float
- #7 inplace operation gradient error
- #8 Kernel size can't be greater than input
- #9 Expected input batch_size to match target
- #10 backward through graph a second time
- #11 expected scalar type Float but found Half
- #12 shape is invalid for input of size N
- #13 device-side assert triggered
- #14 Expected hidden size mismatch (LSTM)
- #15 does not require grad and has no grad_fn
- #16 tensor size mismatch at non-singleton dim
- #17 embed_dim must be divisible by num_heads
- #18 grad only for scalar outputs
- #19 expected channels but got N channels
- #20 Deserialize on CUDA but is_available False
mat1 and mat2 shapes cannot be multiplied
Why It Happens
A nn.Linear(in_features, out_features) layer performs matrix multiplication: output = input @ weight.T. The input's last dimension must equal in_features. This error occurs when they don't match, most commonly at the transition from convolutional layers to fully-connected layers. The flattened feature count depends on input spatial dimensions, kernel sizes, strides, and padding -- getting any one wrong cascades to the Linear layer.
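The cascade described above is easy to check by hand. Here is a sketch of the standard Conv2d/MaxPool2d output-size formula in plain Python (no PyTorch required); the three-block layer stack is a hypothetical example, not a specific architecture:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of Conv2d/MaxPool2d: floor((in + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Trace a 32x32 input through three conv(k=3, s=1, p=1) + pool(2) blocks
size = 32
for _ in range(3):
    size = conv2d_out(size, kernel=3, padding=1)  # conv keeps size (same padding)
    size = conv2d_out(size, kernel=2, stride=2)   # pool halves it

channels = 64
flat_size = channels * size * size
print(size, flat_size)  # 4 1024 -- so the first Linear needs in_features=1024
```

If `flat_size` disagrees with the `in_features` you wrote, that is exactly the mat1/mat2 mismatch PyTorch reports.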
The Fix
# Step 1: Find the actual flattened size
dummy = torch.zeros(1, 3, 32, 32)
dummy = self.features(dummy) # run through conv layers
print(dummy.shape) # e.g., [1, 64, 4, 4]
flat_size = dummy.view(1, -1).shape[1] # 1024
# Step 2: Set Linear in_features to match
self.classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(flat_size, 256),  # flat_size, not a guess
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Or use LazyLinear (infers in_features automatically):
self.fc = nn.LazyLinear(10) # in_features set on first forward
Prevention
Use nn.LazyLinear to defer shape inference to runtime.
CUDA out of memory
Why It Happens
GPU memory is finite. During training, memory is consumed by: model parameters (weights), gradients (same size as parameters), optimizer states (1-2x parameter size for Adam), forward activations (proportional to batch size and network depth), and PyTorch's caching allocator overhead. A model that fits in memory for inference may OOM during training because gradients and optimizer states multiply memory usage by 3-4x.
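The 3-4x multiplier is simple arithmetic. A back-of-the-envelope sketch for a hypothetical 25M-parameter model trained in float32 with Adam (parameter count is an assumption for illustration):

```python
params = 25_000_000           # hypothetical model size
bytes_per_float32 = 4

weights    = params * bytes_per_float32       # the model itself
gradients  = params * bytes_per_float32       # one gradient per weight
adam_state = 2 * params * bytes_per_float32   # exp_avg + exp_avg_sq per weight

static_mb = (weights + gradients + adam_state) / 1024**2
print(f"{static_mb:.0f} MB before a single activation")  # ~381 MB, ~4x the weights alone
# Forward activations come on top of this and scale linearly with batch size
```

This is why a model that runs inference comfortably can still OOM the moment training starts.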
The Fix
# Solution 1: Reduce batch size (simplest)
loader = DataLoader(dataset, batch_size=8) # was 32
# Solution 2: Mixed precision training (halves activation memory)
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    with torch.cuda.amp.autocast():
        loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
# Solution 3: Gradient accumulation (effective large batch)
accumulation_steps = 4
for i, (x, y) in enumerate(loader):
    loss = model(x, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
# Solution 4: Gradient checkpointing (trade compute for memory)
from torch.utils.checkpoint import checkpoint
# In forward():
out = checkpoint(self.expensive_layer, input, use_reentrant=False)
Expected all tensors to be on the same device
Why It Happens
PyTorch tensors can live on different devices (CPU, cuda:0, cuda:1, etc.). Operations between tensors on different devices are not supported. Common causes: forgetting to move input data to GPU after moving the model, creating new tensors inside forward() without specifying device, or loading pretrained weights on CPU and forgetting to transfer.
The Fix
# The definitive pattern: use a single device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
for inputs, targets in dataloader:
    inputs = inputs.to(device)
    targets = targets.to(device)
    output = model(inputs)
    loss = criterion(output, targets)
# Inside model: create tensors on the same device as input
class MyModel(nn.Module):
    def forward(self, x):
        # Bad: mask = torch.zeros(x.size(0))  # CPU!
        # Good:
        mask = torch.zeros(x.size(0), device=x.device)
        return x * mask
Prevention
Define device once at the top of your script. Use .to(device) for model and data. Inside models, always use device=x.device when creating new tensors.
Expected 4-dimensional input for Conv2d
Why It Happens
Conv2d expects input shape [batch, channels, height, width]. When passing a single image for inference, you have [channels, height, width] (3D), missing the batch dimension. This is one of the most common errors when transitioning from training (where DataLoader adds the batch dim) to inference (where you handle a single image).
The Fix
# Add batch dimension for single images
img = transform(pil_image) # [3, 224, 224]
img = img.unsqueeze(0) # [1, 3, 224, 224]
output = model(img)
# Remove batch dimension from output if needed
prediction = output.squeeze(0) # [10] instead of [1, 10]
# For batch of images, stack them:
batch = torch.stack([transform(img) for img in images]) # [N, 3, 224, 224]
Prevention
Always .unsqueeze(0) single samples before passing them to a model. Use HeyTensor's Conv2d Calculator to verify the expected input format.
view size is not compatible with input tensor's size and stride
Why It Happens
The .view() method requires that the tensor occupies a contiguous block of memory. After operations like .transpose(), .permute(), or certain slicing operations, the tensor's memory layout becomes non-contiguous. PyTorch cannot create a new view of non-contiguous memory without copying data.
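To see why a transpose breaks contiguity, here is a plain-Python sketch of the row-major stride rule that .is_contiguous() effectively checks (strides counted in elements, the way PyTorch reports them; this is a simplification that ignores size-1 edge cases):

```python
def row_major_strides(shape):
    """Strides (in elements) of a freshly allocated, contiguous tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.insert(0, step)
        step *= dim
    return tuple(strides)

def looks_contiguous(shape, strides):
    return strides == row_major_strides(shape)

shape = (2, 3, 4)
strides = row_major_strides(shape)           # (12, 4, 1)
print(looks_contiguous(shape, strides))      # True

# transpose(1, 2) swaps dims AND strides -- the data is never moved
t_shape = (2, 4, 3)
t_strides = (12, 1, 4)
print(looks_contiguous(t_shape, t_strides))  # False -> .view() would fail here
```

Because .view() only reinterprets the existing buffer, it refuses any layout whose strides no longer follow this row-major pattern; .reshape() copies when necessary instead.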
The Fix
# Option 1: Call .contiguous() before .view()
x = x.transpose(1, 2).contiguous().view(batch, -1)
# Option 2: Use .reshape() instead (handles non-contiguous automatically)
x = x.transpose(1, 2).reshape(batch, -1)
# Option 3: Use torch.flatten()
x = torch.flatten(x, start_dim=1)
# Check if tensor is contiguous:
print(x.is_contiguous()) # False after transpose
Prevention
Prefer .reshape() over .view() unless you specifically need a view (shared memory). See HeyTensor's View Compatibility Guide for details.
expected scalar type Long but found Float
Why It Happens
PyTorch's classification loss functions (CrossEntropyLoss, NLLLoss) and nn.Embedding require integer (Long/int64) indices, not floating-point values. This error commonly appears when labels come from a CSV or numpy array as floats, or when you accidentally use a regression loss function's target format for classification.
The Fix
# Cast labels to long
labels = labels.long()
# Or create with correct dtype from the start
labels = torch.tensor([0, 1, 2, 0, 1], dtype=torch.long)
# In your Dataset:
class MyDataset(Dataset):
    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.long)
        return x, y
Prevention
Create labels with dtype torch.long. Add .long() in your Dataset's __getitem__. See the Loss Functions Reference for expected dtypes.
Variable modified by inplace operation (gradient error)
Why It Happens
PyTorch's autograd system stores references to intermediate tensors computed during the forward pass. During backpropagation, it needs these exact tensors to compute gradients. In-place operations modify the tensor's data directly, so when autograd looks at the stored reference, the values have changed, making gradient computation incorrect or impossible. PyTorch detects this and raises an error rather than silently computing wrong gradients.
The Fix
# Replace ALL in-place operations with out-of-place versions:
# Instead of:       Use:
# x += y            x = x + y
# x -= y            x = x - y
# x *= y            x = x * y
# x.relu_()         x = x.relu() or x = F.relu(x)
# x.sigmoid_()      x = x.sigmoid()
# x[i] = val        mask-based operations
# x.add_(y)         x = x.add(y)
# x.mul_(y)         x = x.mul(y)
# To find the exact line causing the error:
torch.autograd.set_detect_anomaly(True)
# Then run your training loop -- PyTorch will print the exact operation
Prevention
Avoid in-place operations (methods ending in _) during training. Use torch.autograd.set_detect_anomaly(True) to locate the exact offending line.
Kernel size can't be greater than actual input size
Why It Happens
Each convolution or pooling layer reduces spatial dimensions. After multiple downsampling layers, the feature maps can shrink below the kernel size. This is especially common with small input images (CIFAR-10's 32x32, MNIST's 28x28) when using architectures designed for larger inputs (ImageNet's 224x224).
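You can catch this before PyTorch does by running the size formula until a kernel no longer fits. A plain-Python sketch; the layer list below is a hypothetical stack mirroring the trace in The Fix, not a specific model:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Conv2d/MaxPool2d spatial output size: floor((in + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) -- four conv+pool blocks, then an unpadded conv
layers = [
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 0),  # no padding -- the problem layer
]

size = 32
for i, (name, k, s, p) in enumerate(layers):
    if size + 2 * p < k:
        print(f"layer {i} ({name}): kernel {k} > input {size}")  # fires here
        break
    size = out_size(size, k, s, p)
```

Running the trace flags layer 8, exactly where a padded kernel or a removed pool would be needed.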
The Fix
# Trace dimensions through your network:
# Input: 32x32
# Conv(k=3, s=1, p=1): 32x32 (same padding)
# Pool(2): 16x16
# Conv(k=3, s=1, p=1): 16x16
# Pool(2): 8x8
# Conv(k=3, s=1, p=1): 8x8
# Pool(2): 4x4
# Conv(k=3, s=1, p=1): 4x4
# Pool(2): 2x2
# Conv(k=3, s=1, p=0): ERROR! 2 < 3
# Fix: add padding, reduce kernel, or remove a pool layer
self.conv5 = nn.Conv2d(256, 256, kernel_size=1) # 1x1 conv
# Or:
self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1) # same padding
Expected input batch_size to match target batch_size
Why It Happens
The model output and target tensors have different batch sizes. This usually means your forward pass accidentally changed the batch dimension (e.g., through a bad reshape), or your DataLoader produces mismatched input/target pairs. Less commonly, it happens when the final batch in an epoch has fewer samples than expected.
The Fix
# Debug: print shapes at every step
def forward(self, x):
    print(f"Input: {x.shape}")
    x = self.features(x)
    print(f"After features: {x.shape}")
    x = x.view(x.size(0), -1)  # use x.size(0), not hardcoded batch
    print(f"After flatten: {x.shape}")
    x = self.classifier(x)
    print(f"Output: {x.shape}")
    return x
# In training loop: verify batch alignment
for inputs, targets in loader:
    assert inputs.size(0) == targets.size(0), \
        f"Batch mismatch: {inputs.size(0)} vs {targets.size(0)}"
Prevention
Always use x.size(0) or x.shape[0] for the batch dimension.
Trying to backward through the graph a second time
Why It Happens
After .backward(), PyTorch frees the intermediate buffers used for gradient computation to save memory. If you call .backward() again on a tensor that shares the same computation graph, those buffers are gone. Common scenarios: computing multiple losses that share the same forward pass, or reusing hidden states in RNN training without detaching.
The Fix
# Best fix: combine losses before backward
output = model(x)
loss_ce = F.cross_entropy(output, targets)
loss_reg = 0.01 * sum(p.pow(2).sum() for p in model.parameters())
total_loss = loss_ce + loss_reg
total_loss.backward() # single backward pass
# If you must backward twice: retain_graph=True
loss1.backward(retain_graph=True) # keeps buffers
loss2.backward() # uses retained buffers
# For RNN: detach hidden state between sequences
for seq in sequences:
    hidden = hidden.detach()  # break graph connection
    output, hidden = rnn(seq, hidden)
Prevention
Combine losses into a single scalar before calling .backward(). For RNNs, detach hidden states between sequences with .detach().
expected scalar type Float but found Half
Why It Happens
Float32 and Float16 tensors are being mixed in an operation. This commonly occurs when manually casting the model to half precision, using AMP incorrectly, or when BatchNorm/LayerNorm layers (which should stay in float32) receive half-precision inputs without autocast.
The Fix
# Best fix: use autocast for automatic dtype handling
with torch.cuda.amp.autocast():
    output = model(x)
    loss = criterion(output, target)
# If using manual half precision, cast inputs too:
model = model.half().cuda()
x = x.half().cuda()
# Keep BatchNorm in float32 (critical for stability):
for module in model.modules():
    if isinstance(module, (nn.BatchNorm2d, nn.LayerNorm)):
        module.float()
Prevention
Use torch.cuda.amp.autocast() for mixed precision. Never manually call .half() on individual layers unless you know what you're doing.
shape 'X' is invalid for input of size N
Why It Happens
The requested reshape dimensions don't multiply to equal the total number of elements in the tensor. For example, if you try to reshape a tensor with 25,088 elements into [32, 784], that would require 32 * 784 = 25,088 elements -- which only works if the batch size is exactly 32 and the feature dimension is exactly 784. If either is wrong, the reshape fails.
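The rule PyTorch applies here is pure arithmetic: the product of the requested dimensions must equal the element count, and a single -1 is filled with whatever makes that true. A plain-Python sketch of that inference (a simplified model of the check, not PyTorch's actual implementation):

```python
from math import prod

def infer_view_shape(numel, shape):
    """Mimic view/reshape size checking, including a single -1 dimension."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    known = prod(d for d in shape if d != -1)  # product of the fixed dims
    if -1 in shape:
        if numel % known:
            raise ValueError(f"shape {shape} is invalid for input of size {numel}")
        return tuple(numel // known if d == -1 else d for d in shape)
    if known != numel:
        raise ValueError(f"shape {shape} is invalid for input of size {numel}")
    return tuple(shape)

print(infer_view_shape(25088, [32, 784]))  # (32, 784) -- only works for batch 32
print(infer_view_shape(25088, [16, -1]))   # (16, 1568) -- the -1 adapts
```

The hardcoded variant silently depends on the batch size; the -1 variant never does, which is exactly why The Fix below recommends it.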
The Fix
# Never hardcode reshape dimensions
# Bad:
x = x.view(32, 784)
# Good: use -1 for automatic inference
x = x.view(x.size(0), -1) # batch preserved, features auto-computed
# Even better: use nn.Flatten()
self.flatten = nn.Flatten(start_dim=1)
x = self.flatten(x) # automatically flattens all dims except batch
Prevention
Use -1 in exactly one dimension to let PyTorch infer the size, or use nn.Flatten(). Use the Flatten Calculator to verify dimensions.
CUDA error: device-side assert triggered
Why It Happens
This cryptic error almost always means an index is out of bounds on the GPU. The top causes are: (1) a class label >= num_classes in CrossEntropyLoss, (2) an embedding index >= num_embeddings, (3) a negative index where unsigned was expected. CUDA errors are reported asynchronously, so the Python traceback may not point to the actual line.
The Fix
# Step 1: Get a better error message
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA initializes
# Or move everything to CPU -- the same bug raises a clear IndexError there
model = model.cpu()
# Re-run the failing code with either change to get a usable traceback
# Step 2: Validate indices
assert labels.min() >= 0, f"Negative label: {labels.min()}"
assert labels.max() < num_classes, f"Label {labels.max()} >= num_classes {num_classes}"
# Step 3: For Embedding
assert indices.max() < embedding.num_embeddings
assert indices.min() >= 0
Expected hidden size mismatch in LSTM/GRU
Why It Happens
LSTM/GRU hidden states have shape (num_layers * num_directions, batch_size, hidden_size). If you initialize hidden states with a fixed batch size (e.g., 1) but pass input with a different batch size (e.g., 32), the dimensions don't match. This also happens with the last batch in an epoch when drop_last=False.
The Fix
# Always derive batch_size from the input tensor
def init_hidden(self, batch_size, device):
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    return (h0, c0)

def forward(self, x):
    batch_size = x.size(0)  # dynamic batch size (assumes batch_first=True)
    hidden = self.init_hidden(batch_size, x.device)
    output, hidden = self.lstm(x, hidden)
    return output
element 0 does not require grad and has no grad_fn
Why It Happens
You called .backward() on a tensor that isn't connected to any differentiable computation. Common causes: (1) using .detach() or .data too early in the computation, (2) creating the tensor with requires_grad=False (the default), (3) performing operations inside torch.no_grad(), (4) converting to numpy and back (breaks the gradient chain).
The Fix
# Check if tensor has gradient tracking
print(loss.requires_grad) # should be True
print(loss.grad_fn) # should not be None
# Common mistake: detaching predictions
pred = model(x).detach() # BREAKS gradient chain!
loss = criterion(pred, y)
loss.backward() # ERROR
# Fix: don't detach
pred = model(x)
loss = criterion(pred, y)
loss.backward() # works
# Common mistake: operations in no_grad
with torch.no_grad():
    output = model(x)
    loss = criterion(output, y)
loss.backward()  # ERROR: output has no grad_fn
Prevention
Never .detach() a tensor that needs gradients. Only use torch.no_grad() for inference/validation, not training.
Tensor size mismatch at non-singleton dimension
Why It Happens
Two tensors in an element-wise operation have dimensions that cannot be broadcast. PyTorch broadcasting requires each dimension pair to either match or be 1. If tensor A has shape [32, 10] and tensor B has shape [32, 5], dimension 1 (10 vs 5) is incompatible. This commonly occurs in skip connections, attention mechanisms, or custom loss functions.
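The broadcasting rule can be sketched in a few lines of plain Python: align shapes from the right, and every dimension pair must either match or contain a 1 (a simplified model of PyTorch's behavior):

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape, or raise like PyTorch does."""
    # Left-pad the shorter shape with 1s, then compare from the trailing dim
    a = (1,) * (len(b) - len(a)) + tuple(a)
    b = (1,) * (len(a) - len(b)) + tuple(b)
    result = []
    for x, y in zip(reversed(a), reversed(b)):
        if x == y or x == 1 or y == 1:
            result.append(max(x, y))
        else:
            raise ValueError(f"size {x} must match size {y} at non-singleton dimension")
    return tuple(reversed(result))

print(broadcast_shape((32, 1), (32, 10)))  # (32, 10) -- the 1 broadcasts
print(broadcast_shape((10,), (32, 10)))    # (32, 10) -- missing dims act as 1
# broadcast_shape((32, 10), (32, 5))       # raises: 10 vs 5 at dim 1
```

The [32, 10] vs [32, 5] example from above fails at dimension 1 under exactly this rule.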
The Fix
# For skip connections: use a projection layer
class ResBlock(nn.Module):
def __init__(self, in_ch, out_ch):
super().__init__()
self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
# Add projection if dimensions differ
self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
def forward(self, x):
return self.conv(x) + self.skip(x) # shapes now match
# For attention/feature fusion: ensure dimensions align
# Use Linear to project to matching dimensions
embed_dim must be divisible by num_heads
Why It Happens
Multi-head attention splits the embedding dimension evenly across heads. Each head operates on embed_dim / num_heads dimensions. If this isn't an integer, the split is impossible. For example, embed_dim=512 with num_heads=6 gives 85.33, which isn't valid.
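The valid head counts for a given embed_dim are simply its divisors, which a one-liner can enumerate (plain Python):

```python
def valid_head_counts(embed_dim):
    """All num_heads values that divide embed_dim evenly."""
    return [h for h in range(1, embed_dim + 1) if embed_dim % h == 0]

print(valid_head_counts(768)[:8])  # [1, 2, 3, 4, 6, 8, 12, 16]
print(512 % 6 == 0)                # False -- 512/6 is not an integer, hence the error
print(512 // 8)                    # 64 dims per head, the most common choice
```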
The Fix
# Common valid configurations:
# embed_dim=256: heads=1,2,4,8,16,32,64,128,256
# embed_dim=512: heads=1,2,4,8,16,32,64,128,256,512
# embed_dim=768: heads=1,2,3,4,6,8,12,16,24,32,48,64,96,128,192,256,384,768
# Standard Transformer configurations:
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8) # 512/8=64 per head
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12) # 768/12=64 per head
attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16) # 1024/16=64 per head
grad can be implicitly created only for scalar outputs
Why It Happens
You called .backward() on a tensor with more than one element. Autograd's starting point must be a scalar (single number). If your "loss" is a vector or matrix, PyTorch doesn't know how to start backpropagation because it needs a scalar seed gradient.
The Fix
# Bug: loss is not reduced to scalar
loss = (pred - target) ** 2 # shape [32, 10] -- not scalar!
loss.backward() # ERROR
# Fix: reduce to scalar
loss = ((pred - target) ** 2).mean() # scalar
loss.backward() # works
# If using a loss function, check the reduction parameter:
criterion = nn.MSELoss(reduction='mean') # returns scalar (default)
criterion = nn.MSELoss(reduction='none') # returns per-element loss!
# If reduction='none', manually reduce:
loss = criterion(pred, target).mean()
Prevention
Check loss.shape (it should be torch.Size([])). Use reduction='mean' or 'sum' in loss functions.
Expected N channels but got M channels
Why It Happens
The Conv2d layer's in_channels does not match the number of channels in the input. Most common scenario: using a pretrained model (expects 3 RGB channels) on grayscale images (1 channel), or vice versa.
The Fix
# Option 1: Modify the first conv layer
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3)
# Option 2: Convert grayscale to 3-channel
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Option 3: Repeat channels
x = x.repeat(1, 3, 1, 1) # [B, 1, H, W] -> [B, 3, H, W]
Deserialize on CUDA but torch.cuda.is_available() is False
Why It Happens
A model checkpoint was saved on a GPU machine, and you're loading it on a CPU-only machine (or one where CUDA isn't properly installed). By default, torch.load() tries to restore tensors on their original device.
The Fix
# Always specify map_location when loading
checkpoint = torch.load('model.pt', map_location='cpu')
# Then move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.load_state_dict(checkpoint)
model = model.to(device)
# Best practice when saving:
torch.save(model.state_dict(), 'model.pt') # save state_dict, not full model
# state_dict is more portable and smaller
Prevention
Always pass map_location='cpu' when loading checkpoints. Save state_dict() instead of the full model object for maximum portability.
Methodology
Errors were ranked by combining three signals from Stack Overflow data:
- Question frequency: How many distinct SO questions mention this exact error (collected via SO API v2.3, April 2026).
- View count: Total views across all questions for each error type, indicating how many developers encounter it.
- Vote count: Community votes as a signal of error impact and answer quality.
The final ranking weights frequency (50%), views (30%), and votes (20%). Errors that appear only in niche contexts (specific GPU models, deprecated APIs) were excluded in favor of errors every PyTorch developer will encounter. See the full PyTorch Error Database for all 52 documented errors.
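Concretely, the composite score can be sketched as follows; the min-max normalization and the sample numbers are assumptions for illustration, since the article only specifies the weights:

```python
def rank_score(freq, views, votes, maxima):
    """Weighted composite: frequency 50%, views 30%, votes 20%, each min-max scaled."""
    f_max, vi_max, vo_max = maxima
    return 0.5 * freq / f_max + 0.3 * views / vi_max + 0.2 * votes / vo_max

# Hypothetical signal values for two errors
maxima = (120, 900_000, 4_000)
print(rank_score(120, 900_000, 4_000, maxima))  # ~1.0 -- tops every signal
print(rank_score(60, 450_000, 2_000, maxima))   # ~0.5
```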
Frequently Asked Questions
What is the number one PyTorch error?
"mat1 and mat2 shapes cannot be multiplied" is the most common PyTorch error, accounting for roughly 23% of all shape-related questions on Stack Overflow. It occurs when a Linear layer's in_features does not match the incoming tensor size.
Why do shape mismatch errors dominate?
Shape mismatches account for 35% of all PyTorch errors because neural networks involve many sequential transformations where each layer's output must exactly match the next layer's expected input. A single misconfigured parameter cascades through the entire network.
How can I prevent PyTorch errors before running code?
Use HeyTensor's Chain Mode to trace tensor shapes through your network at design time. For memory planning, use the Memory Calculator. For individual layers, use the specific layer calculators (Conv2d, Linear, LSTM, etc.).
What percentage of PyTorch errors are CUDA-related?
CUDA-related errors (memory, device mismatch, driver issues) account for approximately 35% of all PyTorch errors on Stack Overflow. CUDA out-of-memory alone represents about 19%.
Are in-place operations always bad in PyTorch?
Not always, but they frequently cause gradient errors during training. The memory savings are minimal. Best practice: avoid in-place operations during training, use them only in inference or data preprocessing where gradients are not tracked.
About This Research
This ranking is part of HeyTensor's research series on PyTorch errors and debugging. For the full searchable error database, see the PyTorch Error Database. For statistical analysis and charts, see PyTorch Error Statistics.
For interactive shape calculation, use the Tensor Shape Calculator. For matrix math, visit ML3X. For encoding tools, try KappaKit. For experiment tracking, see EpochPilot.
Contact
Built and maintained by Michael Lip. Email [email protected] or visit the project on GitHub.