The 20 Most Common PyTorch Errors
Ranked by frequency from Stack Overflow data analysis. Each error includes the exact message, why it happens, how to fix it, and how to prevent it. Stop guessing and fix errors in seconds.
By Michael Lip · April 7, 2026 · Based on analysis of 300+ Stack Overflow questions
Jump to Error
- #1 mat1 and mat2 shapes cannot be multiplied
- #2 CUDA out of memory
- #3 Expected all tensors on same device
- #4 Expected 4-dimensional input
- #5 view size not compatible
- #6 expected scalar type Long but found Float
- #7 inplace operation gradient error
- #8 Kernel size can't be greater than input
- #9 Expected input batch_size to match target
- #10 backward through graph a second time
- #11 expected scalar type Float but found Half
- #12 shape is invalid for input of size N
- #13 device-side assert triggered
- #14 Expected hidden size mismatch (LSTM)
- #15 does not require grad and has no grad_fn
- #16 tensor size mismatch at non-singleton dim
- #17 embed_dim must be divisible by num_heads
- #18 grad only for scalar outputs
- #19 expected channels but got N channels
- #20 Deserialize on CUDA but is_available False
mat1 and mat2 shapes cannot be multiplied
Why It Happens
A nn.Linear(in_features, out_features) layer performs matrix multiplication: output = input @ weight.T. The input's last dimension must equal in_features. This error occurs when they don't match, most commonly at the transition from convolutional layers to fully-connected layers. The flattened feature count depends on input spatial dimensions, kernel sizes, strides, and padding -- getting any one wrong cascades to the Linear layer.
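The cascade described above is easy to check by hand. Here is a sketch of the standard Conv2d/MaxPool2d output-size formula in plain Python (no PyTorch required); the three-block layer stack is a hypothetical example, not a specific architecture:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    """Spatial output size of Conv2d/MaxPool2d: floor((in + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Trace a 32x32 input through three conv(k=3, s=1, p=1) + pool(2) blocks
size = 32
for _ in range(3):
    size = conv2d_out(size, kernel=3, padding=1)  # conv keeps size (same padding)
    size = conv2d_out(size, kernel=2, stride=2)   # pool halves it

channels = 64
flat_size = channels * size * size
print(size, flat_size)  # 4 1024 -- so the first Linear needs in_features=1024
```

If `flat_size` disagrees with the `in_features` you wrote, that is exactly the mat1/mat2 mismatch PyTorch reports.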
The Fix
# Step 1: Find the actual flattened size
dummy = torch.zeros(1, 3, 32, 32)
dummy = self.features(dummy) # run through conv layers
print(dummy.shape) # e.g., [1, 64, 4, 4]
flat_size = dummy.view(1, -1).shape[1] # 1024
# Step 2: Set Linear in_features to match
self.classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(flat_size, 256),  # flat_size, not a guess
    nn.ReLU(),
    nn.Linear(256, 10)
)

# Or use LazyLinear (infers in_features automatically):
self.fc = nn.LazyLinear(10) # in_features set on first forward
Prevention
Use nn.LazyLinear to defer shape inference to runtime.
CUDA out of memory
Why It Happens
GPU memory is finite. During training, memory is consumed by: model parameters (weights), gradients (same size as parameters), optimizer states (1-2x parameter size for Adam), forward activations (proportional to batch size and network depth), and PyTorch's caching allocator overhead. A model that fits in memory for inference may OOM during training because gradients and optimizer states multiply memory usage by 3-4x.
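The 3-4x multiplier is simple arithmetic. A back-of-the-envelope sketch for a hypothetical 25M-parameter model trained in float32 with Adam (parameter count is an assumption for illustration):

```python
params = 25_000_000           # hypothetical model size
bytes_per_float32 = 4

weights    = params * bytes_per_float32       # the model itself
gradients  = params * bytes_per_float32       # one gradient per weight
adam_state = 2 * params * bytes_per_float32   # exp_avg + exp_avg_sq per weight

static_mb = (weights + gradients + adam_state) / 1024**2
print(f"{static_mb:.0f} MB before a single activation")  # ~381 MB, ~4x the weights alone
# Forward activations come on top of this and scale linearly with batch size
```

This is why a model that runs inference comfortably can still OOM the moment training starts.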
The Fix
# Solution 1: Reduce batch size (simplest)
loader = DataLoader(dataset, batch_size=8) # was 32
# Solution 2: Mixed precision training (halves activation memory)
scaler = torch.cuda.amp.GradScaler()
for x, y in loader:
    with torch.cuda.amp.autocast():
        loss = model(x, y)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
# Solution 3: Gradient accumulation (effective large batch)
accumulation_steps = 4
for i, (x, y) in enumerate(loader):
    loss = model(x, y) / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
# Solution 4: Gradient checkpointing (trade compute for memory)
from torch.utils.checkpoint import checkpoint
# In forward():
out = checkpoint(self.expensive_layer, input, use_reentrant=False)
Expected all tensors to be on the same device
Why It Happens
PyTorch tensors can live on different devices (CPU, cuda:0, cuda:1, etc.). Operations between tensors on different devices are not supported. Common causes: forgetting to move input data to GPU after moving the model, creating new tensors inside forward() without specifying device, or loading pretrained weights on CPU and forgetting to transfer.
The Fix
# The definitive pattern: use a single device variable
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)
for inputs, targets in dataloader:
    inputs = inputs.to(device)
    targets = targets.to(device)
    output = model(inputs)
    loss = criterion(output, targets)
# Inside model: create tensors on the same device as input
class MyModel(nn.Module):
    def forward(self, x):
        # Bad: mask = torch.zeros(x.size(0))  # CPU!
        # Good:
        mask = torch.zeros(x.size(0), device=x.device)
        return x * mask
Prevention
Define device once at the top of your script. Use .to(device) for model and data. Inside models, always use device=x.device when creating new tensors.
Expected 4-dimensional input for Conv2d
Why It Happens
Conv2d expects input shape [batch, channels, height, width]. When passing a single image for inference, you have [channels, height, width] (3D), missing the batch dimension. This is one of the most common errors when transitioning from training (where DataLoader adds the batch dim) to inference (where you handle a single image).
The Fix
# Add batch dimension for single images
img = transform(pil_image) # [3, 224, 224]
img = img.unsqueeze(0) # [1, 3, 224, 224]
output = model(img)
# Remove batch dimension from output if needed
prediction = output.squeeze(0) # [10] instead of [1, 10]
# For batch of images, stack them:
batch = torch.stack([transform(img) for img in images]) # [N, 3, 224, 224]
Prevention
Always .unsqueeze(0) single samples before passing them to a model. Use HeyTensor's Conv2d Calculator to verify the expected input format.
view size is not compatible with input tensor's size and stride
Why It Happens
The .view() method requires that the tensor occupies a contiguous block of memory. After operations like .transpose(), .permute(), or certain slicing operations, the tensor's memory layout becomes non-contiguous. PyTorch cannot create a new view of non-contiguous memory without copying data.
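To see why a transpose breaks contiguity, here is a plain-Python sketch of the row-major stride rule that .is_contiguous() effectively checks (strides counted in elements, the way PyTorch reports them; this is a simplification that ignores size-1 edge cases):

```python
def row_major_strides(shape):
    """Strides (in elements) of a freshly allocated, contiguous tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.insert(0, step)
        step *= dim
    return tuple(strides)

def looks_contiguous(shape, strides):
    return strides == row_major_strides(shape)

shape = (2, 3, 4)
strides = row_major_strides(shape)           # (12, 4, 1)
print(looks_contiguous(shape, strides))      # True

# transpose(1, 2) swaps dims AND strides -- the data is never moved
t_shape = (2, 4, 3)
t_strides = (12, 1, 4)
print(looks_contiguous(t_shape, t_strides))  # False -> .view() would fail here
```

Because .view() only reinterprets the existing buffer, it refuses any layout whose strides no longer follow this row-major pattern; .reshape() copies when necessary instead.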
The Fix
# Option 1: Call .contiguous() before .view()
x = x.transpose(1, 2).contiguous().view(batch, -1)
# Option 2: Use .reshape() instead (handles non-contiguous automatically)
x = x.transpose(1, 2).reshape(batch, -1)
# Option 3: Use torch.flatten()
x = torch.flatten(x, start_dim=1)
# Check if tensor is contiguous:
print(x.is_contiguous()) # False after transpose
Prevention
Prefer .reshape() over .view() unless you specifically need a view (shared memory). See HeyTensor's View Compatibility Guide for details.
expected scalar type Long but found Float
Why It Happens
PyTorch's classification loss functions (CrossEntropyLoss, NLLLoss) and nn.Embedding require integer (Long/int64) indices, not floating-point values. This error commonly appears when labels come from a CSV or numpy array as floats, or when you accidentally use a regression loss function's target format for classification.
The Fix
# Cast labels to long
labels = labels.long()
# Or create with correct dtype from the start
labels = torch.tensor([0, 1, 2, 0, 1], dtype=torch.long)
# In your Dataset:
class MyDataset(Dataset):
    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        y = torch.tensor(self.labels[idx], dtype=torch.long)
        return x, y
Prevention
Create labels with dtype torch.long. Add .long() in your Dataset's __getitem__. See the Loss Functions Reference for expected dtypes.
Variable modified by inplace operation (gradient error)
Why It Happens
PyTorch's autograd system stores references to intermediate tensors computed during the forward pass. During backpropagation, it needs these exact tensors to compute gradients. In-place operations modify the tensor's data directly, so when autograd looks at the stored reference, the values have changed, making gradient computation incorrect or impossible. PyTorch detects this and raises an error rather than silently computing wrong gradients.
The Fix
# Replace ALL in-place operations with out-of-place versions:
# Instead of:       Use:
# x += y            x = x + y
# x -= y            x = x - y
# x *= y            x = x * y
# x.relu_()         x = x.relu() or x = F.relu(x)
# x.sigmoid_()      x = x.sigmoid()
# x[i] = val        mask-based operations
# x.add_(y)         x = x.add(y)
# x.mul_(y)         x = x.mul(y)
# To find the exact line causing the error:
torch.autograd.set_detect_anomaly(True)
# Then run your training loop -- PyTorch will print the exact operation
Prevention
Avoid in-place operations (methods ending in _) during training. Use torch.autograd.set_detect_anomaly(True) to locate the exact offending line.
Kernel size can't be greater than actual input size
Why It Happens
Each convolution or pooling layer reduces spatial dimensions. After multiple downsampling layers, the feature maps can shrink below the kernel size. This is especially common with small input images (CIFAR-10's 32x32, MNIST's 28x28) when using architectures designed for larger inputs (ImageNet's 224x224).
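You can catch this before PyTorch does by running the size formula until a kernel no longer fits. A plain-Python sketch; the layer list below is a hypothetical stack mirroring the trace in The Fix, not a specific model:

```python
def out_size(size, kernel, stride=1, padding=0):
    """Conv2d/MaxPool2d spatial output size: floor((in + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# (name, kernel, stride, padding) -- four conv+pool blocks, then an unpadded conv
layers = [
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 1), ("pool", 2, 2, 0),
    ("conv", 3, 1, 0),  # no padding -- the problem layer
]

size = 32
for i, (name, k, s, p) in enumerate(layers):
    if size + 2 * p < k:
        print(f"layer {i} ({name}): kernel {k} > input {size}")  # fires here
        break
    size = out_size(size, k, s, p)
```

Running the trace flags layer 8, exactly where a padded kernel or a removed pool would be needed.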
The Fix
# Trace dimensions through your network:
# Input: 32x32
# Conv(k=3, s=1, p=1): 32x32 (same padding)
# Pool(2): 16x16
# Conv(k=3, s=1, p=1): 16x16
# Pool(2): 8x8
# Conv(k=3, s=1, p=1): 8x8
# Pool(2): 4x4
# Conv(k=3, s=1, p=1): 4x4
# Pool(2): 2x2
# Conv(k=3, s=1, p=0): ERROR! 2 < 3
# Fix: add padding, reduce kernel, or remove a pool layer
self.conv5 = nn.Conv2d(256, 256, kernel_size=1) # 1x1 conv
# Or:
self.conv5 = nn.Conv2d(256, 256, kernel_size=3, padding=1) # same padding
Expected input batch_size to match target batch_size
Why It Happens
The model output and target tensors have different batch sizes. This usually means your forward pass accidentally changed the batch dimension (e.g., through a bad reshape), or your DataLoader produces mismatched input/target pairs. Less commonly, it happens when the final batch in an epoch has fewer samples than expected.
The Fix
# Debug: print shapes at every step
def forward(self, x):
    print(f"Input: {x.shape}")
    x = self.features(x)
    print(f"After features: {x.shape}")
    x = x.view(x.size(0), -1)  # use x.size(0), not hardcoded batch
    print(f"After flatten: {x.shape}")
    x = self.classifier(x)
    print(f"Output: {x.shape}")
    return x
# In training loop: verify batch alignment
for inputs, targets in loader:
    assert inputs.size(0) == targets.size(0), \
        f"Batch mismatch: {inputs.size(0)} vs {targets.size(0)}"
Prevention
Always use x.size(0) or x.shape[0] for the batch dimension.
Trying to backward through the graph a second time
Why It Happens
After .backward(), PyTorch frees the intermediate buffers used for gradient computation to save memory. If you call .backward() again on a tensor that shares the same computation graph, those buffers are gone. Common scenarios: computing multiple losses that share the same forward pass, or reusing hidden states in RNN training without detaching.
The Fix
# Best fix: combine losses before backward
output = model(x)
loss_ce = F.cross_entropy(output, targets)
loss_reg = 0.01 * sum(p.pow(2).sum() for p in model.parameters())
total_loss = loss_ce + loss_reg
total_loss.backward() # single backward pass
# If you must backward twice: retain_graph=True
loss1.backward(retain_graph=True) # keeps buffers
loss2.backward() # uses retained buffers
# For RNN: detach hidden state between sequences
for seq in sequences:
    hidden = hidden.detach()  # break graph connection
    output, hidden = rnn(seq, hidden)
Prevention
Combine losses into a single scalar before calling .backward(). For RNNs, detach hidden states between sequences with .detach().
expected scalar type Float but found Half
Why It Happens
Float32 and Float16 tensors are being mixed in an operation. This commonly occurs when manually casting the model to half precision, using AMP incorrectly, or when BatchNorm/LayerNorm layers (which should stay in float32) receive half-precision inputs without autocast.
The Fix
# Best fix: use autocast for automatic dtype handling
with torch.cuda.amp.autocast():
    output = model(x)
    loss = criterion(output, target)
# If using manual half precision, cast inputs too:
model = model.half().cuda()
x = x.half().cuda()
# Keep BatchNorm in float32 (critical for stability):
for module in model.modules():
    if isinstance(module, (nn.BatchNorm2d, nn.LayerNorm)):
        module.float()
Prevention
Use torch.cuda.amp.autocast() for mixed precision. Never manually call .half() on individual layers unless you know what you're doing.
shape 'X' is invalid for input of size N
Why It Happens
The requested reshape dimensions don't multiply to equal the total number of elements in the tensor. For example, if you try to reshape a tensor with 25,088 elements into [32, 784], that would require 32 * 784 = 25,088 elements -- which only works if the batch size is exactly 32 and the feature dimension is exactly 784. If either is wrong, the reshape fails.
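The rule PyTorch applies here is pure arithmetic: the product of the requested dimensions must equal the element count, and a single -1 is filled with whatever makes that true. A plain-Python sketch of that inference (a simplified model of the check, not PyTorch's actual implementation):

```python
from math import prod

def infer_view_shape(numel, shape):
    """Mimic view/reshape size checking, including a single -1 dimension."""
    if shape.count(-1) > 1:
        raise ValueError("only one dimension can be -1")
    known = prod(d for d in shape if d != -1)  # product of the fixed dims
    if -1 in shape:
        if numel % known:
            raise ValueError(f"shape {shape} is invalid for input of size {numel}")
        return tuple(numel // known if d == -1 else d for d in shape)
    if known != numel:
        raise ValueError(f"shape {shape} is invalid for input of size {numel}")
    return tuple(shape)

print(infer_view_shape(25088, [32, 784]))  # (32, 784) -- only works for batch 32
print(infer_view_shape(25088, [16, -1]))   # (16, 1568) -- the -1 adapts
```

The hardcoded variant silently depends on the batch size; the -1 variant never does, which is exactly why The Fix below recommends it.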
The Fix
# Never hardcode reshape dimensions
# Bad:
x = x.view(32, 784)
# Good: use -1 for automatic inference
x = x.view(x.size(0), -1) # batch preserved, features auto-computed
# Even better: use nn.Flatten()
self.flatten = nn.Flatten(start_dim=1)
x = self.flatten(x) # automatically flattens all dims except batch
Prevention
Use -1 in exactly one dimension to let PyTorch infer the size, or use nn.Flatten(). Use the Flatten Calculator to verify dimensions.
CUDA error: device-side assert triggered
Why It Happens
This cryptic error almost always means an index is out of bounds on the GPU. The top causes are: (1) a class label >= num_classes in CrossEntropyLoss, (2) an embedding index >= num_embeddings, (3) a negative index where unsigned was expected. CUDA errors are reported asynchronously, so the Python traceback may not point to the actual line.
The Fix
# Step 1: Get a better error message
import os
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'  # must be set before CUDA initializes
# Or move everything to CPU -- the same bug raises a clear IndexError there
model = model.cpu()
# Re-run the failing code with either change to get a usable traceback
# Step 2: Validate indices
assert labels.min() >= 0, f"Negative label: {labels.min()}"
assert labels.max() < num_classes, f"Label {labels.max()} >= num_classes {num_classes}"
# Step 3: For Embedding
assert indices.max() < embedding.num_embeddings
assert indices.min() >= 0
Expected hidden size mismatch in LSTM/GRU
Why It Happens
LSTM/GRU hidden states have shape (num_layers * num_directions, batch_size, hidden_size). If you initialize hidden states with a fixed batch size (e.g., 1) but pass input with a different batch size (e.g., 32), the dimensions don't match. This also happens with the last batch in an epoch when drop_last=False.
The Fix
# Always derive batch_size from the input tensor
def init_hidden(self, batch_size, device):
    h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size, device=device)
    return (h0, c0)

def forward(self, x):
    batch_size = x.size(0)  # dynamic batch size (assumes batch_first=True)
    hidden = self.init_hidden(batch_size, x.device)
    output, hidden = self.lstm(x, hidden)
    return output
element 0 does not require grad and has no grad_fn
Why It Happens
You called .backward() on a tensor that isn't connected to any differentiable computation. Common causes: (1) using .detach() or .data too early in the computation, (2) creating the tensor with requires_grad=False (the default), (3) performing operations inside torch.no_grad(), (4) converting to numpy and back (breaks the gradient chain).
The Fix
# Check if tensor has gradient tracking
print(loss.requires_grad) # should be True
print(loss.grad_fn) # should not be None
# Common mistake: detaching predictions
pred = model(x).detach() # BREAKS gradient chain!
loss = criterion(pred, y)
loss.backward() # ERROR
# Fix: don't detach
pred = model(x)
loss = criterion(pred, y)
loss.backward() # works
# Common mistake: operations in no_grad
with torch.no_grad():
    output = model(x)
    loss = criterion(output, y)
loss.backward()  # ERROR: output has no grad_fn
Prevention
Never .detach() a tensor that needs gradients. Only use torch.no_grad() for inference/validation, not training.
Tensor size mismatch at non-singleton dimension
Why It Happens
Two tensors in an element-wise operation have dimensions that cannot be broadcast. PyTorch broadcasting requires each dimension pair to either match or be 1. If tensor A has shape [32, 10] and tensor B has shape [32, 5], dimension 1 (10 vs 5) is incompatible. This commonly occurs in skip connections, attention mechanisms, or custom loss functions.
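The broadcasting rule can be sketched in a few lines of plain Python: align shapes from the right, and every dimension pair must either match or contain a 1 (a simplified model of PyTorch's behavior):

```python
def broadcast_shape(a, b):
    """Return the broadcast result shape, or raise like PyTorch does."""
    # Left-pad the shorter shape with 1s, then compare from the trailing dim
    a = (1,) * (len(b) - len(a)) + tuple(a)
    b = (1,) * (len(a) - len(b)) + tuple(b)
    result = []
    for x, y in zip(reversed(a), reversed(b)):
        if x == y or x == 1 or y == 1:
            result.append(max(x, y))
        else:
            raise ValueError(f"size {x} must match size {y} at non-singleton dimension")
    return tuple(reversed(result))

print(broadcast_shape((32, 1), (32, 10)))  # (32, 10) -- the 1 broadcasts
print(broadcast_shape((10,), (32, 10)))    # (32, 10) -- missing dims act as 1
# broadcast_shape((32, 10), (32, 5))       # raises: 10 vs 5 at dim 1
```

The [32, 10] vs [32, 5] example from above fails at dimension 1 under exactly this rule.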
The Fix
# For skip connections: use a projection layer
class ResBlock(nn.Module):
def __init__(self, in_ch, out_ch):
super().__init__()
self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
# Add projection if dimensions differ
self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()
def forward(self, x):
return self.conv(x) + self.skip(x) # shapes now match
# For attention/feature fusion: ensure dimensions align
# Use Linear to project to matching dimensions
embed_dim must be divisible by num_heads
Why It Happens
Multi-head attention splits the embedding dimension evenly across heads. Each head operates on embed_dim / num_heads dimensions. If this isn't an integer, the split is impossible. For example, embed_dim=512 with num_heads=6 gives 85.33, which isn't valid.
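The valid head counts for a given embed_dim are simply its divisors, which a one-liner can enumerate (plain Python):

```python
def valid_head_counts(embed_dim):
    """All num_heads values that divide embed_dim evenly."""
    return [h for h in range(1, embed_dim + 1) if embed_dim % h == 0]

print(valid_head_counts(768)[:8])  # [1, 2, 3, 4, 6, 8, 12, 16]
print(512 % 6 == 0)                # False -- 512/6 is not an integer, hence the error
print(512 // 8)                    # 64 dims per head, the most common choice
```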
The Fix
# Common valid configurations:
# embed_dim=256: heads=1,2,4,8,16,32,64,128,256
# embed_dim=512: heads=1,2,4,8,16,32,64,128,256,512
# embed_dim=768: heads=1,2,3,4,6,8,12,16,24,32,48,64,96,128,192,256,384,768
# Standard Transformer configurations:
attn = nn.MultiheadAttention(embed_dim=512, num_heads=8) # 512/8=64 per head
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12) # 768/12=64 per head
attn = nn.MultiheadAttention(embed_dim=1024, num_heads=16) # 1024/16=64 per head
grad can be implicitly created only for scalar outputs
Why It Happens
You called .backward() on a tensor with more than one element. Autograd's starting point must be a scalar (single number). If your "loss" is a vector or matrix, PyTorch doesn't know how to start backpropagation because it needs a scalar seed gradient.
The Fix
# Bug: loss is not reduced to scalar
loss = (pred - target) ** 2 # shape [32, 10] -- not scalar!
loss.backward() # ERROR
# Fix: reduce to scalar
loss = ((pred - target) ** 2).mean() # scalar
loss.backward() # works
# If using a loss function, check the reduction parameter:
criterion = nn.MSELoss(reduction='mean') # returns scalar (default)
criterion = nn.MSELoss(reduction='none') # returns per-element loss!
# If reduction='none', manually reduce:
loss = criterion(pred, target).mean()
Prevention
Check loss.shape (it should be torch.Size([])). Use reduction='mean' or 'sum' in loss functions.
Expected N channels but got M channels
Why It Happens
The Conv2d layer's in_channels does not match the number of channels in the input. Most common scenario: using a pretrained model (expects 3 RGB channels) on grayscale images (1 channel), or vice versa.
The Fix
# Option 1: Modify the first conv layer
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3)
# Option 2: Convert grayscale to 3-channel
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
# Option 3: Repeat channels
x = x.repeat(1, 3, 1, 1) # [B, 1, H, W] -> [B, 3, H, W]
Deserialize on CUDA but torch.cuda.is_available() is False
Why It Happens
A model checkpoint was saved on a GPU machine, and you're loading it on a CPU-only machine (or one where CUDA isn't properly installed). By default, torch.load() tries to restore tensors on their original device.
The Fix
# Always specify map_location when loading
checkpoint = torch.load('model.pt', map_location='cpu')
# Then move to GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.load_state_dict(checkpoint)
model = model.to(device)
# Best practice when saving:
torch.save(model.state_dict(), 'model.pt') # save state_dict, not full model
# state_dict is more portable and smaller
Prevention
Always pass map_location='cpu' when loading checkpoints. Save state_dict() instead of the full model object for maximum portability.
Methodology
Errors were ranked by combining three signals from Stack Overflow data:
- Question frequency: How many distinct SO questions mention this exact error (collected via SO API v2.3, April 2026).
- View count: Total views across all questions for each error type, indicating how many developers encounter it.
- Vote count: Community votes as a signal of error impact and answer quality.
The final ranking weights frequency (50%), views (30%), and votes (20%). Errors that appear only in niche contexts (specific GPU models, deprecated APIs) were excluded in favor of errors every PyTorch developer will encounter. See the full PyTorch Error Database for all 52 documented errors.
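Concretely, the composite score can be sketched as follows; the min-max normalization and the sample numbers are assumptions for illustration, since the article only specifies the weights:

```python
def rank_score(freq, views, votes, maxima):
    """Weighted composite: frequency 50%, views 30%, votes 20%, each min-max scaled."""
    f_max, vi_max, vo_max = maxima
    return 0.5 * freq / f_max + 0.3 * views / vi_max + 0.2 * votes / vo_max

# Hypothetical signal values for two errors
maxima = (120, 900_000, 4_000)
print(rank_score(120, 900_000, 4_000, maxima))  # ~1.0 -- tops every signal
print(rank_score(60, 450_000, 2_000, maxima))   # ~0.5
```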
Frequently Asked Questions
What is the number one PyTorch error?
"mat1 and mat2 shapes cannot be multiplied" is the most common PyTorch error, accounting for roughly 23% of all shape-related questions on Stack Overflow. It occurs when a Linear layer's in_features does not match the incoming tensor size.
Why do shape mismatch errors dominate?
Shape mismatches account for 35% of all PyTorch errors because neural networks involve many sequential transformations where each layer's output must exactly match the next layer's expected input. A single misconfigured parameter cascades through the entire network.
How can I prevent PyTorch errors before running code?
Use HeyTensor's Chain Mode to trace tensor shapes through your network at design time. For memory planning, use the Memory Calculator. For individual layers, use the specific layer calculators (Conv2d, Linear, LSTM, etc.).
What percentage of PyTorch errors are CUDA-related?
CUDA-related errors (memory, device mismatch, driver issues) account for approximately 35% of all PyTorch errors on Stack Overflow. CUDA out-of-memory alone represents about 19%.
Are in-place operations always bad in PyTorch?
Not always, but they frequently cause gradient errors during training. The memory savings are minimal. Best practice: avoid in-place operations during training, use them only in inference or data preprocessing where gradients are not tracked.
About This Research
This ranking is part of HeyTensor's research series on PyTorch errors and debugging. For the full searchable error database, see the PyTorch Error Database. For statistical analysis and charts, see PyTorch Error Statistics.
For interactive shape calculation, use the Tensor Shape Calculator. For matrix math, visit ML3X. For encoding tools, try KappaKit. For experiment tracking, see EpochPilot.
Contact
Built and maintained by Michael Lip. Email [email protected] or visit the project on GitHub.