PyTorch CUDA Troubleshooting
15 Common Errors Fixed

Covers OOM errors, device mismatches, NCCL timeouts, cuDNN issues, illegal memory access, and more.

RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB [High Impact]

Fix CUDA Out of Memory (OOM)

PyTorch exhausted the available GPU memory (VRAM). Typical causes are a batch size that is too large, gradients that are never cleared between steps, or large intermediate activations being retained for the backward pass.

Fix

# 1. Reduce batch size
batch_size = 16  # Try halving your current size

# 2. Clear gradients every step (not just loss.backward)
optimizer.zero_grad(set_to_none=True)  # Frees memory, not just zeros

# 3. Use mixed precision training
from torch.cuda.amp import autocast, GradScaler
scaler = GradScaler()
with autocast():
    output = model(input)
    loss = criterion(output, target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

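# 4. Gradient checkpointing for large models (model must be an nn.Sequential)
# checkpoint_sequential runs the forward pass and returns the output, not a new model
from torch.utils.checkpoint import checkpoint_sequential
output = checkpoint_sequential(model, segments=4, input=x, use_reentrant=False)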

# 5. Empty cache between operations (debug only)
torch.cuda.empty_cache()
print(torch.cuda.memory_summary())
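
Halving the batch size changes the effective batch the optimizer sees. One common way around that is gradient accumulation, sketched below. This is an illustration rather than part of the original fix list: `accumulated_step` and `micro_batch_size` are hypothetical names, and the weighting assumes a loss with mean reduction.

```python
import torch
from torch import nn


def accumulated_step(model, criterion, optimizer, inputs, targets, micro_batch_size):
    """Run one optimizer step over `inputs`, processed in smaller chunks."""
    optimizer.zero_grad(set_to_none=True)
    total = inputs.shape[0]
    running_loss = 0.0
    for start in range(0, total, micro_batch_size):
        x = inputs[start:start + micro_batch_size]
        y = targets[start:start + micro_batch_size]
        # Weight each chunk by its share of the batch so the summed
        # gradients match one mean-reduced full-batch backward pass.
        loss = criterion(model(x), y) * (x.shape[0] / total)
        loss.backward()  # gradients accumulate in .grad across chunks
        running_loss += loss.item()
    optimizer.step()
    return running_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    demo_model = nn.Linear(8, 1)
    demo_optimizer = torch.optim.SGD(demo_model.parameters(), lr=0.01)
    demo_x, demo_y = torch.randn(32, 8), torch.randn(32, 1)
    print(accumulated_step(demo_model, nn.MSELoss(), demo_optimizer,
                           demo_x, demo_y, micro_batch_size=8))
```

Only one chunk's activations are alive at a time, so peak memory follows `micro_batch_size` while the update still reflects the full batch.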

RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same [High Impact]

Fix Device Mismatch (CPU vs CUDA)

Model weights are on CPU but input tensor is on GPU, or vice versa. Always move both model and data to the same device.

Fix

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move model to device
model = model.to(device)

# Move each batch to device in your training loop
for x, y in dataloader:
    x = x.to(device)
    y = y.to(device)
    output = model(x)  # Now both on same device

# Loading saved checkpoints correctly
model.load_state_dict(torch.load("model.pth", map_location=device))
model.to(device)
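
Batches are often nested (dicts of tensors, tuples with metadata), in which case calling `.to(device)` on each field by hand is easy to get wrong. The helper below is a minimal sketch of one way to handle that; `move_to_device` is a hypothetical name, and it assumes plain dicts, lists, and tuples (not namedtuples or custom batch classes).

```python
import torch


def move_to_device(batch, device):
    """Recursively move tensors in nested dicts, lists, and tuples to `device`."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {key: move_to_device(value, device) for key, value in batch.items()}
    if isinstance(batch, (list, tuple)):
        return type(batch)(move_to_device(item, device) for item in batch)
    return batch  # leave non-tensor values (ints, strings) untouched


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = {"x": torch.randn(4, 3), "meta": ["a", "b"], "pair": (torch.zeros(2), 7)}
batch = move_to_device(batch, device)
print(batch["x"].device, batch["pair"][0].device)
```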

RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp ... Timeout [High Impact]

Fix NCCL Timeout in Distributed Training

A distributed training process stalled or hung, causing the NCCL watchdog to fire after the timeout threshold. Common causes are one rank finishing early, rank skew (some ranks running far behind others), or a deadlock in data loading.

Fix

import os
from datetime import timedelta

import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Enable NCCL debug logging to find the stalling rank
# (set these before the process group is created)
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"

# Raise the collective timeout through init_process_group.
# The default depends on the PyTorch version, so set it explicitly.
dist.init_process_group(
    backend="nccl",
    timeout=timedelta(minutes=60),
)

# Ensure all ranks see the same number of batches
# Use DistributedSampler with drop_last=True
sampler = DistributedSampler(dataset, drop_last=True)
dataloader = DataLoader(dataset, sampler=sampler, drop_last=True)
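
To check that every rank really receives the same number of samples, the sampler can be inspected without launching a distributed job. The sketch below assumes a toy dataset of 10 items and a hypothetical `world_size` of 4. Passing `num_replicas` and `rank` explicitly means no process group is needed.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(10))
world_size = 4  # assumed number of ranks, for illustration only

# Build one sampler per simulated rank and count the indices it yields.
samples_per_rank = [
    len(list(DistributedSampler(dataset, num_replicas=world_size, rank=rank,
                                shuffle=False, drop_last=True)))
    for rank in range(world_size)
]
print(samples_per_rank)
```

If the counts differ between ranks in your own setup (for example with a custom sampler), some ranks will reach the end of the epoch early and the rest will wait in a collective until the timeout fires.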

RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED or version mismatch [Medium]

Fix cuDNN Version Mismatch

PyTorch was compiled against a specific cuDNN version that differs from the installed one. Also triggered by unsupported input sizes or data types with certain cuDNN kernels.

Fix

# Check versions
import torch
print(torch.version.cuda)          # CUDA version PyTorch compiled with
print(torch.backends.cudnn.version())  # cuDNN version

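# Option 1: Disable cuDNN and its autotuner, then fall back to deterministic algorithms
torch.backends.cudnn.enabled = False  # slower native kernels, but confirms cuDNN is the cause
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # may raise for ops without a deterministic implementation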

# Option 2: Use tf32 precision (if on Ampere GPU)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Option 3: Reinstall matching PyTorch + CUDA + cuDNN
# Visit: https://pytorch.org/get-started/locally/
# pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121

RuntimeError: CUDA error: an illegal memory access was encountered [High Impact]

Fix Illegal Memory Access

GPU tried to read/write outside allocated memory. Often caused by out-of-bounds tensor indices, negative values in embedding indices, or class indices exceeding num_classes in CrossEntropyLoss.

Fix

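# 1. Force synchronous kernel launches to get the real error location
# (must be set before the first CUDA call, ideally before importing torch)
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Re-run your code; the traceback now points at the failing operation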

# 2. Check embedding indices are in range [0, num_embeddings)
assert indices.min() >= 0
assert indices.max() < embedding.num_embeddings

# 3. Check CrossEntropyLoss target labels
assert targets.min() >= 0
assert targets.max() < num_classes

# 4. Check for NaN inputs
assert not torch.isnan(tensor).any(), "NaN detected in input"

# 5. Verify contiguous memory before view operations
tensor = tensor.contiguous()
tensor = tensor.view(batch_size, -1)
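
The range checks above can be wrapped in a small helper that reports the offending values instead of a bare `AssertionError`. This is a minimal sketch. `check_index_range` is a hypothetical name, and it does not account for special label values such as the `ignore_index` of `CrossEntropyLoss`.

```python
import torch


def check_index_range(indices, upper_bound, name="indices"):
    """Raise a readable error if any index falls outside [0, upper_bound)."""
    if indices.numel() == 0:
        return
    low, high = int(indices.min()), int(indices.max())
    if low < 0 or high >= upper_bound:
        raise ValueError(
            f"{name} must be in [0, {upper_bound}), got min={low}, max={high}"
        )


# Valid input passes silently
check_index_range(torch.tensor([0, 3, 9]), upper_bound=10, name="input_ids")

# Invalid input produces a message that names the tensor and the bad range
try:
    check_index_range(torch.tensor([1, 10]), upper_bound=10, name="targets")
except ValueError as err:
    print(err)
```

Calling `int()` on a CUDA tensor forces a device synchronization, so a check like this is best kept to debugging runs.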

RuntimeError: CUDA error: no kernel image is available for execution on the device [High Impact]

Fix GPU Architecture Mismatch (sm_ mismatch)

PyTorch binary was compiled for a different CUDA compute capability than your GPU. Common when running a binary compiled for sm_80 (Ampere) on a sm_70 (Volta) GPU.

Fix

# Check your GPU compute capability
import torch
print(torch.cuda.get_device_capability())  # e.g., (8, 6) = sm_86

# Reinstall PyTorch matching your GPU generation:
# RTX 30xx / A100 (Ampere, sm_80/86): cu121
# RTX 20xx / V100 (Turing/Volta, sm_70/75): cu118
# GTX 10xx (Pascal, sm_61): cu118

# Install command (example for CUDA 12.1)
# pip install torch --index-url https://download.pytorch.org/whl/cu121

# Verify after install
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
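
Before reinstalling, it can help to compare the GPU's compute capability with the architectures the installed binary was built for, which `torch.cuda.get_arch_list()` reports. The helper below is a rough heuristic under stated assumptions: `binary_supports` is a hypothetical name, `sm_XY` entries are treated as usable on GPUs of the same major version with an equal or higher minor version, and `compute_XY` (PTX) entries as usable on any equal or newer GPU.

```python
import torch


def binary_supports(capability, arch_list):
    """Heuristic: can a build with `arch_list` run on a GPU with `capability`?"""
    major, minor = capability
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue  # skip suffixed variants such as sm_90a
        arch_major, arch_minor = divmod(int(number), 10)
        if kind == "sm" and arch_major == major and arch_minor <= minor:
            return True  # native binary code for this GPU family
        if kind == "compute" and (arch_major, arch_minor) <= (major, minor):
            return True  # PTX can be JIT-compiled for newer GPUs
    return False


if torch.cuda.is_available():
    capability = torch.cuda.get_device_capability()
    arch_list = torch.cuda.get_arch_list()
    print(capability, arch_list, binary_supports(capability, arch_list))
else:
    print("CUDA not available; nothing to compare")
```

A `False` result here is a strong hint that the installed wheel does not target your GPU and a different build is needed.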

RuntimeError: CUDA driver version is insufficient for CUDA runtime version [Medium]

Fix CUDA Driver vs Runtime Version

The installed NVIDIA driver is too old for the CUDA toolkit version PyTorch requires. Each CUDA version has a minimum driver requirement.

Fix

# Check driver version
# Run in terminal: nvidia-smi
# Check the "CUDA Version" in top right — this is the MAX supported CUDA

# CUDA 12.x requires driver >= 525.60.13 (Linux) / 527.41 (Windows)
# CUDA 11.x requires driver >= 450.80.02 (Linux) / 452.39 (Windows)

# Option 1: Update NVIDIA driver
# Ubuntu: sudo apt install nvidia-driver-535
# Windows: Download from https://www.nvidia.com/Download/index.aspx

# Option 2: Downgrade PyTorch to match your driver
# For CUDA 11.8 (older driver):
# pip install torch --index-url https://download.pytorch.org/whl/cu118

import subprocess
result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
print(result.stdout)
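
Reading the driver version out of the full `nvidia-smi` table by eye is error-prone. The sketch below queries only the driver version and compares it with the CUDA 12.x Linux minimum quoted above. `parse_version` and `installed_driver_version` are hypothetical helper names, and the threshold should be swapped for the one matching your platform and CUDA version.

```python
import shutil
import subprocess


def parse_version(text):
    """Turn '535.104.05' into (535, 104, 5) so versions compare as tuples."""
    return tuple(int(part) for part in text.strip().split("."))


def installed_driver_version():
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().splitlines()
    if result.returncode != 0 or not lines:
        return None
    return parse_version(lines[0])


MIN_DRIVER_CUDA12_LINUX = parse_version("525.60.13")  # minimum quoted above

driver = installed_driver_version()
if driver is None:
    print("nvidia-smi not found or no GPU visible")
else:
    print("driver:", driver, "meets CUDA 12.x minimum:", driver >= MIN_DRIVER_CUDA12_LINUX)
```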

RuntimeError: invalid device ordinal / device not found [Easy Fix]

Fix Invalid CUDA Device Index

The code requested a GPU index that does not exist on this machine. Typical causes are a hardcoded cuda:2 on a single-GPU system, or CUDA_VISIBLE_DEVICES restricting which devices are visible.

Fix

# Check available devices
import torch
print(f"GPUs available: {torch.cuda.device_count()}")
for i in range(torch.cuda.device_count()):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")

# Never hardcode device index — use dynamic selection
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Check CUDA_VISIBLE_DEVICES environment variable
import os
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES", "all"))

# If using SLURM or containers, set it explicitly
# (only takes effect if set before the first CUDA call in the process)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Only expose GPU 0
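
When a script accepts a GPU index from a config file or command line, validating it against `torch.cuda.device_count()` avoids the invalid-ordinal error altogether. The function below is a minimal sketch. `pick_device` is a hypothetical name, and falling back to `cuda:0` is one possible policy, not the only reasonable one.

```python
import torch


def pick_device(requested_index=0):
    """Return cuda:<index> if that GPU exists, otherwise fall back gracefully."""
    if not torch.cuda.is_available():
        return torch.device("cpu")
    count = torch.cuda.device_count()
    if 0 <= requested_index < count:
        return torch.device(f"cuda:{requested_index}")
    print(f"GPU {requested_index} not found ({count} visible); using cuda:0")
    return torch.device("cuda:0")


device = pick_device(2)
print(device)
```

Remember that indices are relative to the visible devices: with `CUDA_VISIBLE_DEVICES=3`, physical GPU 3 appears to PyTorch as `cuda:0`.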

RuntimeError: Expected all tensors to be on the same device, but found at least two devices [Medium]

Fix Mixed Device Tensors in Operations

An operation (addition, concatenation, matrix multiply) received tensors from different devices. The usual cause is a newly created tensor that defaults to CPU while the others are on CUDA.

Fix

# Debug: print the device of every model parameter
for name, param in model.named_parameters():
    print(f"{name}: {param.device}")

# Common culprit: creating new tensors without specifying device
# Wrong:
mask = torch.ones(batch_size, seq_len)  # defaults to CPU!
# Correct:
mask = torch.ones(batch_size, seq_len, device=x.device)

# Use tensor.to(other_tensor.device) to match devices
positions = torch.arange(seq_len).to(x.device)

# Or use device-aware tensor creation
zeros = torch.zeros_like(x)  # inherits device from x
eye = torch.eye(n, device=x.device)
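
For constant tensors that belong to a module (position indices, masks, lookup tables), registering them as buffers makes them follow `model.to(device)` automatically. The module below is an illustrative sketch; `PositionalBias` and its shapes are made up for the example.

```python
import torch
from torch import nn


class PositionalBias(nn.Module):
    def __init__(self, max_len, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # A buffer is moved by model.to(device) just like a parameter,
        # but it is not trained and receives no gradient.
        self.register_buffer("positions", torch.arange(max_len).float().unsqueeze(1))

    def forward(self, x):
        seq_len = x.shape[1]
        # positions already lives on the same device as the weights
        return self.proj(x) + self.positions[:seq_len]


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = PositionalBias(max_len=16, dim=4).to(device)
out = model(torch.randn(2, 5, 4, device=device))
print(out.shape, model.positions.device)
```

A plain attribute such as `self.positions = torch.arange(...)` would stay on the CPU after `.to(device)` and reproduce the error on the first forward pass.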

RuntimeError: CUDA error: device-side assert triggered [High Impact]

Fix Device-Side Assert (Asynchronous Error)

A CUDA kernel assertion failed asynchronously. Because kernels are launched asynchronously, the reported traceback often points at a later, unrelated line. Set CUDA_LAUNCH_BLOCKING=1 to get the actual error location and cause.

Fix

# Step 1: Get the real error location
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
# Re-run — traceback now points to actual failing line

# Common root causes:
# 1. Class index out of range in CrossEntropyLoss
loss_fn = torch.nn.CrossEntropyLoss()
# Targets must be in [0, num_classes-1], not [1, num_classes]
targets = targets - 1  # If 1-indexed, convert to 0-indexed

# 2. Embedding index out of range
assert input_ids.max() < vocab_size, f"Max token {input_ids.max()} >= vocab_size {vocab_size}"

# 3. NaN/Inf in inputs causing assertion failures
torch.autograd.set_detect_anomaly(True)
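
Another way to get a readable message is to replay the failing operation on the CPU, where errors are raised synchronously as ordinary Python exceptions. The sketch below uses a toy embedding with an out-of-range index. The sizes are assumptions chosen for illustration.

```python
import torch
from torch import nn

embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
bad_ids = torch.tensor([1, 5, 10])  # 10 is out of range for 10 embeddings

try:
    # On CPU the lookup fails immediately with an exception that names
    # the problem, instead of an opaque device-side assert.
    embedding.cpu()(bad_ids.cpu())
    reproduced = False
except (IndexError, RuntimeError) as err:
    reproduced = True
    print(f"{type(err).__name__}: {err}")
```

After a device-side assert the CUDA context is left in an unusable state, so restart the Python process before retrying on the GPU.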

RuntimeError: a Tensor with requires_grad=True was used in an in-place operation [Medium]

Fix In-Place Operation on Grad Tensor

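Autograd rejects in-place operations on leaf tensors that require gradients, and it fails during backward if an in-place operation overwrote a value that was saved for the gradient computation. In both cases the in-place modification invalidates the computation graph.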

Fix

# Risky: in-place operation on a tensor that is part of the autograd graph
x = model_output
x += positional_encoding  # In-place! Fails if autograd still needs the old value

# Correct: out-of-place addition
x = x + positional_encoding  # Creates new tensor, safe

# Same issue with indexing:
# Wrong:
output[mask] = 0.0
# Correct:
output = output * (~mask).float()

# Cloning before in-place if needed:
x = x.clone()
x[indices] = new_values
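
The two failure modes can be reproduced on the CPU with a few lines, which helps when deciding which rewrite applies. This is a toy sketch. The tensors and the choice of `exp()` (an op that saves its output for backward) are assumptions made for the demonstration.

```python
import torch

# Case 1: in-place update of a leaf tensor that requires grad
w = torch.ones(3, requires_grad=True)
try:
    w += 1
    leaf_failed = False
except RuntimeError as err:
    leaf_failed = True
    print("leaf:", err)

# Case 2: overwriting a value that backward still needs.
# exp() saves its output to compute the gradient, so editing that
# output in place is only detected when backward() runs.
y = w.exp()
y.add_(1)
try:
    y.sum().backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print("backward:", err)

# The out-of-place version of case 2 works
z = w.exp() + 1
z.sum().backward()
print(w.grad)
```

Case 1 fails at the offending line, while case 2 fails later inside `backward()`, which is why the traceback alone can be misleading.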

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm [High Impact]

Fix cuBLAS Execution Failed

cuBLAS matrix multiplication kernel failed. Often caused by NaN/Inf values in tensors, incompatible tensor shapes, or insufficient GPU memory for the GEMM operation.

Fix

import os
import torch

# 1. Check for NaN/Inf before the failing operation
def check_tensor(t, name="tensor"):
    if torch.isnan(t).any():
        print(f"NaN detected in {name}")
    if torch.isinf(t).any():
        print(f"Inf detected in {name}")
    return t

# 2. Gradient clipping to prevent explosion
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# 3. Use anomaly detection for exact location
torch.autograd.set_detect_anomaly(True)

# 4. Force synchronous execution for debug
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# 5. Check weight initialization
# Kaiming/He init for ReLU networks:
torch.nn.init.kaiming_normal_(layer.weight, mode='fan_out', nonlinearity='relu')

RuntimeError: Error building extension, CUDA_HOME not set / PTX JIT compilation failed [Medium]

Fix CUDA_HOME and Custom Extension Build Errors

Building CUDA C++ extensions (e.g., custom ops, Flash Attention) requires the CUDA toolkit to be installed and CUDA_HOME to point to it. PTX JIT compilation typically fails when the installed driver is older than the toolkit that generated the PTX.

Fix

# Check CUDA_HOME
import os
print(os.environ.get("CUDA_HOME", "NOT SET"))

# Set CUDA_HOME (Linux)
# export CUDA_HOME=/usr/local/cuda
# Or: export CUDA_HOME=$(dirname $(dirname $(which nvcc)))

# Verify nvcc is in PATH
# nvcc --version

# For custom extensions, ensure CUDA toolkit matches PyTorch
import torch
print(torch.version.cuda)  # Must match your nvcc version

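# Confirm PyTorch can locate the toolkit
import torch.utils.cpp_extension
print(torch.utils.cpp_extension.CUDA_HOME)  # Should not be None

# Build with correct arch flags for your GPU (example value for sm_86)
os.environ["TORCH_CUDA_ARCH_LIST"] = "8.6"

# JIT-compile with verbose output to see the full nvcc command;
# for deployment, prefer an ahead-of-time build (setup.py with CUDAExtension)
from torch.utils.cpp_extension import load
my_op = load(name="my_op", sources=["my_op.cu"], verbose=True)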
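
The "toolkit matches PyTorch" check can be scripted by parsing the release number from `nvcc --version` and printing it next to `torch.version.cuda`. The sketch below assumes the usual "release X.Y" wording in the nvcc output. `parse_nvcc_release` and `nvcc_release` are hypothetical helper names.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'major.minor' from nvcc --version output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def nvcc_release():
    """Return the toolkit release of the nvcc on PATH, or None if not found."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


try:
    import torch
    torch_cuda = torch.version.cuda  # None for CPU-only builds
except ImportError:
    torch_cuda = None

print("nvcc:", nvcc_release(), "| torch built with CUDA:", torch_cuda)
```

If the two values differ in their major version, extension builds are likely to fail or produce binaries that do not load.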

RuntimeError: Cannot re-initialize CUDA in forked subprocess / CUDA and multiprocessing [Medium]

Fix CUDA + DataLoader Multiprocessing

Once the parent process has initialized CUDA, it cannot be re-initialized in a child created with fork, which is the default start method on Linux. DataLoader workers that try to use CUDA will fail unless the start method is set to spawn or forkserver.

Fix

# Option 1: Do not use CUDA in DataLoader workers
# Keep all CUDA operations in main process, only use CPU in workers
def worker_fn(x):
    return preprocess_cpu_only(x)  # No CUDA here

# Option 2: Set multiprocessing start method
import torch.multiprocessing as mp
mp.set_start_method("spawn", force=True)

# Option 3: Ensure DataLoader uses spawn
from torch.utils.data import DataLoader
dataloader = DataLoader(
    dataset,
    num_workers=4,
    multiprocessing_context="spawn"
)

# Related tip: with CPU-only workers, pin_memory speeds up CPU->GPU transfers
dataloader = DataLoader(dataset, num_workers=4, pin_memory=True)

RuntimeError: "expected scalar type Half but found Float" in autocast / mixed precision [Easy Fix]

Fix Mixed Precision / AMP Dtype Errors

Tensors of different floating-point precisions (float32 vs float16) are mixed in an operation outside of an autocast context, or autocast is applied inconsistently across the model and the loss.

Fix

import torch
from torch.cuda.amp import GradScaler

scaler = GradScaler()

# Wrap the ENTIRE forward pass in autocast, including loss
with torch.autocast(device_type="cuda", dtype=torch.float16):
    output = model(input)
    loss = criterion(output, target)  # Must be inside autocast

# Scale loss before backward
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)

# For bfloat16 (RTX 30xx, A100: better stability, no scaler needed)
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    output = model(input)
    loss = criterion(output, target)
loss.backward()  # No scaler needed for bfloat16
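
Choosing between float16 and bfloat16 can be automated so the same training step runs on different GPUs. The sketch below is one possible arrangement, not a definitive recipe. `amp_settings` is a hypothetical helper, the tiny model is a stand-in, and the CPU branch exists only so the code can be dry-run without a GPU.

```python
import torch
from torch import nn


def amp_settings():
    """Pick (device_type, autocast dtype, needs_grad_scaler) for this machine."""
    if torch.cuda.is_available():
        if torch.cuda.is_bf16_supported():
            return "cuda", torch.bfloat16, False
        return "cuda", torch.float16, True
    return "cpu", torch.bfloat16, False  # CPU autocast, for a dry run only


device_type, amp_dtype, needs_scaler = amp_settings()
model = nn.Linear(8, 2).to(device_type)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=needs_scaler)  # no-op when disabled
inputs = torch.randn(4, 8, device=device_type)
targets = torch.randn(4, 2, device=device_type)

with torch.autocast(device_type=device_type, dtype=amp_dtype):
    output = model(inputs)  # computed in the reduced-precision dtype
    loss = nn.functional.mse_loss(output.float(), targets)  # loss in float32

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad(set_to_none=True)
print(output.dtype, loss.dtype)
```

The model weights stay in float32 throughout. Autocast only changes the dtype in which eligible operations are computed.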
