What Is the Difference Between ReLU and GELU?
ReLU: max(0, x) — fast, but can cause dead neurons. GELU: x × Φ(x) — smoother; used in BERT/GPT. Rule of thumb: GELU for transformers, ReLU for CNNs.
Mathematical Definitions
# ReLU: sharp cutoff at 0
ReLU(x) = max(0, x)
# x = -2 → 0, x = 0 → 0, x = 3 → 3
# GELU: smooth, probabilistic gate
GELU(x) = x * Φ(x) # Φ = standard Gaussian CDF
# x = -2 → -0.045, x = 0 → 0, x = 3 → 2.996
# Approximation: 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
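Both the exact definition and the tanh approximation above can be checked in a few lines of plain Python (a small sketch using `math.erf`, since Φ(x) = 0.5 · (1 + erf(x/√2))):

```python
import math

def gelu_exact(x: float) -> float:
    # Exact GELU: x * Phi(x), with the Gaussian CDF written via erf:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation quoted above
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, 0.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.4f}  tanh={gelu_tanh(x):+.4f}")
```

Running this reproduces the sample values (-2 → ≈ -0.045, 0 → 0, 3 → ≈ 2.996), and shows the two forms agree to about three decimal places.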
Side-by-Side Comparison
Property | ReLU | GELU
------------------|--------------------|-----------------------
Formula | max(0, x) | x * Φ(x)
Negative inputs | Always 0 | Small negative values
Smoothness | Not smooth at 0 | Smooth everywhere
Dead neurons      | Yes (gradient = 0) | No (gradient never exactly 0)
Speed             | Fastest            | Slower (extra erf/tanh work)
Used in | CNNs (ResNet, VGG) | Transformers (BERT, GPT)
Parameters | 0 | 0
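The dead-neuron row can be verified directly with autograd: for a negative input, ReLU's gradient is exactly zero (no learning signal), while GELU still passes a small gradient. A minimal sketch:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0], requires_grad=True)

# ReLU: gradient is exactly 0 for any negative input
F.relu(x).sum().backward()
print(x.grad)  # tensor([0.])

x.grad = None  # reset before the second backward pass

# GELU: gradient is small but nonzero, so the weight can still update
F.gelu(x).sum().backward()
print(x.grad)
```

Analytically, GELU's gradient at x = -2 is Φ(-2) + (-2)·φ(-2) ≈ -0.085: small, but enough to keep the unit trainable.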
PyTorch Code
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
# ReLU
print(F.relu(x)) # tensor([0., 0., 0., 1., 2.])
# GELU
print(F.gelu(x)) # tensor([-0.0455, -0.1587, 0.0000, 0.8413, 1.9545])
# As modules
relu = nn.ReLU()
gelu = nn.GELU()
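Recent PyTorch versions also expose the tanh approximation from the formula above via the approximate argument of nn.GELU; both variants agree to well under 1e-3 on typical inputs:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

gelu_exact = nn.GELU()                     # default: exact erf-based form
gelu_approx = nn.GELU(approximate='tanh')  # tanh approximation

print(gelu_exact(x))
print(gelu_approx(x))
print((gelu_exact(x) - gelu_approx(x)).abs().max())  # tiny difference
```

The tanh variant is what BERT's original implementation used; on modern hardware the speed difference between the two is usually negligible.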
When to Use Which
- Use ReLU for CNNs (ResNet, VGG, etc.), simple feedforward networks, and when speed matters most
- Use GELU for transformer models, NLP tasks, and when you want smoother gradients
- Use SiLU/Swish (x × sigmoid(x)) as a middle ground — used in EfficientNet and some modern CNNs
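To make the SiLU comparison concrete, the three activations can be evaluated side by side with the functional API (F.silu is PyTorch's built-in Swish):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])

# Compare the three activations at a few sample points
for name, fn in [("ReLU", F.relu), ("GELU", F.gelu), ("SiLU", F.silu)]:
    print(f"{name}: {fn(x)}")
```

Note that, like GELU, SiLU is smooth and lets small negative values through (SiLU(-2) = -2 · sigmoid(-2) ≈ -0.238), so it avoids ReLU's hard zero cutoff.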