What Is the Difference Between ReLU and GELU?

ReLU, max(0, x), is fast and simple but can leave neurons permanently inactive ("dead neurons") once their inputs stay negative. GELU, x * Φ(x), is a smooth, probabilistic alternative used in BERT and GPT. Rule of thumb: prefer GELU in transformers and ReLU where raw speed matters, as in classic CNNs.

Mathematical Definitions

# ReLU: sharp cutoff at 0
ReLU(x) = max(0, x)
# x = -2 → 0,  x = 0 → 0,  x = 3 → 3

# GELU: smooth, probabilistic gate
GELU(x) = x * Φ(x)    # Φ = standard Gaussian CDF
# x = -2 → -0.045,  x = 0 → 0,  x = 3 → 2.996
# Approximation: 0.5 * x * (1 + tanh(√(2/π) * (x + 0.044715 * x³)))
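Both the exact form and the tanh approximation above can be checked with nothing more than Python's math module. A minimal sketch (the function names gelu_exact and gelu_tanh are illustrative, not from any library):

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Φ(x); Φ expressed via the error function:
    # Φ(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # tanh approximation from the text above (used in the original BERT code)
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, 0.0, 3.0):
    print(f"x={x:5.1f}  exact={gelu_exact(x):.4f}  tanh={gelu_tanh(x):.4f}")
```

The two forms agree to about three decimal places everywhere, which is why the cheaper tanh version is often used in practice.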

Side-by-Side Comparison

Property          | ReLU                           | GELU
------------------|--------------------------------|------------------------------------------
Formula           | max(0, x)                      | x * Φ(x)
Negative inputs   | Always 0                       | Small negative values
Smoothness        | Not smooth at 0                | Smooth everywhere
Dead neurons      | Yes (zero gradient for x < 0)  | No (nonzero gradient almost everywhere)
Speed             | Fastest                        | Slightly slower (extra Φ or tanh work)
Used in           | CNNs (ResNet, VGG)             | Transformers (BERT, GPT)
Parameters        | 0                              | 0
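The dead-neuron row is easy to verify with autograd: for negative inputs, ReLU's gradient is exactly zero, while GELU's gradient (Φ(x) + x * φ(x)) stays nonzero. A small sketch assuming PyTorch:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

# ReLU: gradient is exactly 0 for every negative input (the "dead" regime)
F.relu(x).sum().backward()
relu_grad = x.grad.clone()
print(relu_grad)   # tensor([0., 0., 1., 1.])

x.grad = None

# GELU: gradient is small but nonzero even for negative inputs,
# so those units can still receive learning signal
F.gelu(x).sum().backward()
gelu_grad = x.grad.clone()
print(gelu_grad)
```

A neuron whose pre-activations are all negative gets zero gradient through ReLU and can never recover; through GELU it still gets a (small) gradient.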

PyTorch Code

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])

# ReLU
print(F.relu(x))     # tensor([0., 0., 0., 1., 2.])

# GELU
print(F.gelu(x))     # tensor([-0.0455, -0.1587, 0.0000, 0.8413, 1.9545])

# As modules
relu = nn.ReLU()
gelu = nn.GELU()
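Because both are parameter-free modules, swapping one for the other in a model is a one-line change. A toy sketch (the layer sizes here are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

def make_mlp(activation: nn.Module) -> nn.Sequential:
    # Same architecture; only the activation differs
    return nn.Sequential(nn.Linear(8, 16), activation, nn.Linear(16, 4))

x = torch.randn(2, 8)
relu_mlp = make_mlp(nn.ReLU())
gelu_mlp = make_mlp(nn.GELU())
print(relu_mlp(x).shape, gelu_mlp(x).shape)   # torch.Size([2, 4]) for both
```

Note that nn.GELU also accepts approximate='tanh' if you want the cheaper BERT-style approximation instead of the exact erf-based form.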

When to Use Which

- Transformers (BERT, GPT, and similar): GELU is the standard choice.
- CNNs or latency-sensitive inference: ReLU is the cheapest option and usually works well.
- In practice the accuracy difference is often small; following the convention of your architecture family is a reasonable default.
