Recommended Max Batch Size: 32 (per GPU / per accumulation step)
Total Batch Size (with Grad Accumulation): 256 (across accumulation steps)
Memory Utilization: 78% (of available VRAM)
Memory Breakdown
Model Weights: 28.0 GB
Activations (batch): 12.0 GB
Gradients: 28.0 GB
Optimizer States: 14.0 GB
Total Per-Batch: 18.8 GB
Recommendation: Use a batch size of 32 per accumulation step with 8 gradient accumulation steps, for a total batch size of 256 (32 × 8). This keeps VRAM usage at about 78%, leaving roughly 22% headroom for intermediate computations.
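The arithmetic behind the recommendation can be sketched in a few lines of Python. This is a minimal illustration, not the calculator's actual logic: the function names accumulation_steps and effective_batch are hypothetical, and the num_gpus parameter is an assumption added for multi-GPU setups (the figures above correspond to a value of 1).

```python
import math


def accumulation_steps(per_step_batch: int, target_total_batch: int) -> int:
    """Return how many micro-batches to accumulate before one optimizer step.

    Rounds up, so the effective batch is never smaller than the target.
    """
    if per_step_batch <= 0 or target_total_batch <= 0:
        raise ValueError("batch sizes must be positive")
    return math.ceil(target_total_batch / per_step_batch)


def effective_batch(per_step_batch: int, steps: int, num_gpus: int = 1) -> int:
    """Return the number of samples contributing to one optimizer step.

    num_gpus is an assumed extension for data-parallel training.
    """
    return per_step_batch * steps * num_gpus


# Values taken from the recommendation: batch size 32, total batch size 256.
steps = accumulation_steps(per_step_batch=32, target_total_batch=256)
total = effective_batch(per_step_batch=32, steps=steps)
print(f"accumulation steps: {steps}, effective batch size: {total}")
```

Gradient accumulation trades time for memory: only the 32-sample micro-batch has to fit in VRAM at once, while the optimizer step sees gradients from all 256 samples.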