⚡ Mixed Precision Estimator

FP16 / BF16 / TF32 — memory & speed vs FP32 baseline

Model size in millions of parameters (e.g., 350M = 350)
🧠 hidden dim estimated: 4096
📦 Model memory
🎯 Optimizer states
🔁 Activations (est.)
📐 Gradients
💾 Total VRAM
⚡ Speedup vs FP32
📉 Memory savings
Mode | Model | Optimizer | Activations | Total VRAM | Speedup | Savings
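The model, gradient, and optimizer figures listed above can be roughly approximated from per-parameter byte counts. The sketch below is not the estimator's own formula, which this page does not show. It assumes an Adam-style optimizer (two FP32 moments per parameter), gradients stored in the compute dtype, and, for FP16/BF16, an extra FP32 master copy of the weights counted under optimizer states. The names `estimate_state_bytes` and `BYTES_PER_VALUE` are hypothetical. Activations are left out because they depend on batch size, sequence length, and architecture.

```python
# Hypothetical sketch: per-parameter memory accounting for training state.
# Assumptions (not taken from the estimator): Adam optimizer, gradients in the
# compute dtype, FP32 master weights kept alongside FP16/BF16 weights.

BYTES_PER_VALUE = {"fp32": 4, "tf32": 4, "fp16": 2, "bf16": 2}  # TF32 is stored as FP32


def estimate_state_bytes(params_millions, mode="fp32"):
    """Return estimated bytes for weights, gradients, and optimizer states."""
    if mode not in BYTES_PER_VALUE:
        raise ValueError(f"unknown mode: {mode}")
    n_params = int(params_millions * 1_000_000)
    width = BYTES_PER_VALUE[mode]

    model = n_params * width           # weights in the compute dtype
    gradients = n_params * width       # one gradient per weight
    optimizer = n_params * 8           # Adam: two FP32 moments per weight
    if width == 2:
        optimizer += n_params * 4      # FP32 master copy of the weights

    return {
        "model": model,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": model + gradients + optimizer,
    }


if __name__ == "__main__":
    for mode in ("fp32", "tf32", "fp16", "bf16"):
        est = estimate_state_bytes(350, mode)
        print(mode, {k: f"{v / 1024**3:.2f} GiB" for k, v in est.items()})
```

Under these particular assumptions the FP16/BF16 total for weights, gradients, and optimizer states matches FP32 (16 bytes per parameter either way), so any memory savings would come mainly from activations. Other accounting choices, such as omitting the FP32 master copy, give different numbers.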
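# PyTorch AMP / autocast example (FP16 mixed)
# Assumes model, optimizer, loss_fn, and dataloader are defined elsewhere and live on a CUDA device.
import torch
from torch.cuda.amp import autocast, GradScaler  # recent PyTorch releases prefer torch.amp

scaler = GradScaler()  # scales the loss so small FP16 gradients do not underflow
for data, target in dataloader:
    optimizer.zero_grad()  # clear gradients left over from the previous step
    with autocast(dtype=torch.float16):  # forward pass and loss run in FP16 where safe
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()  # backward pass on the scaled loss
    scaler.step(optimizer)  # unscales gradients; skips the step if they contain inf/NaN
    scaler.update()  # adjusts the scale factor for the next iteration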