Model Optimization Calculator
Quantization & Pruning Impact

Enter your model size and optimization technique to estimate size reduction, inference speedup, and accuracy impact. All calculations run in your browser.

Optimization Technique
FP16: 16-bit float
BF16: Brain float16
INT8: 8-bit integer
INT4: 4-bit (GPTQ/AWQ)
INT2/NF4: 2-4 bit mixed
Pruning: Structured/Sparse
Distillation: Student model

Frequently Asked Questions

What is the difference between FP16 and BF16?

Both use 16 bits but split them differently. FP16 (IEEE 754 half precision) has 1 sign bit, 5 exponent bits, and 10 mantissa bits, which gives finer precision but a maximum finite value of only 65,504. BF16 (bfloat16, "Brain Float") has 1 sign bit, 8 exponent bits, and 7 mantissa bits, which gives the same dynamic range as FP32 and far fewer overflows, at the cost of precision. For training, prefer BF16 on GPUs that support it natively: NVIDIA Ampere and newer (A100, RTX 30xx). For inference, both halve memory relative to FP32, usually with minimal accuracy loss.
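The trade-off follows directly from the bit counts. The sketch below is a minimal pure-Python illustration (the helper name float_format_limits is hypothetical, not part of any library) that derives the largest finite value and the machine epsilon of each format from its exponent and mantissa widths.

```python
def float_format_limits(exponent_bits: int, mantissa_bits: int) -> dict:
    """Largest finite value and machine epsilon for an IEEE-style binary float."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable
    # unbiased exponent is equal to the bias.
    max_value = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Epsilon is the gap between 1.0 and the next representable value.
    epsilon = 2.0 ** -mantissa_bits
    return {"max": max_value, "epsilon": epsilon}


fp16 = float_format_limits(exponent_bits=5, mantissa_bits=10)
bf16 = float_format_limits(exponent_bits=8, mantissa_bits=7)

print("FP16:", fp16)  # small range, fine steps
print("BF16:", bf16)  # FP32-like range, coarse steps
```

FP16 tops out at 65,504 with steps of about 0.001 near 1.0, while BF16 reaches roughly 3.4e38 with steps of about 0.008. That is why BF16 rarely overflows during training but rounds more coarsely.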

When should I use INT8 vs INT4 quantization?

INT8 (8-bit integer) is the safest choice: 4x size reduction vs FP32, roughly 2-4x speedup on hardware with INT8 support, and accuracy loss that is typically under 0.5% on most tasks. Use torch.ao.quantization or bitsandbytes. INT4 (4-bit, e.g. GPTQ, AWQ, or the 4-bit quantization types stored in GGUF files) gives close to 8x compression vs FP32 (slightly less once per-group scales are stored), typically with a 1-3% accuracy drop. It is best suited to inference-only deployment of large language models where memory is the bottleneck.
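The compression ratios are simple arithmetic on bits per weight. The sketch below is an illustration only: the function names, the 7B-parameter model, the group size of 128, and the 16-bit scale per group are all assumptions chosen for the example, and the estimate covers weights only (no activations or KV cache).

```python
def model_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_bits(weight_bits: int, group_size: int = 0, scale_bits: int = 16) -> float:
    """Bits per weight including one scale per group (group_size=0: no overhead)."""
    if group_size <= 0:
        return float(weight_bits)
    return weight_bits + scale_bits / group_size


params = 7e9  # hypothetical 7B-parameter model
baseline = model_size_gb(params, 32)  # FP32 reference

formats = [
    ("FP16", 16),
    ("INT8", 8),
    ("INT4", 4),
    ("INT4, group size 128", effective_bits(4, group_size=128)),
]
for name, bits in formats:
    size = model_size_gb(params, bits)
    print(f"{name}: {size:.2f} GB, {baseline / size:.2f}x smaller than FP32")
```

With these assumptions, the group-wise INT4 row lands slightly under the ideal 8x because each group of 128 weights also stores a scale.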

Does pruning actually reduce model file size?

It depends on the pruning type. Unstructured pruning (zeroing individual weights) does not reduce file size unless a sparse storage format is used, and it only reduces compute when the hardware or runtime supports sparse matmul. Structured pruning (removing entire channels, heads, or layers) directly reduces the parameter count and gives real size and speed gains. For real deployment savings, use structured pruning, or pair sparsity with an engine that can exploit it. NVIDIA Ampere sparse Tensor Cores, for example, accelerate the 2:4 pattern (two zeros in every group of four weights) rather than arbitrary unstructured sparsity.
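A toy example makes the distinction concrete. The sketch below uses plain Python lists instead of tensors, and the helper names (magnitude_prune, prune_rows, dense_count) are hypothetical. It compares how many values a dense format still has to store after each kind of pruning.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the smallest-magnitude entries, keep the shape.

    Ties at the threshold are all zeroed, so slightly more than the requested
    fraction can be removed.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * sparsity)  # number of weights to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]


def prune_rows(weights, keep):
    """Structured pruning: keep the `keep` rows (output channels) with largest L1 norm."""
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


def dense_count(weights):
    """Number of values a dense format stores, zeros included."""
    return sum(len(row) for row in weights)


W = [
    [0.9, -0.1, 0.4, 0.05],
    [0.02, 0.03, -0.01, 0.04],
    [-0.7, 0.6, 0.2, -0.3],
    [0.08, -0.5, 0.06, 0.8],
]
unstructured = magnitude_prune(W, sparsity=0.5)
structured = prune_rows(W, keep=2)

print("original:", dense_count(W))
print("unstructured:", dense_count(unstructured))  # same count, half are zeros
print("structured:", dense_count(structured))      # genuinely smaller matrix
```

Half of the entries in the unstructured result are zero, yet a dense file stores just as many values as before. Only the structured result is a smaller matrix.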

What GPU is needed for INT4 inference?

INT4 support on NVIDIA hardware starts with Turing (RTX 20xx, T4), which provides basic INT4; Ampere and newer (A100, RTX 30xx) add fused INT4 GEMM kernels. On consumer GPUs, bitsandbytes 4-bit works on CUDA GPUs with compute capability 7.5 or higher. Apple Silicon runs 4-bit models through the llama.cpp Metal backend. CPU inference (GGUF INT4) works on any modern CPU but is roughly 5-20x slower than a GPU.
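The thresholds above can be encoded as a small lookup. This is a sketch under stated assumptions: the function name and tier labels are hypothetical, and the cut-offs simply mirror the compute capabilities of Turing (7.5) and Ampere (8.0 and up), so check the documentation of your quantization library for its actual requirements.

```python
def int4_support_tier(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the support tiers described in this FAQ."""
    capability = (major, minor)
    if capability >= (8, 0):  # Ampere and newer (A100 is 8.0, RTX 30xx is 8.6)
        return "ampere_or_newer"
    if capability >= (7, 5):  # Turing (T4 and RTX 20xx are 7.5)
        return "turing"
    return "older"


print(int4_support_tier(7, 5))  # T4 / RTX 20xx
print(int4_support_tier(8, 6))  # RTX 30xx
print(int4_support_tier(6, 1))  # Pascal-era consumer GPU
```

On a machine with PyTorch and a CUDA device, torch.cuda.get_device_capability() returns a (major, minor) tuple that could be passed straight into this helper.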

Is this tool free?

Yes. All HeyTensor tools are free, run entirely in your browser, and require no signup or account.