Question 1

What is model quantization?

Accepted Answer

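Model quantization reduces the numerical precision of a model's weights, and in some schemes its activations (e.g., from 32-bit floats to 8-bit integers). This shrinks the memory footprint and can speed up inference on hardware with low-precision support, usually at the cost of a small accuracy loss that grows as the bit-width drops.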
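To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is only one of several possible schemes, and the function names and sample weights are made up for illustration; real libraries work on tensors and often use per-channel or per-group scales.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.82, -1.27, 0.05, 0.4, -0.33]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a single shared float scale, which is where the memory saving comes from. The rounding error printed at the end is the source of the accuracy loss.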

Question 2

How is quantized model size calculated?

Accepted Answer

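Quantized size ≈ number of parameters × bytes per parameter after quantization. For example, a 7B-parameter model in INT8 needs about 7 × 10^9 × 1 byte = 7 GB, while FP32 needs 7 × 10^9 × 4 bytes = 28 GB. Real files are slightly larger because they also store quantization scales and metadata.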
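The arithmetic can be sketched as a small helper. This assumes decimal gigabytes (1 GB = 10^9 bytes) and counts weights only, ignoring scales, metadata, and runtime memory such as the KV cache; `model_size_gb` is an illustrative name, not a library function.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Weight-only estimates for a 7B-parameter model at common precisions.
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {model_size_gb(7e9, bits):.1f} GB")
```

Passing bits rather than bytes lets the same function handle sub-byte formats such as 4-bit.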

Question 3

Which quantization format gives the best compression?

Accepted Answer

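Lower bit-width formats such as INT4, GPTQ 4-bit, or GGUF Q4_K_M give the highest compression: roughly 8× versus FP32 (about 4× versus FP16), compared with 4× for INT8. In practice the ratio is a little lower because stored scales add overhead. The trade-off is a larger accuracy loss than with 8-bit formats, so choose based on your hardware and quality needs.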
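As a rough illustration of how compression ratios come out, the sketch below compares bit-widths and accounts for per-group scale overhead. The group size of 128 and the 16-bit scale are assumptions chosen for the example, not the layout of any specific format, and both function names are hypothetical.

```python
def effective_bits(weight_bits, group_size, scale_bits=16, zero_point_bits=0):
    """Average bits per weight when each group of weights shares one scale
    (and optionally one zero point). The layout is an illustrative assumption."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def compression_ratio(baseline_bits, quantized_bits):
    """How many times smaller the quantized weights are than the baseline."""
    return baseline_bits / quantized_bits


# Ideal ratios, ignoring any overhead.
print(compression_ratio(32, 8))   # INT8 vs FP32
print(compression_ratio(32, 4))   # 4-bit vs FP32
print(compression_ratio(16, 4))   # 4-bit vs FP16

# 4-bit weights with one 16-bit scale per 128 weights (assumed layout).
bits = effective_bits(4, group_size=128)
print(bits, compression_ratio(32, bits))
```

Smaller groups usually preserve accuracy better but raise the effective bits per weight, so the real ratio sits somewhat below the ideal 8×.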