How Quantization Works: From a Matrix Multiplication Perspective

Introduction

Quantization is a widely used technique for accelerating neural network (NN) inference. Most of the computation in NNs comes from convolution, linear layers, and attention, all of which are implemented as GEMM (general matrix multiplication) at the lower level. This blog discusses the principles of quantization from the matrix multiplication perspective, explains why some quantization methods are impractical, and reviews several LLM quantization methods from the same perspective. I define practical quantization as follows: ...

2024-03-06 · 8 min · Monsoon