Michael Hirn
Rounding Algorithms & Quantization
The concept of rounding may be almost as old as numbers themselves. If my own experience is any evidence, humans have an inbuilt recognition of the tradeoff between convenience and precision. Our language and communication are littered with it. When overt precision seems unnecessary and cumbersome, we shave it off. And when we do that with numbers, we call it rounding.
Now, what is interesting is that on closer inspection of this seemingly trivial and commonplace concept, we find a whole universe that leads us to profound places - including intelligence itself.
It is my understanding that rounding, with its many algorithms, is one of the oldest examples of quantization (Sheppard, 1897).
Quantization
A common definition for quantization is this:
Quantization is the division of a quantity into a discrete number of small parts, often assumed to be integral multiples of a common quantity.
For rounding we can express it like this: any real number x can be rounded to the nearest number (e.g. integer). We can express this as q(x). This results in a quantization error e = q(x) - x (e.g. if x = 1.2 and we round to integers, then q(x) = 1 and e = -0.2). Or rearranged: q(x) = x + e.
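A minimal sketch of this in Python, with round-to-nearest standing in for q (the function names here are my own, just for illustration):

```python
# Round-to-nearest as the simplest scalar quantizer q(x),
# with the quantization error defined as e = q(x) - x.

def q(x: float) -> float:
    """Quantize x to the nearest integer (Python's round() ties to even)."""
    return float(round(x))

def quantization_error(x: float) -> float:
    """e = q(x) - x, so that q(x) = x + e."""
    return q(x) - x

print(q(1.2))                   # 1.0
print(quantization_error(1.2))  # approx. -0.2 (note the sign: 1.2 rounds down to 1)
```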
Although quantization and its relationship to rounding were identified as early as 1897, it was not until 1948 that quantization's fundamental role in analog-to-digital conversion was first recognized (e.g. Oliver, Pierce, and Shannon, 1948). From then on, the theory and practice of quantization developed quickly.
There is more to be said about what quantization is, but as I currently understand it, one can trace three distinct quantization techniques:
- fixed-rate scalar quantization (rounding is an example; see the sketch after this list)
- variable-rate quantization, which uses Shannon’s lossless source coding techniques to reduce rate
- predictive and transform coding (scalar quantization + linear processing to exploit source redundancy)
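To make the first track concrete, here is a minimal sketch of a fixed-rate uniform scalar quantizer in Python; the bit width, clipping range, and function name are illustrative choices of mine, not from any particular reference:

```python
# Fixed-rate uniform scalar quantizer: every input maps to one of 2**bits
# levels, so each sample costs the same fixed number of bits (the "rate").

def uniform_quantize(x: float, bits: int = 3, lo: float = -1.0, hi: float = 1.0) -> float:
    """Clip x to [lo, hi], then snap it to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)  # spacing between reconstruction levels
    x = min(max(x, lo), hi)          # clip to the supported range
    index = round((x - lo) / step)   # the integer code that would be stored
    return lo + index * step         # the reconstructed value q(x)

print(uniform_quantize(0.37, bits=3))  # 0.37 snaps to the nearest of 8 levels
```

Variable-rate quantization would then entropy-code that integer index (frequent levels get shorter codes), which is where Shannon's lossless source coding enters.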
Since the mid-1980s, theory and practice have largely adopted vector quantization (over, say, scalar quantization) due to its greater practicality and performance, if I understand correctly. So one may want to view the three tracks above through the lens of vector quantization where applicable.
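For contrast, here is a toy illustration of the vector view: instead of quantizing each coordinate separately, a whole vector is mapped to the nearest codeword in a shared codebook (the codebook below is made up purely for illustration):

```python
import math

# Toy vector quantizer: map an input vector to its nearest codeword.
# Only the codeword's index needs to be stored or transmitted.

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # 4 codewords = 2 bits/vector

def vq(x: tuple[float, float]) -> tuple[float, float]:
    """Return the codeword closest to x under Euclidean distance."""
    return min(codebook, key=lambda c: math.dist(x, c))

print(vq((0.8, 0.2)))  # -> (1.0, 0.0)
```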
Quantization as an Optimization Problem
The above hinted at it already, but quantization, when applied, can be thought of as an optimization problem, i.e. how much information can I sacrifice and still achieve "good-enough" outcomes? Or, in the above terms: what q(x) should I choose to minimize e?
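A classic answer is Lloyd's algorithm (the scalar case of what is often called Lloyd-Max design): alternate between assigning samples to their nearest level and moving each level to the mean of its assignments, which can only decrease the mean squared error. A minimal sketch with made-up sample data:

```python
# Lloyd's algorithm (1-D): alternate between (1) assigning each sample to its
# nearest quantization level and (2) moving each level to the mean of its
# assigned samples. Each step is non-increasing in mean squared error.

def lloyd(samples: list[float], levels: list[float], iters: int = 20) -> list[float]:
    for _ in range(iters):
        # Step 1: partition samples by nearest level.
        cells: list[list[float]] = [[] for _ in levels]
        for x in samples:
            nearest = min(range(len(levels)), key=lambda i: abs(x - levels[i]))
            cells[nearest].append(x)
        # Step 2: move each level to the centroid of its cell.
        levels = [sum(c) / len(c) if c else l for c, l in zip(cells, levels)]
    return levels

data = [0.1, 0.15, 0.2, 0.8, 0.85, 0.9, 0.95]  # made-up samples
print(lloyd(data, levels=[0.0, 1.0]))          # converges to roughly [0.15, 0.875]
```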
Quantization theory, built on the work of Shannon, Bennett, and others, has developed a full answer to this question. I found "Quantization" by Gray and Neuhoff (1998) to be a wonderful introductory source.
Quantization and LLMs
Quantization seems to be becoming a crucial technique in the context of LLMs as well, where it describes (accurately) the rounding of an LLM's weights to decrease its memory footprint (quantization here usually means post-training quantization, PTQ). Recent work by Chris De Sa's team at Cornell introduced QuIP#, which quantizes LLMs to 2-bit (instead of the usual 16-bit) with only minimal reduction in fidelity. In other words, a performance gain (in memory reduction) of 700%, i.e. an 8x smaller footprint, for sacrificing a few basis points of fidelity. That's a phenomenal tradeoff. I was curious where the buck stops, as 2-bit does not seem to be it (fidelity is still high), and came across the ex-MIT team behind Neural Magic.
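QuIP# itself relies on more sophisticated machinery (incoherence processing and lattice codebooks, as I understand it), which is beyond a short sketch. But the naive baseline it improves on, round-to-nearest post-training quantization with a simple per-row absmax scale, fits in a few lines; everything below is illustrative, not QuIP#'s actual method:

```python
import numpy as np

# Naive round-to-nearest PTQ: quantize each row of a weight matrix to
# signed b-bit integers with a per-row scale (absmax scaling). This is the
# simple baseline that methods like QuIP# dramatically improve on at 2-bit.

def quantize_weights(w: np.ndarray, bits: int = 2):
    qmax = 2 ** (bits - 1) - 1                        # e.g. 1 for signed 2-bit
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale                                   # store these instead of w

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)          # stand-in for LLM weights
q, s = quantize_weights(w, bits=2)
err = np.abs(dequantize(q, s) - w).mean()
print(f"mean absolute weight error: {err:.3f}")       # large at 2-bit; hence smarter methods
```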
This write-up does not do the topic of quantization justice, of course. But it establishes prior context, points to some astounding gains with recent LLM technology, and hints at quantization's deep relationship with information and physical limitations (which I have not touched on here yet). I will leave it as a jumping-off point for going deeper, e.g. into "adaptive rounding", which seems to be an important technical precursor to QuIP# and which, I think, will also allow us to deepen our understanding of quantization through some of the above three techniques.