> I won’t go in depth on DEFLATE here (in part because I am not an expert^4)
> 4. If you are an expert, please contact me. I want to learn!!
You already linked to "An Explanation of the DEFLATE Algorithm", which shows that you've been searching on the web already. I'm not sure if it'll make a difference, but here are a few more resources: https://www.euccas.me/zlib/ , https://en.wikipedia.org/wiki/Deflate
DEFLATE is basically huffman(lz(X)), which is obvious enough. The part that no one ever seems to motivate is how precisely you fit those together, i.e. why there is one tree for literals/lengths, another for distances, plus the extra bits.
I should note that it's hardly the best way, but it's easiest to think of DEFLATE as a layered algorithm: you catch repetitions via LZSS and code the remaining information with Huffman. You have two kinds of codes because literals/lengths and distances have very different distributions, so it's beneficial to split them (and it's not surprising to have tens or hundreds of distinct distributions in more sophisticated formats).
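To make the layering concrete, here is a toy sketch (not zlib's actual matcher, which uses hash chains) of a greedy LZSS parse, followed by the split into the two alphabets that DEFLATE builds separate Huffman trees over:

```python
from collections import Counter

def lzss_tokens(data, window=32768, min_len=3):
    """Greedy LZSS parse: emit ('lit', byte) or ('match', length, distance)."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            l = 0
            # overlapping matches (distance < length) are allowed, as in DEFLATE
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l >= min_len and l > best_len:
                best_len, best_dist = l, i - j
        if best_len:
            out.append(('match', best_len, best_dist))
            i += best_len
        else:
            out.append(('lit', data[i]))
            i += 1
    return out

# The two alphabets DEFLATE codes with separate Huffman trees:
lit_len, dist = Counter(), Counter()
for t in lzss_tokens(b"abcabcabcabc"):
    if t[0] == 'lit':
        lit_len[t[1]] += 1           # literal bytes share one tree with lengths
    else:
        lit_len[('len', t[1])] += 1  # match lengths live in the literal tree
        dist[t[2]] += 1              # distances get their own tree
```

Run on b"abcabcabcabc" this produces three literals and one overlapping match of length 9 at distance 3, so the distance alphabet ends up with a single symbol while the literal/length alphabet has four.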
And extra bits are there because longer distances in LZSS are typically opportunistic, so individual values have a low frequency (i.e. Zipfian). Exact distances 1280 and 1281 might each appear only once, but maybe distances 1200--1299 appear frequently enough that you can have a distinct code for that range plus a two-digit adjustment. There are many other ways to model distance distributions; for example, a set of codes for the most recently used distances is common, but DEFLATE is too old for that.
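The "code plus adjustment" scheme is literal in the spec: RFC 1951 (section 3.2.5) defines 30 distance codes, each with a base distance and a fixed number of extra bits that select the exact value within the code's range. A sketch of that mapping in Python, with the tables copied from the RFC:

```python
# Base distance and extra-bit count for DEFLATE's 30 distance codes (RFC 1951).
DIST_BASE = [1, 2, 3, 4, 5, 7, 9, 13, 17, 25, 33, 49, 65, 97, 129, 193,
             257, 385, 513, 769, 1025, 1537, 2049, 3073, 4097, 6145,
             8193, 12289, 16385, 24577]
DIST_EXTRA = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7,
              8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 13, 13]

def distance_code(dist):
    """Map an LZSS distance (1..32768) to (code, n_extra_bits, extra_value)."""
    for code in range(len(DIST_BASE) - 1, -1, -1):
        if dist >= DIST_BASE[code]:
            return code, DIST_EXTRA[code], dist - DIST_BASE[code]
    raise ValueError("distance out of range")
```

So distances 1280 and 1281 both get Huffman code 20 (range 1025--1536), and the 9 raw extra bits encode the offset within that range; only the shared code needs to be frequent enough to earn a short Huffman length. (The 1200--1299 example above is a decimal analogy; the real ranges are the power-of-two-ish ones in the table.)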
https://www.nayuki.io/res/dumb-png-output-java/png-file-form...