> That said, even in this simple case, decoding by hand was a pain.
Well doing it by hand at this level is mostly decoding quasi-ASCII by hand, and the compression isn't making it significantly more painful.
If you reorder the static huffman table and byte align things, the encoded data could look more like:
T O B E O R N O T T 0x85 0x09 T 0x88 0x0F 0xFF
So short lengths become 0x80 + length and short distances encode as themselves.
This version is pretty easy to decode by hand, and still 16 bytes. I'm arguably cheating by removing the 3 bit header, but even 17 bytes would be good.
Though the encoder deciding to repeat those Ts doesn't make much sense, shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?
Many Deflate encoders, including zlib/gzip, are greedy encoders that work forwards, either for speed or to support streaming compression. They encode runs as they are found scanning forward, with limited lookahead to try to allow longer matches to preempt immediately preceding shorter matches. There is an "optimal parse" strategy that maximizes the runs found by processing the entire file backwards.
If you repack the plaintext using zopfli, it does encode the text as you suggest.
Not illustrated here, but the vast majority of deflated data uses a dynamic Huffman code which compresses the Huffman tree itself with another (fixed) Huffman tree. This sort of nested compression is in fact moderately popular; JPEG XL for example uses the same technique.
> That said, even in this simple case, decoding by hand was a pain.
Well doing it by hand at this level is mostly decoding quasi-ASCII by hand, and the compression isn't making it significantly more painful.
If you reorder the static huffman table and byte align things, the encoded data could look more like:
So short lengths become 0x80 + length and short distances encode as themselves.This version is pretty easy to decode by hand, and still 16 bytes. I'm arguably cheating by removing the 3 bit header, but even 17 bytes would be good.
Though the encoder deciding to repeat those Ts doesn't make much sense, shouldn't this be compressed to literal("TOBEORNOT") copy(6,9) copy(9,15)?
Many Deflate encoders, including zlib/gzip, are greedy encoders that work forwards, either for speed or to support streaming compression. They encode runs as they are found scanning forward, with limited lookahead to try to allow longer matches to preempt immediately preceding shorter matches. There is an "optimal parse" strategy that maximizes the runs found by processing the entire file backwards.
If you repack the plaintext using zopfli, it does encode the text as you suggest.
Being greedy is fine here, though. It's the exact same match but somehow cut one byte short.
Not illustrated here, but the vast majority of deflated data uses a dynamic Huffman code which compresses the Huffman tree itself with another (fixed) Huffman tree. This sort of nested compression is in fact moderately popular; JPEG XL for example uses the same technique.