Sunday, November 29, 2015

The Key Missing API in Lossless Data Compressors

There's a key streaming API missing from every lossless codec I've seen. This is the next API going into lzham_codec_devel (what will be LZHAM v1.1). This API bridges the gap between the lossless and lossy worlds, enables some other interesting use cases, and it should be easy to add to most designs.

For some background, the (previously) complete set of lossless compression library API's are:

CompressMemoryToMemory() - comp buffer in memory to another buffer
DecompressMemoryToMemory() decomp buffer in memory to another buffer
GetCompressBound()- returns max possible comp size given size of data to compress

CreateCompressContext() - create new comp context
DestroyCompressContext() - destroy comp context
ResetCompressContext() - reset comp context, reusing allocated memory
CompressContinue() - compress some bytes from input to output buf
CompressFlush(bool end) - forcibly flush comp, generating output

CreateDecompressContext() - create new decomp context
DestroyDecompressContext() - destroy decomp context
ResetDecompressContext() - reset decomp context, reusing allocated memory
DecompressContinue() - decompress some bytes from input to output buf

The missing streaming API is:

double CompressQuery(comp_ctx *pCtx, const void *pBuf, size_t size)

This function efficiently computes the compressed size, in fractional bits (and/or integer bytes) of the specified buffer using the current compression context. Importantly, the current compression context (entropy coding state, sliding dictionary, statistical models, etc.) is not modified. 

This API basically gives you an upper bound on how many compressed bits would be added to the output given a particular input. (It's an upper bound, not exact, because the flush imposes a hard artificial LZ phrase boundary on the output.)

This API can be inefficiently emulated to some degree on streaming compressors that support flushing, except you'll have to settle for only integer byte results, and put up with a full recompress before each query. CompressQuery() is superior because it can give you fractional bit results, it doesn't need to fully update its statistical models, or even fully entropy code the output (it just has to compute how many bits it would output, which codecs like LZMA/LZHAM can do today because they must compute accurate "bit prices" during near-optimal parsing).

CompressQuery() should be implementable in any lossless compressor, not just LZ based ones. Typically, lossless compression is viewed as some black box that occurs after you've generated some data. With this API you can now intimately interact with the compression engine in order to choose the set of data that leads to higher compression.

Example ideas of what you can now do with this API:

1. Rate distortion optimized (RDO) DXTc/PVRTC/etc. compression (i.e. like crunch)

A typical DXTc block compressor evaluates hundreds to thousands of possible packed block candidates, many of them with very similar or virtually the same PSNR's.  A simple RDO DXTc compressor would compute a list of candidate DXTc blocks for each input block, query the backend lossless compressor on each candidate block to determine how many bits would be added to the compressed output, then choose the encoding that strikes the best balance between coded bits and quality. The block compressor then codes (or "commits") this specific block to the compressed output stream by calling CompressContinue(), then continues to the next block and starts the process over again.

This is just local optimization. A more advanced version would use a dynamic programming approach to look ahead multiple blocks (like LZMA or LZHAM's parsers do) to build a graph so the best combination of blocks can be chosen that best balances compressed bits vs. PSNR.

I have already done several promising experiments in this area on DXTc textures while writing crunch. Interestingly, this approach is compatible with virtually any block based format.

2. Universal prediction engine
Honestly this usage is pretty far out and speculative. Here's one possibility, in the context of a real-time or turn based game:

Each frame, encode the position of a player character into a C-style POD fixed size data structure (let's call them records). Compress the raw record bytes by calling CompressContinue(). Simple enough.

Now here's where things get interesting. Let's say we want to try and determine the probability that the character will be at position X on the next frame. Evaluate the next X possible legal gamespace positions the character can be in, and encode these positions into records like usual. Now iterate through each possible legal position's record and call CompressQuery() on each record's serialized struct. 

The return value will be how many fractional bits are needed to encode each structure given the compressor's current context. The more bits needed to encode a record, the higher the record's entropy, and the less likely (or more "surprising") the position is to the compressor. More probable (less surprising) records will require fewer bits. 

Once the codec forms a decent model of the input records it should be able to predict the next position with (hopefully) reasonable certainty. This approach could be quite interesting given a sophisticated enough statistical modeling system and entropy coding backend. (Or not, I haven't tried it yet.)

3. bsdiff-style preprocessing for delta compression
bsdiff is a LZ-like approach for creating patch files. It consists of a command stream, a delta byte stream, and a literal byte stream, which can all be separately compressed (as bsdiff.exe does using bzip2). Importantly, there is no single "right" way of encoding a patch stream, i.e. there are many possible ways of generating fully valid patch command streams.

CompressQuery() can be called while composing the various streams in order to determine the most optimal set of commands/delta bytes/literal bytes to generate, in order to minimize the resulting compressed patch file size.

4. Optimized PNG-like lossless image compression
The typical PNG compressor adaptively chooses the filter to use on each scanline which minimizes the sum of absolute errors. A better metric would be, for each filter, to call CompressQuery() to determine how many compressed bits would be output if that filter was selected, and choose the filter that results in the fewest bits.

5. Basically any data that will be losslessly compressed that has multiple valid or usable encodings (mesh vertex data, curve fitted animation data, VQ data, etc.) could benefit from tightly coupling the data generation process with the backend lossless compressor.

Important aspects of LZHAM's design

I'm going to go through many of the major lossless codecs (LZMA, Zstd, LZ4, Deflate, bzip2, PAQ, etc.) and list the features and properties that made them unique or interesting, especially when first released. Let's start with LZHAM (yes I'm shamelessly beating my own drum here, but hey it's my tech blog). I think it's very important and interesting to understand the past.

LZHAM alpha was first released in Aug. 15, 2010 (according to Google Code), but the fast entropy decoder experiments and classes were written in early 2009 (before I joined Valve). At the time, the practical lossless data compression community didn't seem to have much focus or direction. They were kinda all over the map, and Charle Bloom's excellent reverse engineering of LZMA did not occur until after LZHAM's public release.

This codec was designed for next-generation video games, basically titles I thought would be eventually made with Source 2. Valve was awesome at allowing developers to work on open source and even commercial projects at home. The team didn't think data compression was an important thing to work on, so I decided to work on it on my spare time.

For some background, I was not able to use LZMA on Halo Wars because it was incredibly slow on X360, and Microsoft Game Studios stopped using my internal highly X360 optimized Deflate codec ("eslib") and switched to LZX. I used 7-zip on the Halo Wars build server, and was very impressed with its ratio, especially when in Deflate mode. I always wondered how it was able to achieve such high ratios when compressing to the old Deflate format, and I wanted to understand why.

Some of the major features it demonstrated:

- Micro-threaded compressor
Dictionary updating, match finding, and parsing all in parallel.
A lock-free approach is used to communicate between parser threads and match finder threads.
The usual approach to threading a compressor blocks up the input and sacrifices ratio, which is not necessary with the correct design.
Inspired by my experience writing the multithreaded Halo Wars engine, and the lock free stuff was inspired by experiments I was seeing done on Source 2's graphics engine.

- Interleaved coding
Huffman and binary arithmetic coding interleaved into the same bitstream. The compressor batches all symbols and simulates the entropy decoding steps the decompressor will use in order to figure out how to interleave the output bitstream.

I came up with this design because I wanted a simple symbol_codec class that supported totally free form usage of arithmetic, Huffman, and raw bits. This class was inspired by Amir Said's excellent papers and sample code. I tested it on a laptop and just keep on optimizing it for higher decoding performance over a few weeks time.

LZHAM also showed that Huffman coding still had legs in high ratio codecs. Very low or high probability symbols (what I called high "skew" symbols), where Huffman's prefix coding limitations are most obvious, can use fast and simple binary arithmetic coding, while everything else can be done with static Huffman coding, with bulk table updating for adaption. Also around this time, Andrew Polar showed it was possible to quickly update prefix codes.

- Best of X arrivals parsing (called "extreme" parsing in the code)
This was obvious after figuring out how to construct a parse graph.
Inspired by the path finding algorithms used in games.

- Other things it did that I think are important:
zlib compatible API - It's the standard "universal" lossless compression API, it makes no sense not to support it. To my knowledge LZHAM and miniz were the first to try and copy zlib's API.
Streaming support - I question how useful this is to many developers, but you need it otherwise you're limited to available RAM or have to use blocking which hurts ratio.
Seed dictionaries - Occasionally valuable.
Every update was thoroughly tested before pushing the code. Random failures or crashes = the kiss of death for a new codec trying to be accepted.

For LZHAM I decided that the best way to get noticed as adding value in a very competitive space was to match LZMA's ratio as closely as possible and just move "right" (faster) on the decompression speed/ratio Pareto frontier. I purposely de-emphasized the compression speed/ratio frontier, favoring offline compression.

One critical mistake I made in the alphas was optimizing too much for the Large Text Compression Benchmark, which is 100MB of Wikipedia text. This led to me going down a blind alley with higher order coding experiments, which used way too many Huffman tables.

Friday, November 27, 2015

Future Directions in Lossless Compression

My current guesses on where this field could go. This is biased towards asymmetric codecs (offline compression for data distribution, not real-time compression/decompression).

Short Term

LZ4: Higher ratio using near-optimal parsing but same basic decompressor/instruction set (note I doubt the LZ4 arrow can move up as much as I've illustrated).

LZHAM: Faster decompression by breaking out literals/delta literals/matches to separate entropy coded blocks, switch to new entropy decoder. Other ideas: multithreaded entropy decoding, combine multiple binary symbols into single non-binary symbols.

ZSTD: Refine current implementation: stronger compressor, profile and optimize all loops.

BROTLI: Brotli's place on the decompression frontier is currently too fuzzy on the ratio axis, so an easy prediction is the compressor will get tightened up. Its current entropy coding design may have trouble expanding to the right much further. (The same situation as LZHAM. The fast entropy coding space is moving rapidly right now.)

Long Term

New Territory: Theoretical future "holy grail" codec which will obsolete most other codecs. Once this codec is on the scene most others will be as relevant as compress. If you are working in the compression space commercially this is where you should be heading.

Note the circle is rough. I tried to roughly match Brotli's ratio in region 4, but it could go higher to be closer to LZMA/LZHAM.

Some ideas: blocked interleaved entropy coding with SIMD optimizations, entropy decoding in parallel with decompression, near-optimal parsing with best of X arrivals, cloud compression to search through hundreds of compression options, universal preprocessing, LZMA-like instruction set/state machine, rep matches with relative distances, partial matches with compressed fixup sideband.

Until very recently I thought codecs in this region were impossible. (Now, just incredibly challenging!)

Another interesting place to target is to the direct right of region 3 (or directly above region 2). Target this spot too and you've redefined the entire frontier.

Interesting Recent Developments

- Key paper showing several very promising paths forward in the entropy decoding space:
- Squash benchmark - Easily explore the various frontiers with a variety of test data and CPU's.

- Squash library - Universal compression library wrapping over 30 codecs behind a single API with streaming support.

- Zstd - Promising new codec showing interesting ways of breaking up the usual monolithic decoder loop into separate blocks (i.e. it decouples entropy decoding out from the main decoder loop)

Thursday, November 26, 2015

Quick Survey of the Lossless Decompression Pareto Frontier

I first learned about the compression "Pareto Frontier" concept on Matt Mahoney's Large Text Compression Benchmark page. Those charts are for compression throughput vs. ratio, not decompression throughput vs. ratio which I personally find far more interesting. Simple charts like this allow engineers to judge at a glance what codec(s) they should consider for specific use cases.

This chart was generated by the Squash Benchmark (options selected: Core i5-2400, 20.61MB of tarred Samba source code). Using exclusively text data is sub-optimal for a comparison like this, but this was one of the larger files in the Squash corpus. The ugly circles are my loose categorizations (or clusterizations):

1. Speed Kings, compression and decompression throughput >= disk read rate
Examples: LZ4, Snappy
Typical properties: low memory, low ratio, block-based API, symmetrical (super fast compression/decompression)

2. High ratio, decompression throughput >= avg network read rate
Examples: LZMA, LZHAM, LZNA, Brotli
Properties: Asymmetrical (compression is kinda slow but tolerable), decompression is fast enough to occur in parallel with network download (so it's "free"), ideal codec has best ratio while being just fast enough to overlap with decompression

3. Godlike ratio, decompression throughput < avg network rate
Examples: PAQ series
Properties: Needs godly amounts of RAM to match its godly ratio, extremely slow (seconds per mb), symmetrical

Possible use is for data that must be transmitted to a remote destination with lots of compute but an extremely limited network or radio link (deep space?)

Note: I made 3's circle on the ratio axis very wide because in practice I've seen PAQ's ratio tank on many binary files.

4. Intermediate ratio, decompression throughput > avg_network, but < disk read rate
Examples: zlib, zstd
Major property: Symmetrical (fast compression and decompression), very low to reasonable amount of RAM

Outliers/wildcards that defy simple categorization:

Examples: Heatshrink for embedded CPU's, or RAM compression
Properties: Low ratio, work memory: fixed, extremely low, or none, code size: usually tiny


- brotli appears to have pushed the decompression frontier forward and endangered (obsoleted?) several codecs. It's even endangering several region 4 codecs (but its compressor isn't as fast as the other region 4 codecs)

Right now Brotli's compressor is still getting tuned, and will undoubtedly improve. It's currently weak on large binary files, and its max dictionary size is not big enough (so it's not as strong for large files/archives, and it'll never fare very well on huge file benchmarks until this is fixed). So its true position on the frontier is fuzzy, i.e. somewhat dependent on your source data.

- zstd is smack dab in the middle of region 4. If it moves right just a bit more (faster decompression) it's going to obsolete a bunch of codecs in its category.

If zstd's decompressor is speeded up and it gets a stronger parser it could be a formidable competitor.

Currently brotli is putting zstd in danger until zstd's decompressor is further optimized.

- brotli support should be added to 7-zip as a plugin. Actually, probably all the major Decompression Frontier leaders should be added to 7-zip because they all have value in different usage scenarios.

- LZHAM must move to the right of this graph or it's in trouble. Switching the literals, delta literals, and the match/len symbols over to using Zstd's blocked coding scheme seems like the right path forward.

- Perhaps the "Holy Grail" in practical lossless compression is a region 1 codec with  region 2-like ratio. (Is this even possible?) Maybe a highly asymmetrical codec with a hyper-fast SIMD entropy decoder could do it.

Saturday, November 21, 2015

LZHAM custom codec plugin for 7-zip v15.12

7-zip is a powerful and reliable command line and GUI archiver. I've been privately using 7-zip to thoroughly test the LZHAM codec's streaming API for several years. There's been enough interest in this plugin that I've finally decided to make it an official part of LZHAM.

Why bother using this? Because LZHAM extracts and tests much faster, around 2x-3x faster than LZMA, with similar compression ratios.

Importantly, if you create any archives using this custom codec DLL, you'll (obviously) need this DLL to also extract these archives. The LZHAM v1.x bitstream format is locked in stone, so future DLL's using newer versions of LZHAM will be forwards and backwards compatible with this release.

You can find the source to the plugin on github here. The plugin itself lives in the lzham7zip directory.


I've uploaded precompiled DLL's for the x86 and x64 versions of 7-zip v15.12 here.

To use this, create a new directory named "codecs" wherever you installed 7-zip, then copy the correct DLL (either x86 or x64) into this directory. For example, if you've installed the 32-bit version of 7-zip, extract the file LzhamCodec_x86.dll into "C:\Program Files (x86)\7-Zip\codecs". For the 64-bit version, extract it into  "C:\Program Files\7-Zip\codecs".

To verify the installation, enter "7z.exe i" in a command prompt (cmd.exe) to list all the installed codecs. You should see this:

 0  ED  6F00181 AES256CBC
 1  ED  4F71001 LZHAM

Build Instructions

If you want to compile this yourself, first grab the source code to 7-zip v15.12 and extract the archive somewhere. Next, "git clone" into this directory. Your final directory structure should be:

11/21/2015  05:00 PM    <DIR>          .
11/21/2015  05:00 PM    <DIR>          ..
11/21/2015  05:00 PM    <DIR>          Asm
11/21/2015  05:00 PM    <DIR>          bin
11/21/2015  05:00 PM    <DIR>          C
11/21/2015  05:00 PM    <DIR>          CPP
11/21/2015  05:00 PM    <DIR>          DOC
11/21/2015  05:00 PM    <DIR>          lzham_codec_devel

Now load this Visual Studio solution: lzham_codec_devel/lzham7zip/LzhamCodec.sln.

It builds with VS 2010 and VS 2015. The DLL's will be output into the bin directory.

Usage Instructions

Example command line:

7z -m0=LZHAM -ma=0 -mmt=8 -mx=9 -md=64M a temp.7z *.dll

For maximum compression, use the 64-bit version and use:

7z -m0=LZHAM -ma=0 -mmt=8 -mx=9 -md=512M a temp.7z *.dll

Command line options (also see this guide):

-m0=LZHAM - Use LZHAM for compression
-ma=[0,1] - 0=non-deterministic mode, 1=deterministic (slower)
-mmt=X - Set total # of threads to use. Default=total CPU's.
-mx=[0-9] - Compression mode, 0=fastest, 9=slowest
-md={Size}[b|k|m] - Set dictionary size, up to 64MB in x86, 512MB in x64

Unfortunately, the file manager GUI doesn't allow you to select LZHAM via the UI. Instead, you must specify custom parameters: 0=bcj 1=lzham mt=8

Note if you don't specify "mt=X", where X is the number of threads to use for compression, LZHAM will just use whatever value is in the GUI's "Number of CPU threads" pulldown (1 or 2 threads), which will be very slow.

Thursday, November 12, 2015

Visualizing the Calgary compression corpus

The Calgary corpus is a collection of text and binary files commonly used to benchmark and test lossless compression programs. It's now quite dated, but it still has some value because the corpus is so well known.

Anyhow, here are the files visualized using the same approached described in my previous blog post. The block size (each pixel) = 512 bytes.

paper1 and paper6 have some interesting shifts at the ends of each file, which corresponds to the bottom right section of the images. Turns out these are the appendixes, which have very different content vs. the rest of each paper's content.






















Wednesday, November 11, 2015

Visualization of file contents using LZ compression

I love tools that can create cool looking images out of piles of raw bits. An alternative title for this blog post could have been "A data block content similarity metric using LZ compression".

These images were created by a new tool I've been working on named "fileview", a file data visualization and debugging tool similar to Matt Mahoney's useful FV utility. FV visualizes string match lengths and distances at the byte level, while this tool visualizes the compressibility of each file block using every other block as a static dictionary. Tools like this reveal high-level file structure and the overall compressibility of each region in a file. These images were computed from single files, but it's possible to build them from two different files, too.

Each pixel represents the ratio of matched bytes for a single source file block. The block size ranges from 32KB-2MB depending on the source file's size. To compute the image, the tool compresses every block in the file against every other block, using LZ77 greedy parsing against a static dictionary built from the other block. The dictionary is not updated as the block is compressed, unlike standard LZ. Each pixel is set to 255*total_match_bytes/block_size.

The brighter a pixel is in this visualization, the more the two blocks being compared resemble each other in an LZ compression ratio sense. The first scanline shows how compressible each block is using a static dictionary built from the first block, the second scanline uses a dictionary from the 2nd block, etc.:

X axis = Block being compressed
Y axis = Static dictionary block

The diagonal pixels are all-white, which makes sense because a block LZ compressed using a static dictionary built from itself should be perfectly compressible (i.e. just one big match).

- High-resolution car image in BMP format:

Source image in PNG format:

- An old Adobe installer executable - notice the compressed data block near the beginning of the file:

- A test file from my data corpus that caused an early implementation of LZHAM codec's "best of X arrivals" parser to slow to an absolute crawl:

- enwik8 (XML text) - a highly compressible text file, so all blocks highly resemble each other:

- Test file (cp32e406.exe) from my corpus containing large amounts of compressed data with a handful of similar blocks:

- Hubble space telescope image in BMP format:

Source image in PNG format:

- Large amount of JSON text from a game:

- Unity46.exe:

- Unity demo screenshot in BMP format:

- English word list text file:

Monday, November 2, 2015

More on bsdiff and delta compression

bsdiff is a simple delta compression algorithm, and it performs well compared to its open and closed source competitors. (Note bsdiff doesn't scale to large files due to memory, but that's somewhat fixable by breaking the problem up into blocks.) It also beats LZHAM in static dictionary mode by a large margin, and I want to understand why.

It conceptually works in three high-level phases:

1. Build suffix array of original file

2. Preprocess the new file, using the suffix array to find exact or approximate matches against the original file. This results in a series of 24 byte patch control commands, followed by a "diff" block (bytes added to various regions from the original file) and an "extra" block (unmatched literal data).

3. Separately compress this data using bzip2, with a 512KB block size.

As a quick experiment, switching step 3 to LZMA or LZHAM results in easy gains (no surprise as bzip2 is pretty dated):

Original file: Unity510.exe 52,680,480
New file: Unity512.exe 52,740,568

bsdiff preprocess+lzma: 3,012,581
bsdiff preprocess+lzham_devel: 3,152,188
bsdiff preprocess+bzip2: 3,810,343
lzham_devel (in delta mode, seed dict=Unity510.exe): 4,831,025

The bsdiff patch control blocks consist of a series of Add/Copy/Skip triples (x, y, z). Each command consists of three 8-byte values:

- Add x bytes from the original file to x bytes from the diff block, write results to output
(Literally - add each individual byte.)

- Copy y bytes from the extra block to output

- Skip forwards or backwards z bytes in the original file

Most of the output consists of diff bytes in this test:

Commands: 33,111 (794,664 bytes)
Diff bytes: 51,023,396
Extra bytes: 1,717,172

The bsdiff control block data can be viewed as a sort of program that outputs the new file, given the old file and a list of "addcopyskip x,y,z" instructions along with three automatically updated machine "registers": the offsets into the old file data, the diff bytes block, and the extra bytes blocks. bsdiff then compresses this program and the two data blocks (diff+extra) individually with bzip2.

I think there are several key reasons why bsdiff works well relative to LZHAM with a static dictionary (and LZMA too, except it doesn't have explicit support for static dictionaries):

- LZHAM has no explicit way of referencing the original file (the seed dictionary). It must encode high match distances to reach back into the seed dictionary, beyond the already decoded data. This is expensive, and grows increasingly so as the dictionary grow larger.

- bsdiff is able to move around its original file pointer either forwards or backwards, by encoding the distance to travel and a sign bit. LZHAM can only used previous match distances (from the 4 entry LRU match history buffer), or it must spend many bits coding absolute match distances.

- LZHAM doesn't have delta matches (bsdiff's add operation) to efficiently encode the "second order" changes talked about in Colin Percival's thesis. The closest it gets are special handling for single byte LAM's (literals after matches), by XOR'ing the mismatch byte vs. the lookahead byte and coding the result with Huffman coding. But immediately after a LAM it always falls back to plain literals or exact matches.

- The bsdiff approach is able to apply the full bzip2 machinery to the diff bytes. In this file, most diff bytes are 0 with sprinklings of 2 or 3 byte deltas. Patterns are readily apparent in the delta data.

After some thought, one way to enhance LZHAM with stronger support for delta compression: support multibyte LAM's (perhaps by outputting the LAM length with arithmetic coding when in LAM mode), and maybe add the ability to encode REP matches with delta distances.

Or, just use LZHAM or LZMA as-is and just do the delta compression step as a preprocess, just like bzip2 does.

A few observations about Unity

I've only been at Unity Technologies for a month, and obviously I have much to learn. Here are a few things I've picked up so far:

High programmer empathy, low ego developers tend to be successful here.

Interesting company attributes: Distributed, open source style development model, team oriented, high cooperation. Automated testing, good processes, build tools, etc. are very highly valued. "Elegant code" is a serious concept at Unity.

Hiring: Flexible. Unity has local offices almost everywhere, giving it great freedom in hiring.