Saturday, November 21, 2015

LZHAM custom codec plugin for 7-zip v15.12

7-zip is a powerful and reliable command line and GUI archiver. I've been privately using 7-zip to thoroughly test the LZHAM codec's streaming API for several years. There's been enough interest in this plugin that I've finally decided to make it an official part of LZHAM.

Why bother using this? Because LZHAM extracts and tests archives roughly 2x-3x faster than LZMA, with similar compression ratios.

Importantly, if you create any archives using this custom codec DLL, you'll (obviously) need this DLL to extract them as well. The LZHAM v1.x bitstream format is locked in stone, so future DLLs using newer versions of LZHAM will be forwards and backwards compatible with this release.

You can find the source to the plugin on GitHub here. The plugin itself lives in the lzham7zip directory.


I've uploaded precompiled DLLs for the x86 and x64 versions of 7-zip v15.12 here.

To use this, create a new directory named "codecs" wherever you installed 7-zip, then copy the correct DLL (either x86 or x64) into this directory. For example, if you've installed the 32-bit version of 7-zip, copy LzhamCodec_x86.dll into "C:\Program Files (x86)\7-Zip\codecs". For the 64-bit version, copy the x64 DLL into "C:\Program Files\7-Zip\codecs".

To verify the installation, enter "7z.exe i" in a command prompt (cmd.exe) to list all the installed codecs. You should see this:

 0  ED  6F00181 AES256CBC
 1  ED  4F71001 LZHAM

Build Instructions

If you want to compile this yourself, first grab the source code to 7-zip v15.12 and extract the archive somewhere. Next, "git clone" the LZHAM repository into this directory. Your final directory structure should be:

11/21/2015  05:00 PM    <DIR>          .
11/21/2015  05:00 PM    <DIR>          ..
11/21/2015  05:00 PM    <DIR>          Asm
11/21/2015  05:00 PM    <DIR>          bin
11/21/2015  05:00 PM    <DIR>          C
11/21/2015  05:00 PM    <DIR>          CPP
11/21/2015  05:00 PM    <DIR>          DOC
11/21/2015  05:00 PM    <DIR>          lzham_codec_devel

Now load this Visual Studio solution: lzham_codec_devel/lzham7zip/LzhamCodec.sln.

It builds with VS 2010 and VS 2015. The DLLs will be output into the bin directory.

Usage Instructions

Example command line:

7z -m0=LZHAM -ma=0 -mmt=8 -mx=9 -md=64M a temp.7z *.dll

For maximum compression, use the 64-bit version and use:

7z -m0=LZHAM -ma=0 -mmt=8 -mx=9 -md=512M a temp.7z *.dll

Command line options (also see this guide):

-m0=LZHAM - Use LZHAM for compression
-ma=[0,1] - 0=non-deterministic mode, 1=deterministic (slower)
-mmt=X - Set total # of threads to use. Default = # of CPUs.
-mx=[0-9] - Compression mode, 0=fastest, 9=slowest
-md={Size}[b|k|m] - Set dictionary size, up to 64MB in x86, 512MB in x64

Unfortunately, the 7-Zip File Manager GUI doesn't let you select LZHAM directly in the UI. Instead, you must specify custom parameters: 0=bcj 1=lzham mt=8

Note that if you don't specify "mt=X" (where X is the number of threads to use for compression), LZHAM will just use whatever value is in the GUI's "Number of CPU threads" pulldown (1 or 2 threads), which will be very slow.

Thursday, November 12, 2015

Visualizing the Calgary compression corpus

The Calgary corpus is a collection of text and binary files commonly used to benchmark and test lossless compression programs. It's now quite dated, but it still has some value because the corpus is so well known.

Anyhow, here are the files visualized using the same approach described in my previous blog post. The block size (each pixel) = 512 bytes.

paper1 and paper6 have some interesting shifts at the ends of each file, which correspond to the bottom right sections of the images. It turns out these are the appendices, whose content is very different from the rest of each paper.

Wednesday, November 11, 2015

Visualization of file contents using LZ compression

I love tools that can create cool looking images out of piles of raw bits.

These images were created by a new tool I've been working on named "fileview", a file data visualization and debugging tool similar to Matt Mahoney's useful FV utility. FV visualizes string match lengths and distances at the byte level, while this tool visualizes the compressibility of each file block using every other block as a static dictionary. Tools like this reveal high-level file structure and the overall compressibility of each region in a file. These images were computed from single files, but it's possible to build them from two different files, too.

Each pixel represents the ratio of matched bytes for a single source file block. The block size ranges from 32KB-2MB depending on the source file's size. To compute the image, the tool compresses every block in the file against every other block, using LZ77 greedy parsing against a static dictionary built from the other block. The dictionary is not updated as the block is compressed, unlike standard LZ. Each pixel is set to 255*total_match_bytes/block_size.

The brighter a pixel is in this visualization, the more the two blocks being compared resemble each other in an LZ compression ratio sense. The first scanline shows how compressible each block is using a static dictionary built from the first block, the second scanline uses a dictionary from the 2nd block, etc.:

X axis = Block being compressed
Y axis = Static dictionary block

The diagonal pixels are all-white, which makes sense because a block LZ compressed using a static dictionary built from itself should be perfectly compressible (i.e. just one big match).
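
To make the per-pixel computation concrete, here's a minimal C++ sketch (not the actual fileview code) of how one pixel could be computed for a (dictionary block, source block) pair. It assumes a 4-byte minimum match length and uses a brute-force candidate scan where a real tool would use a proper match finder:

// Minimal sketch: greedy LZ77-style matching of 'src' against a static
// dictionary built from 'dict'. The dictionary is never updated while the
// block is coded, matching the scheme described above.
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

static const size_t kMinMatchLen = 4;   // assumed minimum match length

// Returns the pixel brightness 255 * total_match_bytes / block_size.
uint8_t block_similarity(const uint8_t* dict, size_t dict_size,
                         const uint8_t* src, size_t src_size)
{
    if (!src_size)
        return 0;

    // Index every 4-byte sequence in the dictionary block.
    std::unordered_map<uint32_t, std::vector<size_t>> table;
    for (size_t i = 0; i + kMinMatchLen <= dict_size; ++i)
    {
        uint32_t key;
        std::memcpy(&key, dict + i, sizeof(key));
        table[key].push_back(i);
    }

    size_t total_match_bytes = 0, cur = 0;
    while (cur + kMinMatchLen <= src_size)
    {
        uint32_t key;
        std::memcpy(&key, src + cur, sizeof(key));

        size_t best_len = 0;
        auto it = table.find(key);
        if (it != table.end())
        {
            // Greedy parse: take the longest match available at this position.
            for (size_t pos : it->second)
            {
                size_t len = 0;
                while ((pos + len < dict_size) && (cur + len < src_size) &&
                       (dict[pos + len] == src[cur + len]))
                    ++len;
                if (len > best_len)
                    best_len = len;
            }
        }

        if (best_len >= kMinMatchLen)
        {
            total_match_bytes += best_len;   // matched bytes
            cur += best_len;
        }
        else
        {
            ++cur;                           // unmatched literal byte
        }
    }

    return (uint8_t)((255 * total_match_bytes) / src_size);
}

Computing this for every (dictionary block, source block) pair fills in one image row per dictionary block, which matches the row/column layout described above.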

- High-resolution car image in BMP format:

Source image in PNG format:

- An old Adobe installer executable - notice the compressed data block near the beginning of the file:

- A test file from my data corpus that caused an early implementation of the LZHAM codec's "best of X arrivals" parser to slow to an absolute crawl:

- enwik8 (XML text) - a highly compressible text file, so all blocks highly resemble each other:

- Test file (cp32e406.exe) from my corpus containing large amounts of compressed data with a handful of similar blocks:

- Hubble space telescope image in BMP format:

Source image in PNG format:

- Large amount of JSON text from a game:

- Unity46.exe:

- Unity demo screenshot in BMP format:

- English word list text file:

Monday, November 2, 2015

More on bsdiff and delta compression

bsdiff is a simple delta compression algorithm, and it performs well compared to its open and closed source competitors. (Note that bsdiff doesn't scale to large files due to its memory requirements, but that's somewhat fixable by breaking the problem up into blocks.) It also beats LZHAM in static dictionary mode by a large margin, and I want to understand why.

It conceptually works in three high-level phases:

1. Build suffix array of original file

2. Preprocess the new file, using the suffix array to find exact or approximate matches against the original file. This results in a series of 24 byte patch control commands, followed by a "diff" block (bytes added to various regions from the original file) and an "extra" block (unmatched literal data).

3. Separately compress this data using bzip2, with a 512KB block size.

As a quick experiment, switching step 3 to LZMA or LZHAM results in easy gains (no surprise as bzip2 is pretty dated):

Original file: Unity510.exe 52,680,480
New file: Unity512.exe 52,740,568

bsdiff preprocess+lzma: 3,012,581
bsdiff preprocess+lzham_devel: 3,152,188
bsdiff preprocess+bzip2: 3,810,343
lzham_devel (in delta mode, seed dict=Unity510.exe): 4,831,025

The bsdiff patch control blocks consist of a series of Add/Copy/Skip triples (x, y, z). Each command consists of three 8-byte values:

- Add x bytes from the original file to x bytes from the diff block, write results to output
(Literally - add each individual byte.)

- Copy y bytes from the extra block to output

- Skip forwards or backwards z bytes in the original file

Most of the output consists of diff bytes in this test:

Commands: 33,111 (794,664 bytes)
Diff bytes: 51,023,396
Extra bytes: 1,717,172

The bsdiff control block data can be viewed as a sort of program that outputs the new file, given the old file and a list of "addcopyskip x,y,z" instructions along with three automatically updated machine "registers": the current offsets into the old file, the diff bytes block, and the extra bytes block. bsdiff then compresses this program and the two data blocks (diff+extra) individually with bzip2.
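
As a concrete illustration, here's a minimal C++ sketch (not bsdiff's actual source, and with all bounds checking omitted) of how such a program can be replayed to rebuild the new file once the three streams have been decompressed:

// Sketch of replaying bsdiff-style (add, copy, skip) triples. The three
// "registers" are old_pos, diff_pos and extra_pos.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Control { int64_t add, copy, skip; };   // one (x, y, z) triple

std::vector<uint8_t> apply_patch(const std::vector<uint8_t>& old_file,
                                 const std::vector<Control>& controls,
                                 const std::vector<uint8_t>& diff,
                                 const std::vector<uint8_t>& extra)
{
    std::vector<uint8_t> out;
    size_t old_pos = 0, diff_pos = 0, extra_pos = 0;

    for (const Control& c : controls)
    {
        // Add x bytes: each output byte is an old file byte plus a diff byte.
        for (int64_t i = 0; i < c.add; ++i)
            out.push_back((uint8_t)(old_file[old_pos + i] + diff[diff_pos + i]));
        old_pos += (size_t)c.add;
        diff_pos += (size_t)c.add;

        // Copy y bytes straight from the extra (unmatched literal) block.
        for (int64_t i = 0; i < c.copy; ++i)
            out.push_back(extra[extra_pos + i]);
        extra_pos += (size_t)c.copy;

        // Skip forwards or backwards z bytes in the old file.
        old_pos = (size_t)((int64_t)old_pos + c.skip);
    }
    return out;
}

A real patcher obviously has to validate every offset and length against the stream sizes before using them.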

I think there are several key reasons why bsdiff works well relative to LZHAM with a static dictionary (and LZMA too, except it doesn't have explicit support for static dictionaries):

- LZHAM has no explicit way of referencing the original file (the seed dictionary). It must encode high match distances to reach back into the seed dictionary, beyond the already decoded data. This is expensive, and grows increasingly so as the dictionary grows larger.

- bsdiff is able to move its original file pointer either forwards or backwards, by encoding the distance to travel and a sign bit. LZHAM can only reuse previous match distances (from the 4-entry LRU match history buffer), or it must spend many bits coding absolute match distances.

- LZHAM doesn't have delta matches (bsdiff's add operation) to efficiently encode the "second order" changes discussed in Colin Percival's thesis. The closest it gets is its special handling of single-byte LAMs (literals after matches): it XORs the mismatched byte against the lookahead byte and codes the result with Huffman coding (see the sketch after this list). But immediately after a LAM it always falls back to plain literals or exact matches.

- The bsdiff approach is able to apply the full bzip2 machinery to the diff bytes. In this file, most diff bytes are 0 with sprinklings of 2 or 3 byte deltas. Patterns are readily apparent in the delta data.
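
To illustrate the gap, here's a small hypothetical sketch (not LZHAM's internals) contrasting LZHAM's single-byte XOR delta literal with the arbitrarily long byte-wise add runs bsdiff gets from its diff block:

#include <cstddef>
#include <cstdint>

// LZHAM-style LAM: one XOR delta literal at the mismatch position, after
// which the coder falls back to plain literals or exact matches.
uint8_t lam_delta(uint8_t new_byte, uint8_t lookahead_byte)
{
    return (uint8_t)(new_byte ^ lookahead_byte);   // Huffman coded
}

// bsdiff-style add: an entire run of diff bytes added to old file bytes, so
// long runs of small "second order" changes stay within one cheap operation.
void apply_add_run(const uint8_t* old_bytes, const uint8_t* diff_bytes,
                   uint8_t* out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = (uint8_t)(old_bytes[i] + diff_bytes[i]);
}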

After some thought, one way to enhance LZHAM with stronger support for delta compression: support multibyte LAMs (perhaps by outputting the LAM length with arithmetic coding when in LAM mode), and maybe add the ability to encode rep matches with delta distances.

Or, use LZHAM or LZMA as-is and do the delta compression step as a preprocess, just like bsdiff does with bzip2.

A few observations about Unity

I've only been at Unity Technologies for a month, and obviously I have much to learn. Here are a few things I've picked up so far:

Developers with high programmer empathy and low egos tend to be successful here.

Interesting company attributes: a distributed, open source-style development model; team oriented; high cooperation. Automated testing, good processes, build tools, etc. are very highly valued. "Elegant code" is a serious concept at Unity.

Hiring: Flexible. Unity has local offices almost everywhere, giving it great freedom in hiring.

Saturday, October 31, 2015

15 Reasons why developers won't use your awesome codec

Getting other developers to use your code in their products is surprisingly difficult.  A lot of this applies to open source development in general:

1. "Nobody's ever been fired for choosing IBM."
The codec is perceived to be "too new" (i.e. less than 5-10 years old), and the developer is simply afraid to adopt it.

2. Inertia: The developer already has a favorite codec they believe is useful for every scenario, and they've already hooked it up, so they don't want to invest in switching to something else.

3. Politics: The developer irrationally refuses to even look at your codec because you work for the same company as they do.

4. Perceived lack of support: If the codec fails in the field, who's gonna help us debug it?

(I believe this is one reason why Rad's compression products are perceived to offer enough value to be worth purchasing.)

5. Linus hates C++: The codec is written in C++, so it's obviously crap.

6. Bad packaging: It's not wrapped up into a trivial to compile and use library, with a dead simple C-style API.

7. Too exotic: The developer just doesn't understand it.

8. Untested on mobile: It's been designed and tested on a desktop CPU, with unknown mobile performance (if it compiles/works at all).

9. Memory usage: It's perceived to use too much RAM.

10. Executable size: The codec's executable size is perceived to be too large.

11. Lacks some feature(s): The codec doesn't support streaming, or the decompressor doesn't support in-place decompression.

12. Performance: The thing is #1 on the compression ratio charts, but it's just stupendously slow.

13. Robustness: Your codec failed us once and so we no longer trust it.

14. Licensing issues: GPL, LGPL, etc. is the kiss of death.

15. Patent FUD: A patent troll has mentioned that your codec may be infringing on one of their patents, so the world is now afraid to use it. It doesn't matter if your codec doesn't actually infringe.

Thursday, October 29, 2015

Grab bag list of LZHAM improvements

Putting this somewhere so I don't forget it:

Patching specific:

- Allow the decompressor to refer to a static dictionary anywhere in memory (i.e. no memcpy() needed into the decompressor's dictionary buffer)

- bsdiff is so simple looking. Why does it work so well when patching very similar files?

- For patching, investigate allowing the compressor to emit "delta rep" matches, or rep matches with optional plus/minus distances.

- Use Matt Mahoney's FV tool on patch file scenarios.

- Add a visualization mode to LZHAM's compressor, like FV.

General improvements:

- Symbol price accuracy: the parsing jobs use approximate symbol statistics that are locked at the beginning of each block. Examine how inaccurate these statistics really are.

- Try to SIMD the Huffman table update routines.

- The codec is too focused on the totally general case, which is streaming. Many useful scenarios do not involve streaming at all.

Add a "small file" optimization mode to LZHAM's compressor, which can be used as a hint to the compressor that it's OK to try encoding the first block in multiple ways.

Add a decompressor variant that doesn't support streaming at all, to reduce the # of decompressor basic blocks (idea from John Brooks).

- Add a compile time option that reduces the size of the decompressor as much as possible, even if it sacrifices perf.

- The big Huffman tables (like for literals, delta literals, match len/dist categories) are all initialized to symbol frequencies of 1's, which is wasteful for things like text files. Implement some way for the compressor to have control over this, like escape codes to jam small regions of the symbol table frequencies to 1's, or perhaps configuration bits at the start of the stream.

- LZHAM's Huffman table decoding fastbits (symbol decoding acceleration) table is too large on very small streams (an observation due to Charles Bloom). The decoder should start with small tables and grow them over time.

- From John Brooks: Combine Deflate's idea of storing compressed Huffman table codelengths in the data stream with LZHAM's current approach of rebuilding the tables over time. At the start of streams, use compressed codelengths, then switch to dynamic.

- Add a configuration bit (global or per block?) to completely disable rep matches, which I believe will help a small amount on text files. Have the compressor try this optimization out.

- Steal the idea of global configuration settings from LZMA that tweak some of its prediction models, so the user can call LZHAM multiple times with different settings and choose the smallest results.

- There are many compromise decisions in LZHAM. For example, the decompressor state machine transition table can't be varied, the bitwise arithmetic coder's adaptation rate can't be varied, and the Huffman table update interval is only user controllable. Allowing the compressor to optimize along these axes could result in gains.

- Further profile and improve LZHAM's small block perf (reduce startup cost, increase throughput near start of streams).

Platform specific:

- Get Emscripten compilation of the decompressor working, for JavaScript support

- Deeply examine and optimize the generated assembly of the decompressor on ARM

- Dump arithmetic and Huffman coding entirely and switch to something like rANS. I think I'll need to do this to ultimately compete against Brotli.

Very long term pie in the sky stuff:

I have a frighteningly complex ROLZ branch of LZHAM alpha somewhere that works. Re-evaluate it.