Skip to content

Quickly pack wiki history dumps, or anything with many long repeated sections (think 100+ bytes) relatively long distances apart (think up to a few MB)

License

Notifications You must be signed in to change notification settings

twotwotwo/histzip

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

45 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

histzip

Hey! Thanks for your interest, but consider using something like brotli or zstd instead. Those modern compressors also compress quickly with long history windows (the problem histzip was written to solve) but are much better supported; you should be able to achieve better speed and compression ratios with them as well.


Compress wiki change histories, etc. Made for any input with long duplicated sections (100+ bytes) up to a few MB apart.

Download binaries for Linux amd64, Linux x86, Windows 64-bit and 32-bit, and Mac 64-bit. For faster compression try the gccgo Linux amd64 build.

Compress by piping text through histzip and bzip2 or similar:

./histzip < revisions.xml | bzip2 > revisions.xml.hbz

Turn that around to decompress:

bunzip2 < revisions.xml.hbz | ./histzip > revisions.xml

Running on dumps of English Wikipedia's history, that pipeline ran at 51 MB/s for the newest chunk and 151 MB/s for the oldest. Compression ratios were comparable to 7zip's: 8% worse for the new chunk and 10% better for the old chunk.

While compressing, histzip decompresses its output and compares checksums as a self-check. There are write-ups of the framing format and the format for compressed data. You can use the same compression engine in other programs via the histzip/lrcompress library.

If you're interested in long-range compression, some other projects might interest you. rzip is awesome; histzip lifts some implementation tricks directly from it. bm is a Bentley-McIlroy library by CloudFlare also written in Go, compressing matches against a fixed dictionary (in essence, performing a binary diff). Git, xdelta, and open-vcdiff also each have open-source binary diff implementations. Google's Brotli compressor is a tuned flate variant defaulting to a 4MB window, first used to compress WOFF 2.0 Web fonts.

If you have any trouble or questions, get in touch!

Public domain, Randall Farmer, 2013-4.

About

Quickly pack wiki history dumps, or anything with many long repeated sections (think 100+ bytes) relatively long distances apart (think up to a few MB)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages