Random seeking on gzip streams
After asking the lazyweb on how to seek on gzip streams I got a very useful and comprehensive reply by Paul Sladen.
As my previous searches on this topic didn't turn up to much useful things I'd like to publish Pauls reply (with his approval) here to help other people having the same problem as well:
Seeking on compressed streams is a little like reading sectors from a harddisk; to read one byte requires reading a much large sectors.
Two things are needed, the sector size and the sector length. On a hard-disk the sector length is nominally 512 with the mapping between an offset and sector start often as simple as: $offset >> 9.
In compressed streams, the relationship between stream position and sector start is non-linear. Using a non-linear mapping is more involved as searching a lookup table is required.
Mapping between an uncompressed position instead cannot be calculated and a 'zmap' table listing all sector start position and corresponding on-disk points needs to be created. The size of a sector must be determined aswell.
Bzip2:
It is fairly easy to seek on compressed bzip2 streams. In a Bzip2 stream, each block is totally separate and self contained---have a look a the 'bzip2recover' program. 'bzip2recover' scans through a bzip2 stream locating each block/'segment' (normally 900kB of input, so ~200kB of compressed output), then copying each segment to separate bzip2 file.
Gzip:
Seeking on gzip streams is somewhat more involved; Gzip uses a dictionary design where back-references are made to uncompressed data within the last 32kB ("the window").
The only safe place to start reading a 'sector' in a gzip stream is somewhere where that the dictionary size is zero. An empty dictionary occurs at the start of a stream, or where the stream has been 'reset'.
Gzip resets:
A gzip 'sector' (the length required to be read, to safely decode a byte) could be the full length of the file-size, or could occur more frequently. A gzip sector start occurs when the stream is reset---resetting the stream more frequently leads to more sector starts, but a reduction in compression thanks to less use of the sliding dictionary. Trade-off.
The "gzip --rsyncable" is an example of stream resetting; in this case the stream is reset (and the dictionary closed) each time that the sum of the last 4096 octets of input data, modulo 65536(?) equals a magic number (zero normally). In exchange for the extra reset/start points, you get a size increase of 4-5%, but a better "hit rate" for random access.
Gzip reset point lookup tables:
More likely you'd want to reset the stream every eg. 4096kB of input, then save the corresponding position of the output stream for a give input position. This lookup table of start positions needs to be stored
somewhere. 'squashfs' does in a separate file; 'zsync' in the zmap.
(If you wanted to define a standard, these reset points could be stored in the 'extra' or 'comment' fields at the start of a gzip file; storing them with a suitable header would enable some degree of seeking for sufficiently intelligent readers---files could be post-processed to add this data).
Storing a LUT takes space; adding extra reset points and storing the LUT data for these extra points takes more space---but I found storing the LUT as a set of double-deltas cut it down; could be gzipped aswell.
Gzip partial resets:
A gzip stream doesn't just contain 'reset' points, but also 'partial reset' points, a partial resets the huffman tables, but doesn't reset the dictionary. It's safe to stop reading at a partial reset (which
effectively shortens the sector length).
Whilst not normally possible, partial reset points /are/ able to be used as a starting point /if/ you can populate the previous uncompressed 32kB that forms the dictionary window from another source---'zsync' does this when reconstructing a file, however 'zsync' only does construction in a linear fashion, from the start of a file.
Further Reading:
Magic words to search for:
zsync, squashfs, apt-sync, succinct
Succinct is my unfinished project. Succinct involves many parts, one of which covers what you're doing. It might be good to break separate off seeking of compressed streams into a separate project---many small pieces make big problems easier!
What I might do is store the LUT in the EXTRA field of the gzip header (limited to 64kB though). Though, within 64kB it would always be possible to construct a useful LUT; based on the final length of the input data, it would be possible to reduce the granularity of the LUT such that it only mentioned 50%, 30%... even 5% of reset points. The LUT itself could be gzipped for extra gain.
Hope that's useful,
-Paul
PS. In all the cases (except one) that I mentioned gzip, I probably meant 'zlib' or 'deflate'! :)
In a later mail Paul also added the following:
BTW, in my search around a bit later for something I turned up two more things that might be interesting:
dictzip, examples/zran.c in the zlib source
'dictzip' is a similar idea to what I suggested at the end of the previous email. A lookup table is built and stored in the EXTRA section at the start of the gzip file and the utilities ('dictzcat' et al) then use this look-up table for random-access on the data.
'dictzip' are still slightly wide of the mark, forcing a flush/chunk every 58969 bytes. That number is chosen so that even in the worst case, the compressed version will never be more than 216 in length. However I think their math is slightly wrong as even in the worse case, gzip only generates 5 bytes per 32kB for an uncompresed store. -> ~65514 by my reckoning. Squashfs uncompressable chunks by setting the top bit of the file offset to indicate a true store.
Their lookup table format appears to be flexible enough that I /might/ be able to create a table by scanning an existing stream rather than re-encoding. Depends about the 64kB limitation. Mmm.
Thanks Paul!
Plat_Forms: The web development platform comparison
Plat_Forms is an international programming contest. It aims at comparing different technological platforms for developing web-based applications: Java EE, .NET, PHP, Perl, Python, Ruby-on-Rails
I really love the idea behind that project and after some Perl people complained they now also accept submissions from Perl developers. So if you live in or around Germany and would like to do some propaganda for your favorite web framework, apply right now, as long as your favorite web framework is based on Perl. :-)
Seeking on gzip streams?
Dear Lazyweb,
do you know anything that implements seeking on gzip/zlib streams?
Seeking forward is easily done by reading the data up to the requested position, which isn't nice, but works well. Seeking backwards may be implemented by something like this:
inflateEnd (&stream); fseek (input, 0, SEEK_SET); inflateInit (&stream); /* read up to the requested position */
This will be terribly slow for some cases, but well..
I'm just wondering how seeking with SEEK_END could be implemented without inflating the whole stream at first.
xmms2_0.2DrGonzo-1
The xmms2 project released 0.2DrGonzo recently.
I've prepared packages for it, but as the new release adds two new plugins I needed to add two new binary packages. Therefor it'll need to go through the NEW queue again.
In the meantime you can grab the source package and build it yourself for your architecture:
Perldition, a small Blog and CMS, written in Perl
Until a few days ago my website was driven by PodCMS, which allowed me to manage all of the content as directories and files containing Pod (Plain Old Documentation). Unfortunately that wasn't quite flexible enough and didn't allow some features, like comments, tags and trackbacks, to be implemented easily. Also Pod sucks for some sort of content, as there's no satisfying Pod2Html module on CPAN as it seems.
Therefor I decided to create something new. The new system has all features the old one had, but now allows to create content in lots of formats such as:
- Pod
- Markdown
- Sbc
- Textile
- HTML
Other markup formats are possible as well, as the API for the formatting plugins is quite easy and usually just a thin wrapper around a CPAN module which does the actual translation to HTML.
Beside allowing new formats to write the content in, it also adds the following features:
- Comments
- Tags
- RSS feeds for comments and tags
- Trackbacks
- Pingbacks
- MoveableType API
- Manage static pages in an easy, tree-like fashion
- An easy web interface for editing everything. No need to work with files and directories anymore.
- Uses a relational database and is portable between many of them through DBIx::Class (I developed it using SQLite and deployed it on PostgreSQL)
In conclusion I'm pretty happy with the new software. I'm just very disappointed by quality of the generated HTML that the various Pod2HTML modules on CPAN produce, so I'll probably end up in writing something myself, based on Pod::Parser.
PS: The URL to rss feed changed. Please use http://perldition.org/blog.rss.