--- 
author: 
  email: rafl@fsfe.org
  keyid: 742f2a428e635a5e
  name: Florian Ragwitz
categories: 
  - Hacking
date: 2006-10-15T22:57:07Z
guid: 790cae08-aeab-422b-a785-43c7956110c1
modified: 2006-10-15T22:57:07Z
raw: "-----BEGIN PGP SIGNED MESSAGE-----\nHash: SHA1\n\nAfter <http://perldition.org/blog/post/453 asking the lazyweb on how to seek on gzip streams> I got a very useful and comprehensive reply by <http://www.paul.sladen.org Paul Sladen>.\r\n\r\nAs my previous searches on this topic didn't turn up to much useful things I'd like to publish Pauls reply (with his approval) here to help other people having the same problem as well:\r\n\r\n[Seeking on compressed streams is a little like reading sectors from a harddisk;  to read one byte requires reading a much large sectors.\r\n\r\nTwo things are needed, the sector size and the sector length.  On a hard-disk the sector length is nominally 512 with the mapping between an offset and sector start often as simple as:  $offset \\>\\> 9.\r\n\r\nIn compressed streams, the relationship between stream position and sector start is non-linear.  Using a non-linear mapping is more involved as searching a lookup table is required.\r\n\r\nMapping between an uncompressed position instead cannot be calculated and a 'zmap' table listing all sector start position and corresponding on-disk points needs to be created.  The size of a sector must be determined aswell.\r\n\r\n_Bzip2:_\r\n\r\nIt is fairly easy to seek on compressed bzip2 streams.  In a Bzip2 stream, each block is totally separate and self contained---have a look a the 'bzip2recover' program.  'bzip2recover' scans through a bzip2 stream locating each block/'segment' (normally 900kB of input, so ~200kB of compressed output), then copying each segment to separate bzip2 file.\r\n\r\n_Gzip:_\r\n\r\nSeeking on gzip streams is somewhat more involved;  Gzip uses a dictionary design where back-references are made to uncompressed data within the last 32kB (\"the window\").\r\n\r\nThe only safe place to start reading a 'sector' in a gzip stream is somewhere where that the dictionary size is zero.  An empty dictionary occurs at the start of a stream, or where the stream has been 'reset'.\r\n\r\n_Gzip resets:_\r\n\r\nA gzip 'sector' (the length required to be read, to safely decode a byte) could be the full length of the file-size, or could occur more frequently. A gzip sector start occurs when the stream is reset---resetting the stream more frequently leads to more sector starts, but a reduction in compression thanks to less use of the sliding dictionary.  Trade-off.\r\n\r\nThe \"gzip --rsyncable\" is an example of stream resetting;  in this case the stream is reset (and the dictionary closed) each time that the sum of the last 4096 octets of input data, modulo 65536(?) equals a magic number (zero normally).  In exchange for the extra reset/start points, you get a size increase of 4-5%, but a better \"hit rate\" for random access.\r\n\r\n_Gzip reset point lookup tables:_\r\n\r\nMore likely you'd want to reset the stream every eg. 4096kB of input, then save the corresponding position of the *output* stream for a give input position.  This lookup table of start positions needs to be stored\r\nsomewhere. 'squashfs' does in a separate file; 'zsync' in the zmap.\r\n\r\n(If you wanted to define a standard, these reset points could be stored in the 'extra' or 'comment' fields at the start of a gzip file;  storing them with a suitable header would enable some degree of seeking for sufficiently intelligent readers---files could be post-processed to add this data).\r\n\r\nStoring a LUT takes space;  adding extra reset points and storing the LUT data for these extra points takes more space---but I found storing the LUT as a set of double-deltas cut it down;  could be gzipped aswell.\r\n\r\n_Gzip partial resets:_\r\n\r\nA gzip stream doesn't just contain 'reset' points, but also 'partial reset' points, a partial resets the huffman tables, but doesn't reset the dictionary.  It's safe to _stop_ reading at a partial reset (which\r\neffectively shortens the sector length).\r\n\r\nWhilst not normally possible, partial reset points /are/ able to be used as a starting point /if/ you can populate the previous uncompressed 32kB that forms the dictionary window from another source---'zsync' does this when reconstructing a file, however 'zsync' only does construction in a linear fashion, from the start of a file.\r\n\r\n_Further Reading:_\r\n\r\nMagic words to search for:\r\n\r\n  zsync, squashfs, apt-sync, succinct\r\n\r\nSuccinct is my unfinished project.  Succinct involves many parts, one of which covers what you're doing.  It might be good to break separate off seeking of compressed streams into a separate project---many small pieces make big problems easier!\r\n\r\nWhat I might do is store the LUT in the EXTRA field of the gzip header (limited to 64kB though).  Though, within 64kB it would always be possible to construct a _useful_ LUT;  based on the final length of the input data, it would be possible to reduce the granularity of the LUT such that it only mentioned 50%, 30%... even 5% of reset points.  The LUT itself could be gzipped for extra gain.\r\n\r\nHope that's useful,\r\n\r\n        -Paul\r\n\r\nPS. In all the cases (except one) that I mentioned gzip, I probably meant 'zlib' or 'deflate'! :)] Paul Sladen:\r\n\r\nIn a later mail Paul also added the following:\r\n\r\n[BTW, in my search around a bit later for something I turned up two more things that might be interesting:\r\n\r\n  dictzip, examples/zran.c in the zlib source\r\n\r\n'dictzip' is a similar idea to what I suggested at the end of the previous email.  A lookup table is built and stored in the EXTRA section at the start of the gzip file and the utilities ('dictzcat' et al) then use this look-up table for random-access on the data.\r\n\r\n'dictzip' are still slightly wide of the mark, forcing a flush/chunk every 58969 bytes.  That number is chosen so that even in the worst case, the compressed version will never be more than 2**16 in length.  However I think their math is slightly wrong as even in the worse case, gzip only generates 5 bytes per 32kB for an uncompresed store. -\\> ~65514 by my reckoning. Squashfs uncompressable chunks by setting the top bit of the file offset to indicate a true store.\r\n\r\nTheir lookup table format appears to be flexible enough that I /might/ be able to create a table by scanning an existing stream rather than re-encoding.  Depends about the 64kB limitation.  Mmm.] Paul Sladen:\r\n\r\n_*Thanks Paul!*_\r\n-----BEGIN PGP SIGNATURE-----\nVersion: GnuPG v1.4.6 (GNU/Linux)\n\niD8DBQFGCP+mdC8qQo5jWl4RAsSCAJ47lJp8xZjwq/JHmKXnhB2MQDiKkgCfaYKw\nOynS7nBnlFauWvkZWLxIgWk=\n=wAgS\n-----END PGP SIGNATURE-----\n"
signed: 1
summary: " After asking the lazyweb on how to seek on …"
tags: []

text: "\nAfter asking the lazyweb on how to seek on gzip streams [1] I got a very\nuseful and comprehensive reply by Paul Sladen [2].\n\nAs my previous searches on this topic didn't turn up to much useful\nthings I'd like to publish Pauls reply (with his approval) here to help\nother people having the same problem as well: Paul Sladen: Seeking on\ncompressed streams is a little like reading sectors from a harddisk; to\nread one byte requires reading a much large sectors.\n\nTwo things are needed, the sector size and the sector length. On a\nhard-disk the sector length is nominally 512 with the mapping between an\noffset and sector start often as simple as: $offset >> 9.\n\nIn compressed streams, the relationship between stream position and sec-\ntor start is non-linear. Using a non-linear mapping is more involved as\nsearching a lookup table is required.\n\nMapping between an uncompressed position instead cannot be calculated\nand a 'zmap' table listing all sector start position and corresponding\non-disk points needs to be created. The size of a sector must be deter-\nmined aswell.\n\nBzip2:\n\nIt is fairly easy to seek on compressed bzip2 streams. In a Bzip2\nstream, each block is totally separate and self contained---have a look\na the 'bzip2recover' program. 'bzip2recover' scans through a bzip2\nstream locating each block/'segment' (normally 900kB of input, so ~200kB\nof compressed output), then copying each segment to separate bzip2 file.\n\nGzip:\n\nSeeking on gzip streams is somewhat more involved; Gzip uses a dic-\ntionary design where back-references are made to uncompressed data with-\nin the last 32kB (\"the window\").\n\nThe only safe place to start reading a 'sector' in a gzip stream is\nsomewhere where that the dictionary size is zero. An empty dictionary\noccurs at the start of a stream, or where the stream has been 'reset'.\n\nGzip resets:\n\nA gzip 'sector' (the length required to be read, to safely decode a\nbyte) could be the full length of the file-size, or could occur more\nfrequently. A gzip sector start occurs when the stream is reset---reset-\nting the stream more frequently leads to more sector starts, but a re-\nduction in compression thanks to less use of the sliding dictionary.\nTrade-off.\n\nThe \"gzip --rsyncable\" is an example of stream resetting; in this\ncase the stream is reset (and the dictionary closed) each time that\nthe sum of the last 4096 octets of input data, modulo 65536(?) equals\na magic number (zero normally). In exchange for the extra reset/start\npoints, you get a size increase of 4-5%, but a better \"hit rate\" for\nrandom access.\n\nGzip reset point lookup tables:\n\nMore likely you'd want to reset the stream every eg. 4096kB of input,\nthen save the corresponding position of the output stream for a give in-\nput position. This lookup table of start positions needs to be stored\n\nsomewhere. 'squashfs' does in a separate file; 'zsync' in the zmap.\n\n(If you wanted to define a standard, these reset points could be stored\nin the 'extra' or 'comment' fields at the start of a gzip file; storing\nthem with a suitable header would enable some degree of seeking for suf-\nficiently intelligent readers---files could be post-processed to add\nthis data).\n\nStoring a LUT takes space; adding extra reset points and storing the LUT\ndata for these extra points takes more space---but I found storing the\nLUT as a set of double-deltas cut it down; could be gzipped aswell.\n\nGzip partial resets:\n\nA gzip stream doesn't just contain 'reset' points, but also 'partial re-\nset' points, a partial resets the huffman tables, but doesn't reset the\ndictionary. It's safe to stop reading at a partial reset (which\n\neffectively shortens the sector length).\n\nWhilst not normally possible, partial reset points /are/ able to be used\nas a starting point /if/ you can populate the previous uncompressed 32kB\nthat forms the dictionary window from another source---'zsync' does this\nwhen reconstructing a file, however 'zsync' only does construction in a\nlinear fashion, from the start of a file.\n\nFurther Reading:\n\nMagic words to search for:\n\n zsync, squashfs, apt-sync, succinct\n\nSuccinct is my unfinished project. Succinct involves many parts, one of\nwhich covers what you're doing. It might be good to break separate off\nseeking of compressed streams into a separate project---many small\npieces make big problems easier!\n\nWhat I might do is store the LUT in the EXTRA field of the gzip header\n(limited to 64kB though). Though, within 64kB it would always be possi-\nble to construct a useful LUT; based on the final length of the input\ndata, it would be possible to reduce the granularity of the LUT such\nthat it only mentioned 50%, 30%... even 5% of reset points. The LUT it-\nself could be gzipped for extra gain.\n\nHope that's useful,\n\n -Paul\n\nPS. In all the cases (except one) that I mentioned gzip, I probably\n    meant 'zlib' or 'deflate'! :)\n\nIn a later mail Paul also added the following: Paul Sladen: BTW, in my\nsearch around a bit later for something I turned up two more things that\nmight be interesting:\n\n dictzip, examples/zran.c in the zlib source\n\n'dictzip' is a similar idea to what I suggested at the end of the previ-\nous email. A lookup table is built and stored in the EXTRA section at\nthe start of the gzip file and the utilities ('dictzcat' et al) then use\nthis look-up table for random-access on the data.\n\n'dictzip' are still slightly wide of the mark, forcing a flush/chunk ev-\nery 58969 bytes. That number is chosen so that even in the worst case,\nthe compressed version will never be more than 216 in length. However I\nthink their math is slightly wrong as even in the worse case, gzip only\ngenerates 5 bytes per 32kB for an uncompresed store. -> ~65514 by my\nreckoning. Squashfs uncompressable chunks by setting the top bit of the\nfile offset to indicate a true store.\n\nTheir lookup table format appears to be flexible enough that I /might/\nbe able to create a table by scanning an existing stream rather than\nre-encoding. Depends about the 64kB limitation. Mmm.\n\nThanks Paul!\n\n-- \n [1] http://perldition.org/blog/post/453\n [2] http://www.paul.sladen.org\n"
title: Random seeking on gzip streams
type: sbc
uri: http://perldition.org/articles/Random%20seeking%20on%20gzip%20streams.sbc
xhtml: "<p>After <a href=\"http://perldition.org/blog/post/453\">asking the lazyweb on how to seek on gzip streams</a> I got a very useful and comprehensive reply by <a href=\"http://www.paul.sladen.org\">Paul Sladen</a>.</p><p>As my previous searches on this topic didn&apos;t turn up to much useful things I&apos;d like to publish Pauls reply (with his approval) here to help other people having the same problem as well:</p><blockquote cite=\"Paul Sladen:\"><p>Seeking on compressed streams is a little like reading sectors from a harddisk; to read one byte requires reading a much large sectors.</p><p>Two things are needed, the sector size and the sector length. On a hard-disk the sector length is nominally 512 with the mapping between an offset and sector start often as simple as: $offset &gt;&gt; 9.</p><p>In compressed streams, the relationship between stream position and sector start is non-linear. Using a non-linear mapping is more involved as searching a lookup table is required.</p><p>Mapping between an uncompressed position instead cannot be calculated and a &apos;zmap&apos; table listing all sector start position and corresponding on-disk points needs to be created. The size of a sector must be determined aswell.</p><p><strong>Bzip2:</strong></p><p>It is fairly easy to seek on compressed bzip2 streams. In a Bzip2 stream, each block is totally separate and self contained---have a look a the &apos;bzip2recover&apos; program. &apos;bzip2recover&apos; scans through a bzip2 stream locating each block/&apos;segment&apos; (normally 900kB of input, so ~200kB of compressed output), then copying each segment to separate bzip2 file.</p><p><strong>Gzip:</strong></p><p>Seeking on gzip streams is somewhat more involved; Gzip uses a dictionary design where back-references are made to uncompressed data within the last 32kB (&quot;the window&quot;).</p><p>The only safe place to start reading a &apos;sector&apos; in a gzip stream is somewhere where that the dictionary size is zero. An empty dictionary occurs at the start of a stream, or where the stream has been &apos;reset&apos;.</p><p><strong>Gzip resets:</strong></p><p>A gzip &apos;sector&apos; (the length required to be read, to safely decode a byte) could be the full length of the file-size, or could occur more frequently. A gzip sector start occurs when the stream is reset---resetting the stream more frequently leads to more sector starts, but a reduction in compression thanks to less use of the sliding dictionary. Trade-off.</p><p>The &quot;gzip --rsyncable&quot; is an example of stream resetting; in this case the stream is reset (and the dictionary closed) each time that the sum of the last 4096 octets of input data, modulo 65536(?) equals a magic number (zero normally). In exchange for the extra reset/start points, you get a size increase of 4-5%, but a better &quot;hit rate&quot; for random access.</p><p><strong>Gzip reset point lookup tables:</strong></p><p>More likely you&apos;d want to reset the stream every eg. 4096kB of input, then save the corresponding position of the output stream for a give input position. This lookup table of start positions needs to be stored</p><p>somewhere. &apos;squashfs&apos; does in a separate file; &apos;zsync&apos; in the zmap.</p><p>(If you wanted to define a standard, these reset points could be stored in the &apos;extra&apos; or &apos;comment&apos; fields at the start of a gzip file; storing them with a suitable header would enable some degree of seeking for sufficiently intelligent readers---files could be post-processed to add this data).</p><p>Storing a LUT takes space; adding extra reset points and storing the LUT data for these extra points takes more space---but I found storing the LUT as a set of double-deltas cut it down; could be gzipped aswell.</p><p><strong>Gzip partial resets:</strong></p><p>A gzip stream doesn&apos;t just contain &apos;reset&apos; points, but also &apos;partial reset&apos; points, a partial resets the huffman tables, but doesn&apos;t reset the dictionary. It&apos;s safe to <strong>stop</strong> reading at a partial reset (which</p><p>effectively shortens the sector length).</p><p>Whilst not normally possible, partial reset points /are/ able to be used as a starting point /if/ you can populate the previous uncompressed 32kB that forms the dictionary window from another source---&apos;zsync&apos; does this when reconstructing a file, however &apos;zsync&apos; only does construction in a linear fashion, from the start of a file.</p><p><strong>Further Reading:</strong></p><p>Magic words to search for:</p><p> zsync, squashfs, apt-sync, succinct</p><p>Succinct is my unfinished project. Succinct involves many parts, one of which covers what you&apos;re doing. It might be good to break separate off seeking of compressed streams into a separate project---many small pieces make big problems easier!</p><p>What I might do is store the LUT in the EXTRA field of the gzip header (limited to 64kB though). Though, within 64kB it would always be possible to construct a <strong>useful</strong> LUT; based on the final length of the input data, it would be possible to reduce the granularity of the LUT such that it only mentioned 50%, 30%... even 5% of reset points. The LUT itself could be gzipped for extra gain.</p><p>Hope that&apos;s useful,</p><p> -Paul</p><p>PS. In all the cases (except one) that I mentioned gzip, I probably meant &apos;zlib&apos; or &apos;deflate&apos;! :)</p></blockquote><p>In a later mail Paul also added the following:</p><blockquote cite=\"Paul Sladen:\"><p>BTW, in my search around a bit later for something I turned up two more things that might be interesting:</p><p> dictzip, examples/zran.c in the zlib source</p><p>&apos;dictzip&apos; is a similar idea to what I suggested at the end of the previous email. A lookup table is built and stored in the EXTRA section at the start of the gzip file and the utilities (&apos;dictzcat&apos; et al) then use this look-up table for random-access on the data.</p><p>&apos;dictzip&apos; are still slightly wide of the mark, forcing a flush/chunk every 58969 bytes. That number is chosen so that even in the worst case, the compressed version will never be more than 216 in length. However I think their math is slightly wrong as even in the worse case, gzip only generates 5 bytes per 32kB for an uncompresed store. -&gt; ~65514 by my reckoning. Squashfs uncompressable chunks by setting the top bit of the file offset to indicate a true store.</p><p>Their lookup table format appears to be flexible enough that I /might/ be able to create a table by scanning an existing stream rather than re-encoding. Depends about the 64kB limitation. Mmm.</p></blockquote><p><strong>Thanks Paul!</strong></p>"
