Random seeking on gzip streams

Posted on 2006-10-16 (月) at 12:57 am by Unknown Name [SIGNED]

Posted in: Hacking

After asking the lazyweb on how to seek on gzip streams I got a very useful and comprehensive reply by Paul Sladen.

As my previous searches on this topic didn't turn up to much useful things I'd like to publish Pauls reply (with his approval) here to help other people having the same problem as well:

Seeking on compressed streams is a little like reading sectors from a harddisk; to read one byte requires reading a much large sectors.

Two things are needed, the sector size and the sector length. On a hard-disk the sector length is nominally 512 with the mapping between an offset and sector start often as simple as: $offset >> 9.

In compressed streams, the relationship between stream position and sector start is non-linear. Using a non-linear mapping is more involved as searching a lookup table is required.

Mapping between an uncompressed position instead cannot be calculated and a 'zmap' table listing all sector start position and corresponding on-disk points needs to be created. The size of a sector must be determined aswell.

Bzip2:

It is fairly easy to seek on compressed bzip2 streams. In a Bzip2 stream, each block is totally separate and self contained---have a look a the 'bzip2recover' program. 'bzip2recover' scans through a bzip2 stream locating each block/'segment' (normally 900kB of input, so ~200kB of compressed output), then copying each segment to separate bzip2 file.

Gzip:

Seeking on gzip streams is somewhat more involved; Gzip uses a dictionary design where back-references are made to uncompressed data within the last 32kB ("the window").

The only safe place to start reading a 'sector' in a gzip stream is somewhere where that the dictionary size is zero. An empty dictionary occurs at the start of a stream, or where the stream has been 'reset'.

Gzip resets:

A gzip 'sector' (the length required to be read, to safely decode a byte) could be the full length of the file-size, or could occur more frequently. A gzip sector start occurs when the stream is reset---resetting the stream more frequently leads to more sector starts, but a reduction in compression thanks to less use of the sliding dictionary. Trade-off.

The "gzip --rsyncable" is an example of stream resetting; in this case the stream is reset (and the dictionary closed) each time that the sum of the last 4096 octets of input data, modulo 65536(?) equals a magic number (zero normally). In exchange for the extra reset/start points, you get a size increase of 4-5%, but a better "hit rate" for random access.

Gzip reset point lookup tables:

More likely you'd want to reset the stream every eg. 4096kB of input, then save the corresponding position of the output stream for a give input position. This lookup table of start positions needs to be stored

somewhere. 'squashfs' does in a separate file; 'zsync' in the zmap.

(If you wanted to define a standard, these reset points could be stored in the 'extra' or 'comment' fields at the start of a gzip file; storing them with a suitable header would enable some degree of seeking for sufficiently intelligent readers---files could be post-processed to add this data).

Storing a LUT takes space; adding extra reset points and storing the LUT data for these extra points takes more space---but I found storing the LUT as a set of double-deltas cut it down; could be gzipped aswell.

Gzip partial resets:

A gzip stream doesn't just contain 'reset' points, but also 'partial reset' points, a partial resets the huffman tables, but doesn't reset the dictionary. It's safe to stop reading at a partial reset (which

effectively shortens the sector length).

Whilst not normally possible, partial reset points /are/ able to be used as a starting point /if/ you can populate the previous uncompressed 32kB that forms the dictionary window from another source---'zsync' does this when reconstructing a file, however 'zsync' only does construction in a linear fashion, from the start of a file.

Further Reading:

Magic words to search for:

zsync, squashfs, apt-sync, succinct

Succinct is my unfinished project. Succinct involves many parts, one of which covers what you're doing. It might be good to break separate off seeking of compressed streams into a separate project---many small pieces make big problems easier!

What I might do is store the LUT in the EXTRA field of the gzip header (limited to 64kB though). Though, within 64kB it would always be possible to construct a useful LUT; based on the final length of the input data, it would be possible to reduce the granularity of the LUT such that it only mentioned 50%, 30%... even 5% of reset points. The LUT itself could be gzipped for extra gain.

Hope that's useful,

-Paul

PS. In all the cases (except one) that I mentioned gzip, I probably meant 'zlib' or 'deflate'! :)

In a later mail Paul also added the following:

BTW, in my search around a bit later for something I turned up two more things that might be interesting:

dictzip, examples/zran.c in the zlib source

'dictzip' is a similar idea to what I suggested at the end of the previous email. A lookup table is built and stored in the EXTRA section at the start of the gzip file and the utilities ('dictzcat' et al) then use this look-up table for random-access on the data.

'dictzip' are still slightly wide of the mark, forcing a flush/chunk every 58969 bytes. That number is chosen so that even in the worst case, the compressed version will never be more than 216 in length. However I think their math is slightly wrong as even in the worse case, gzip only generates 5 bytes per 32kB for an uncompresed store. -> ~65514 by my reckoning. Squashfs uncompressable chunks by setting the top bit of the file offset to indicate a true store.

Their lookup table format appears to be flexible enough that I /might/ be able to create a table by scanning an existing stream rather than re-encoding. Depends about the 64kB limitation. Mmm.

Thanks Paul!

no comments | Read more...


Seeking on gzip streams?

Posted on 2006-10-13 (金) at 12:00 am

Posted in: Hacking

Dear Lazyweb,

do you know anything that implements seeking on gzip/zlib streams?

Seeking forward is easily done by reading the data up to the requested position, which isn't nice, but works well. Seeking backwards may be implemented by something like this:

inflateEnd (&stream);
fseek (input, 0, SEEK_SET);
inflateInit (&stream);
/* read up to the requested position */

This will be terribly slow for some cases, but well..

I'm just wondering how seeking with SEEK_END could be implemented without inflating the whole stream at first.

no comments | Read more...


What package is eating up my disk space?

Enrico, how about dpigs(1) from the debian-goodies package to solve your disk usage problem?

It's written in #!/bin/sh and not that nifty, but, if you omit the option parsing and usage code, it's even shorter. It basically does

grep-status -nsInstalled-size,Package -F Status ' installed' $STATUS \
| perl -p00l12 -e 's/\n/ /' \
| sort -rn \
| head --lines=$LINES

More

aircrack-ng hacking

Posted on 2006-4-1 (土) at 3:36 am by Unknown Name [SIGNED]

Posted in: Hacking

Aircrack, one my favourite wifi tools, was quite dead for a while. Bad enough, but after I switched to the madwifi-ng driver recently I discovered that I can't do injection anymore, which is quite bad. I searched for a workaround and found aircrack-ng, a fork of aircrack, which has been made recently. There weren't much changes to the aircrack codebase yet, but at least everything works well with madwifi-ng now.

Now, as aircrack seems to be moving again I mplemented some changes I always wanted to have, such as attack names (documentation patch) instead of attack numbers I could never remember and tried to get them submitted to upstream.

What I normally do that situation is to check out the latest development version from the projects version control system, prepare a patch against it and send it to the development mailing-list afterwards. That's hardly possible with aircrack-ng. Not only that one or two persons are having a master copy of the code instead of using a nice, public accessibly version control system, but also the total lack of mailing-lists or any other possibility to coordinate development besides an IRC channel and a wiki, which is only partially editable, makes contributing quite hard. After announcing some of my patches on IRC several times one of the upstream developers finally took a look at it and said he applied it so it'll be released in the near future.

I'm happy about that, but the deficits in the development process still exist. Therefor I offered to sponsor hosting, including mailing-lists, a version control system, a bug-tracking system and whatnot. It was thankfully declined as they already seem to have a hosting plan. Great! Unfortunately they don't have the money for that currently, but they at least have plan. Hooray for aircrack-ng..

Debian packages for the latest release already exist and should be upload in the next days, btw.

no comments | Read more...


Even more wireless fun

Posted on 2006-4-1 (土) at 3:26 am by Unknown Name [SIGNED]

Posted in: Hacking

After having some fun with madwifi-ng and hostap already, I decided to put the Atheros MiniPCI card I bought some months ago in my laptop to be able to use the cool madwifi-ng drivers there as well.

Unfortunately after putting it in and booting you get that:

1802: Unauthorized network card is plugged in -
Power off and remove the miniPCI network card.

Some investigation brought up that this is caused he card's PCI-ID being checked against a whitelist in the BIOS. I liked IBM ThinkPads, and especially my X41, quite much so far, but this check hardly seems to have a technical background. It just forces people to buy IBM authorized cards..

Anyway, I wanted to use an Atheros card. So what were the possibilities?

Buying a PCMCIA card with Atheros chip? No.. I already had an Atheros card around and wanted to use the integrated wifi. Some people reported that the check could be switched of by flipping a CMOS bit. That doesn't seem to work with my X41. I could still have flashed the BIOS with a version with a changed whitelist, but I didn't want to fuck up a 1,600 EUR laptop. Therefor I decided to change the PCI-ID of my card. Only 40 EUR would have been lost if I did it wrong.

I found a nice article that explains how to do that very well. Basically you need to

remove the MiniPCI wifi card
switch the laptop on
put the card in while the laptop is running (after the BIOS checks and before the Linux kernel boots)
install madwifi
use a special tool to change the PCI-ID to the ID of an authorized card
hack the driver to recognize the card with the new PCI-ID
go outside, enjoy the nice weather and hack

Life can be so easy.

no comments | Read more...


Wireless fun

Posted on 2006-3-27 (月) at 1:02 pm by Unknown Name [SIGNED]

Posted in: Hacking

Have you ever been bitten by a problem that resets all wifi settings like mode, essid and whatnot just some seconds after you set them? A certain mister Dreker suggest to simply shut down wpa_supplicant on that machine if it is running, as it always resets such settings to be able to receive all packets from all access points.

So, after doing so I now know why my wifi access point, which is just a normal Debian box, wasn't working for the last couple of months. For some reason wpa_supplicant was installed and running.. I just can't remember to have installed it as it hardly makes sense to me. Anyway, after I could set the wifi settings again I thought it wouldn't be much of a problem to get hostapd with WPA-EAP working again. Wrong!

wpa_supplicant on the client side somehow thought it was authenticated, but it actually wasn't. So I checked the debug output of hostapd which said something like that:

ioctl[IEEE80211_IOCTL_SETMLME]: Argument list too long

and after that line it said the it deauthenticated the client because of a local request, so I thought that's the problem. It pretty much looks like a binary incompatibility to me. I updated the madwifi drivers before in an attempt to to fix the essid reset problem, so hostapd, which needs to be compiled against the madwifi headers didn't work properly anymore. So I tried to recompile it and things got even worse:

ioctl[unknown???]: Operation not supported

I had no idea what to do about that, so I joined #madwifi on freenode. I have been told that I'll need at least hostapd 0.4.whatnot to make it run with madwifi-ng. Unstable has 0.5.0, so I downloaded that package, which contains madwifi-ng includes do build against, and compiled it on my sarge box. Almost exactly in the moment I was done I found out that I would need hostapd 0.5.2 to make it work with madwifi-ng and WPA2. Grr...

So I compiled it again, installed it and everything should have run fine. For some reason it didn't. After a while starting and stopping it to take a look at the debug output it somehow started to work and it keeps working up to now. I have actually no idea what could have caused that problem. Stijn Tintel on #madwifi could think of re-keying at the wrong moment or something like that, but isn't quite sure about that.

Anyway, wifi is up and working again and that makes me happy. :-)

no comments | Read more...


Navigation

Categories

Login