I have a compressed file that is a CAB that I wish to extract a file from in Linux. Since there aren’t any native CAB extractors on Linux, I figured I’d try my hand at getting one done.
While I’ve seen the MSZIP documentation[0] as well as [1] and [2], I’m having difficulties in decompressing it even given that each Block is compressed using a modified DEFLATE compressor. While I did use [3] to decompress the blocks, aside for the first block, the rest of the blocks had a lot of missing data [data which were completely nulled [i.e. 0x00 filled].
So I’m lead to believe I’ll need to figure it out manually.
Unfortunately, I am having trouble understanding [2]. So, with the following data:
ED 9D 79 70 1C C7 75 87 07 E0 4D 88 24 EE 8B 94 04 1E A2 44 51 04 81 05 08 2D 11 4A C2 5E 00 96 ...
And the following code to dissect the data:
#!/bin/env python import os def to_bin(in_item): b = ord(chr(in_item)) retval = bin(b).replace("0b", "") if len(retval) < 8: lx = 8 - len(retval) q = "0" * lx retval = q + retval return retval def swapbytes(in_bit_str): p = in_bit_str[:4] q = in_bit_str[4:] return q + p pth = os.path.join("comp", "test_0.dat") fp = open(pth, 'rb') fpd = fp.read(10) fp.close() dx = "" for item in fpd: cx = to_bin(item) scx = swapbytes(cx) dx += scx print(cx) print("---> %s" % scx) print("Final: %s" % dx) bfinal = dx[0] btype = dx[1:3] print("Bfinal: %s" % bfinal) print("Btype: %s" % btype) rest = dx[3:] hlit = rest[:5] hdist = rest[5:11] hclen = rest[11:15] hlitd = int(hlit, 2) hdistd = int(hdist, 2) hclend = int(hclen, 2) print("HLIT: %s [%d]" % (hlit, hlitd)) print("HDIST: %s [%d]" % (hdist, hdistd)) print("HCLEN: %s [%d]" % (hclen, hclend))
I get the following output:
$ python tstdcmp.py 11101101 ---> 11011110 10011101 ---> 11011001 01111001 ---> 10010111 01110000 ---> 00000111 00011100 ---> 11000001 11000111 ---> 01111100 01110101 ---> 01010111 10000111 ---> 01111000 00000111 ---> 01110000 11100000 ---> 00001110 Final: 1101111011011001100101110000011111000001011111000101011101111000011100000 0001110 Bfinal: 1 Btype: 10 HLIT: 11110 [30] HDIST: 11011 [27] HCLEN: 0011 [3]
So the basic concept that I got from the RFC1951 [2], was that in a byte, it’s from highest bit to lowest bit. i.e. 7 6 5 4 3 2 1 0 But for multiple bytes, i.e. AB CD, the bytes are swapped to CD AB.
So with the code, I determined that the first two bytes ED 9D
should actually be 9D ED
.
And in bits:
11011110 11011001
That would mean BFinal == 1 and BTYPE == 10 (so dynamic huffman encoding).
With the first three bits removed, I get:
11110 11011001
So from 3.2.7 from [2], I should have the following info of the next few sets of bits:
5 Bits: HLIT, # of Literal/Length codes - 257 (257 - 286) 5 Bits: HDIST, # of Distance codes - 1 (1 - 32) 4 Bits: HCLEN, # of Code Length codes - 4 (4 - 19)
So, HLIT = 11110 [30] HDIST = 11011 [27] HCLEN = 0011 [3]
But from the above, the HLIT should be between 257 and 286 and HCLEN between 4 and 9; but I get HLIT = 30 and HCLEN = 3.
That said, the documentation in 3.2.7:
For even greater compactness, the code length sequences themselves are compressed using a Huffman code.
So is this an example of further compressing with huffman codes? I had hoped to understand this RFC1951[2]; but it’s terribly confusing.
I have seen some Youtube videos on DEFLATE compression; but they show mostly on the algorithm and not how the bytes are packed in a file.
Any help greatly appreciated.
Thanks
[1] – How to decompress MSZIP files with c#?
[2] – https://datatracker.ietf.org/doc/html/rfc1951
[3] – https://www.nayuki.io/page/simple-deflate-decompressor
Advertisement
Answer
- No, the “basic concept” you thought you got from RFC 1951 is completely wrong. First off, the bits in each byte are read from least significant to most significant. Second, you do not reverse the bytes in the stream. The first eight bits are first read from the first byte, the second eight bits from the second byte, and so on. (Lengths in stored blocks are stored little-endian, but that is neither reversed nor not reversed. It is simply how a 16-bit length is serialized in the byte stream.)
- Once you read the bits correctly, the
HLIT
, etc. values are stored exactly as stated in the RFC. The first five bits are the number of literal/length codes minus 257. So you take the value of the five bits, which gives a number in 0..31, and add 257 to that. That gives a number in the range 257..288. The allowed range is actually 257..286 as noted on that same line, so the last two possible values of the five bits, 30 and 31, should not appear in a valid deflate stream. - RFC 1951 is not confusing at all. It is a clear and complete description of the format. However you need to have sufficient background in compression, in particular Huffman codes, to understand it. The RFC was not intended to be a textbook on compression, nor a textbook on how integers are coded in bits.
- It is clear it would take you some time to figure this all out. Fortunately, you do not need to write your own inflator. You can instead use zlib. Read the documentation in zlib.h for all of the
inflate
functions. - In CAB files, MSZIP CFDATA blocks use the history from previous CFDATA blocks, until a folder boundary is reached. Even though each block is a properly terminated deflate stream, the next block can refer to uncompressed data from the previous block. To process CFDATA blocks after the first, you will need to use the
inflateResetKeep()
function of zlib to restart the inflate process while retaining the dictionary from the previous inflate operation.
For reference, here is a decoding of the initial bytes of the deflate stream you provided, using infgen:
! infgen 2.5 output ! last ! 1 dynamic ! 10 count 286 30 16 ! 1100 11101 11101 code 16 4 ! 100 code 17 7 ! 111 code 0 4 ! 100 000 code 8 3 ! 011 code 7 4 ! 100 code 9 3 ! 011 code 6 4 ! 100 code 10 3 ! 011 code 5 4 ! 100 code 11 3 ! 011 code 4 5 ! 101 code 12 3 ! 011 code 3 7 ! 111 code 2 6 ! 110 000 lens 6 ! 1100 lens 8 ! 000 lens 8 ! 000 lens 9 ! 001 repeat 6 ! 11 1110 lens 9 ! 001 lens 8 ! 000 lens 10 ! 010 lens 9 ! 001 lens 9 ! 001 lens 9 ! 001 lens 8 ! 000 repeat 6 ! 11 1110 repeat 4 ! 01 1110 lens 9 ! 001 lens 9 ! 001 lens 10 ! 010 lens 10 ! 010 lens 10 ! 010 lens 8 ! 000 lens 9 ! 001 repeat 3 ! 00 1110 lens 10 ! 010 lens 9 ! 001 lens 10 ! 010 lens 10 ! 010 lens 9 ! 001 lens 10 ! 010 lens 9 ! 001 lens 10 ! 010 lens 9 ! 001 lens 8 ! 000 lens 9 ! 001 lens 8 ! 000 lens 8 ! 000 lens 7 ! 1101 lens 8 ! 000 lens 8 ! 000 lens 9 ! 001 lens 8 ! 000 lens 10 ! 010 lens 7 ! 1101 lens 9 ! 001 lens 8 ! 000 lens 12 ! 100 lens 9 ! 001 lens 10 ! 010 lens 10 ! 010 lens 10 ! 010 lens 8 ! 000 lens 7 ! 1101 repeat 4 ! 01 1110 lens 8 ! 000 lens 8 ! 000 lens 8 ! 000 lens 7 ! 1101 lens 9 ! 001