
Strange gzip – almost extracted, but not totally correct

Some program sends some info that starts with \x1f\xe2\x80\xb9\x08\x00\x00\x00\x00\x00\x04\x00M... to the server and receives a text response. I need to work out what that info is.

In fact, I need a method to convert the real string into that identical gzipped string, so that I can receive responses without that program.

After some investigation I found that I should first convert the data from utf8 to cp1251 (after that, the first bytes \x1f\xe2\x80\xb9\x08 become \x1f\x8b\x08, which is the typical gzip magic string). The result is a corrupted gzip, but if I cut off its header (the first 10 bytes) I can extract a final readable message.
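The byte-level effect described above can be reproduced with a short sketch. It assumes the sender read the binary data as cp1251 text and then stored it as UTF-8; the variable names are illustrative only.

```python
# gzip magic bytes followed by the compression-method byte (8 = deflate)
gzip_magic = b"\x1f\x8b\x08"

# Reading the bytes as cp1251 text turns 0x8b into U+2039,
# which UTF-8 stores as the three bytes e2 80 b9.
mangled = gzip_magic.decode("cp1251").encode("utf-8")
print(mangled)  # b'\x1f\xe2\x80\xb9\x08'

# Going the other way restores the original magic bytes.
restored = mangled.decode("utf-8").encode("cp1251")
print(restored == gzip_magic)  # True
```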

But this message is slightly corrupted (it starts correctly, but later some symbols are shuffled).

What should be done to properly read the data?

I guess that during decode_binary_from_utf8_to_cp1251 I lose some info, because if I don't use on_errors='replace' the data can't be converted at all (I've tried other encodings that also turn \x1f\xe2\x80\xb9\x08 into \x1f\x8b\x08, but without success; none of them could convert 100% of the data without errors). Also, when I cut the header (the first 10 bytes of the gzipped string), some data may be lost.
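A minimal sketch of how the "replace" error handler can silently lose information. It uses U+0098 as the example character, on the assumption that the data contains it; Python's cp1251 codec has no byte for that character.

```python
# A three-character string whose middle character (U+0098) has no cp1251 byte
text = "\x1f\u0098\x08"

try:
    text.encode("cp1251")  # strict mode refuses to guess
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# "replace" substitutes '?' for the unencodable character, so the original byte is gone
lossy = text.encode("cp1251", "replace")

print(strict_ok)  # False
print(lossy)      # b'\x1f?\x08'
```

Inside a compressed stream, a single substituted byte like this is enough to derail the decompressor from that point onward.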

My code:

import zlib
import base64


def decode_binary_from_utf8_to_cp1251(data):
    enc_from = "utf8"
    enc_to = "cp1251"
    on_errors = "replace"
    # on_errors = ""
    return data.decode(enc_from, on_errors).encode(enc_to, on_errors)


def remove_archive_signature_from_start(data):
    return data[10:]


def decompress_gzip(body):
    args = (-zlib.MAX_WBITS | 16,)  # working
    return zlib.decompress(body, *args)


def convert_binary_to_normal_text(b, encoding="cp1251"):
    b = b.decode(encoding, "replace")
    return b


base64_encoded = "L2dldC8f4oC5CAAAAAAABABN4oCZX+KAmdCrIAzQltCH0KLQ ... gMAAA=="

data = base64.b64decode(base64_encoded)[5:]
# data = b'\x1f\xe2\x80\xb9\x08\x00\x00\x00\x00\x00\x04\x00...\x03\x00\x00'

new_data = decode_binary_from_utf8_to_cp1251(data)
new_data = remove_archive_signature_from_start(new_data)

decompressed = decompress_gzip(new_data)
normal_text = convert_binary_to_normal_text(decompressed)

print(f"{normal_text=}")

This returns text like:

...
;btennis,1oatchoomkcom®1i,hoomkcomwilliamhillmkcomwom;bein.zegoalbet.cal;bmosityom;beokt;favet.colpasbein.zeni;bmosbet.learathssityligbetavtchoomkpar)rrathssitnoarathoinfo
...

It starts correctly, but later some symbols are shuffled (I know for certain that it should include the string ;wwin.com;zebet.com;baltbet.ru;winlinebet.com;golpas.com;zenitbet.com;leonbets.ru;ligastavok.com;parimatch.com;fonbet.info®).

Any ideas what I am missing?


Answer

“Some program” has a bug that needs to be fixed. In general, passing binary data through a text encoding conversion such as this one is not lossless, so the original data is not guaranteed to be recoverable. That program should not do any such conversion, and should instead send the original binary.

I was able to recover the original gzip file from the example by using the table on the Windows-1251 Wikipedia page, with one addition. You will note that the table has no entry for the byte 0x98. I assumed that the Unicode character U+0098 translates to the byte 0x98. Applying that translation and dropping the first five bytes of the result gives a valid gzip stream with a correct CRC and length check.

There is no guarantee that this will work in general, since the provided example does not have all possible byte values.
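The translation described in this answer can be sketched in Python roughly as follows. This is only a sketch under the same assumption (U+0098 stands for the byte 0x98); the function names `mangle`, `utf8_mangled_to_bytes` and `recover_payload` are made up for illustration, and the sample data is not the payload from the question.

```python
import gzip


def utf8_mangled_to_bytes(mangled: bytes) -> bytes:
    """Undo a cp1251 -> UTF-8 text conversion that was applied to binary data."""
    out = bytearray()
    for ch in mangled.decode("utf-8"):
        if ch == "\u0098":
            # Assumption: cp1251 leaves 0x98 undefined and the sender
            # passed that byte through as U+0098.
            out.append(0x98)
        else:
            out += ch.encode("cp1251")
    return bytes(out)


def recover_payload(raw: bytes, prefix_len: int = 5) -> bytes:
    """Restore the bytes, drop the prefix, and decompress the gzip stream."""
    restored = utf8_mangled_to_bytes(raw)
    # gzip.decompress verifies the CRC and length stored in the gzip trailer
    return gzip.decompress(restored[prefix_len:])


def mangle(original: bytes) -> bytes:
    """Simulate the assumed bug, for testing: cp1251 text re-encoded as UTF-8."""
    chars = [
        "\u0098" if b == 0x98 else bytes([b]).decode("cp1251")
        for b in original
    ]
    return "".join(chars).encode("utf-8")


# Round trip on made-up sample data with a five-byte prefix
sample_text = b";wwin.com;zebet.com;baltbet.ru;"
sample = b"/get/" + gzip.compress(sample_text)
assert recover_payload(mangle(sample)) == sample_text
```

Because the full gzip stream is kept, including its header and trailer, a wrong byte anywhere in the restored data should surface as a decompression or CRC error rather than as silently shuffled text.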

User contributions licensed under: CC BY-SA