Some program sends some info that starts with \x1f\xe2\x80\xb9\x08\x00\x00\x00\x00\x00\x04\x00M...
to the server and receives a text response. I need to work out what that info is.
In fact, I need a method to convert the real string into that identical gzipped string, so that I can receive responses without that program.
After investigating, I found that I should first convert the data from utf8 to cp1251 (after that, the first symbols \x1f\xe2\x80\xb9\x08 become \x1f\x8b\x08, which is the typical gzip magic string). The result is a corrupted gzip, but if I cut its header (first 10 symbols) I can extract the final readable message.
But this message is a little corrupted (it starts correctly, but later some symbols are shuffled).
What should be done to properly read the data?
I guess that I lose some info during decode_binary_from_utf8_to_cp1251, because the data can't be converted unless I use on_errors="replace". I've tried other encodings that also turn \x1f\xe2\x80\xb9\x08 into the \x1f\x8b\x08 magic, but without success: none of the encodings could convert 100% of the data without errors. Some data may also go missing when I cut the header (first 10 symbols of the gzipped string).
My code:
import zlib
import base64


def decode_binary_from_utf8_to_cp1251(data):
    enc_from = "utf8"
    enc_to = "cp1251"
    on_errors = "replace"
    # on_errors = ""
    return data.decode(enc_from, on_errors).encode(enc_to, on_errors)


def remove_archive_signature_from_start(data):
    return data[10:]


def decompress_gzip(body):
    args = (-zlib.MAX_WBITS | 16,)  # working
    return zlib.decompress(body, *args)


def convert_binary_to_normal_text(b, encoding="cp1251"):
    b = b.decode(encoding, "replace")
    return b


base64_encoded = "L2dldC8f4oC5CAAAAAAABABN4oCZX+KAmdCrIAzQltCH0KLQ ... gMAAA=="
data = base64.b64decode(base64_encoded)[5:]
# data = b'\x1f\xe2\x80\xb9\x08\x00\x00\x00\x00\x00\x04\x00...\x03\x00\x00'
new_data = decode_binary_from_utf8_to_cp1251(data)
new_data = remove_archive_signature_from_start(new_data)
decompressed = decompress_gzip(new_data)
normal_text = convert_binary_to_normal_text(decompressed)
print(f"{normal_text=}")
returns text like
... ;btennis,1oatchoomkcom®1i,hoomkcomwilliamhillmkcomwom;bein.zegoalbet.cal;bmosityom;beokt;favet.colpasbein.zeni;bmosbet.learathssityligbetavtchoomkpar)rrathssitnoarathoinfo ...
It starts correctly, but later some symbols are shuffled (I know for certain that it should include the string ;wwin.com;zebet.com;baltbet.ru;winlinebet.com;golpas.com;zenitbet.com;leonbets.ru;ligastavok.com;parimatch.com;fonbet.info®).
Any ideas what I am missing?
Answer
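“Some program” has a bug that needs to be fixed. In general, pushing binary data through a text conversion (here, apparently treating the bytes as Windows-1251 text and writing them out as UTF-8) is not lossless, so the original data is not guaranteed to be recoverable. That program should not do any such conversion and should instead send the original binary.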
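As a small illustration of why such a conversion can lose information, the sketch below pushes four bytes through a Windows-1251 decode and back using Python's built-in codecs. How the original program actually handles undefined bytes is unknown; the use of errors="replace" here is only an assumption that mirrors the code in the question.

```python
# Byte 0x98 has no character assigned in Windows-1251.
raw = bytes([0x1F, 0x8B, 0x08, 0x98])

# A strict decode refuses the undefined byte outright.
try:
    raw.decode("cp1251")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# With errors="replace", 0x98 becomes U+FFFD, and encoding that back
# to cp1251 yields "?", so the original byte value is gone.
text = raw.decode("cp1251", errors="replace")
round_trip = text.encode("utf-8").decode("utf-8").encode("cp1251", errors="replace")

print(strict_ok, round_trip == raw)
```

Any compressed stream that happens to contain such a byte would come out of this round trip damaged, which is consistent with output that starts correctly and then degrades.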
I was able to recover the original gzip file from the example by using the table on the Windows-1251 Wikipedia page, with an addition. You will note that that table has nothing for the character 0x98. I assume that the Unicode symbol U+0098 translates to the byte 0x98. Applying that translation and dropping the first five bytes of the result gives a valid gzip stream with a correct CRC and length check.
There is no guarantee that this will work in general, since the provided example does not have all possible byte values.
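The translation described above could be sketched in Python roughly as follows. This is a minimal sketch, not the exact procedure used for the recovery: the helper names (build_reverse_table, recover_bytes, simulate_bug) are hypothetical, the table comes from Python's built-in cp1251 codec rather than the Wikipedia page, and the U+0098 to 0x98 mapping is the same assumption stated in the answer. simulate_bug only guesses at what the sending program does, so that the round trip can be checked on known data.

```python
import gzip


def build_reverse_table():
    """Map each character back to its Windows-1251 byte value."""
    table = {}
    for b in range(256):
        try:
            ch = bytes([b]).decode("cp1251")
        except UnicodeDecodeError:
            # Assumption: an undefined byte (0x98) was passed through
            # as the code point with the same value (U+0098).
            ch = chr(b)
        table[ch] = b
    return table


REVERSE = build_reverse_table()
FORWARD = {b: ch for ch, b in REVERSE.items()}


def recover_bytes(mangled: bytes) -> bytes:
    """Undo the conversion: UTF-8 text back to single bytes."""
    text = mangled.decode("utf-8")  # strict, so real damage is not hidden
    return bytes(REVERSE[ch] for ch in text)


def simulate_bug(raw: bytes) -> bytes:
    """Hypothetical model of the sender: bytes read as cp1251, sent as UTF-8."""
    return "".join(FORWARD[b] for b in raw).encode("utf-8")


payload = b";wwin.com;zebet.com;baltbet.ru;"
original = gzip.compress(payload)
mangled = simulate_bug(original)
restored = recover_bytes(mangled)
print(gzip.decompress(restored) == payload)
```

With the question's data, the bytes left after stripping the five-byte prefix would be passed to recover_bytes, and the whole result handed to gzip.decompress without cutting off the 10-byte header. A KeyError from recover_bytes would mean the input contains a character outside this table, which is the case the caveat above warns about.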