
Python: stream-decode Base64 to valid UTF-8

I’m currently working on a problem for which I just cannot seem to find the proper solution. Maybe you can help me, thanks!

What I am trying to do

  • Web returns JSON (the value is valid Base64, which was UTF-8 text before encoding)
    • requests.get(url, stream=True)
  • streaming from requests (chunk_size=1020)
    • iter_content(chunk_size=1020, decode_unicode=False)
  • do some chunk work (stripping everything that is not Base64 with a regex)
  • add padding if len(chunk) % 4 != 0
  • decoding each Base64 chunk
    • base64.b64decode(lines_prepared_after_stream).decode('utf-8')
  • write the decoded UTF-8 into a file

But this does not seem to work. The decoding works, but chunk-wise it does not deliver correct UTF-8, so I cannot write it into a file properly.
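A minimal sketch of the steps above reproduces the symptom. The payload here is illustrative ("ü" repeated, not the original data): a 1020-character Base64 chunk decodes to 765 bytes, an odd number, so it can end in the middle of a multibyte character.

```python
import base64
import re

# Illustrative payload: "ü" * 400 as UTF-8 text, then Base64-encoded.
payload = base64.b64encode(("ü" * 400).encode("utf-8")).decode("ascii")

NOT_B64 = re.compile(r"[^A-Za-z0-9+/=]")       # strip anything that is not Base64

def decode_chunk(chunk: str) -> str:
    chunk = NOT_B64.sub("", chunk)
    chunk += "=" * (-len(chunk) % 4)           # pad if len(chunk) % 4 != 0
    return base64.b64decode(chunk).decode("utf-8")

# Simulate iter_content(chunk_size=1020): the first chunk decodes to 765
# bytes, which splits the 2-byte "ü" at the chunk boundary.
chunks = [payload[i:i + 1020] for i in range(0, len(payload), 1020)]
try:
    text = "".join(decode_chunk(c) for c in chunks)
except UnicodeDecodeError as exc:
    print("chunk-wise decode failed:", exc.reason)
```

Note that the Base64 step itself succeeds here; it is the UTF-8 decode that fails at the chunk boundary.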

Any ideas where my reasoning went wrong?

Example data (my case was this data, just at multiples of this txt file’s size). JSON Base64:


Txt file:


Code (shortened and skipped, but something along these lines):


Code I use to encode:

  • first read, then encode, then upload as JSON
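A sketch of that encoding side, assuming a local text file and a JSON key named "value" (both names are illustrative, not from the original post):

```python
import base64
import json

# Create a sample input file so the sketch is self-contained;
# the filename and contents are assumptions for illustration.
with open("data.txt", "w", encoding="utf-8") as f:
    f.write("grüße aus münchen")

with open("data.txt", "rb") as f:
    raw = f.read()                           # first read the whole file ...

b64 = base64.b64encode(raw).decode("ascii")  # ... then Base64-encode ...
body = json.dumps({"value": b64})            # ... then upload as JSON
```

This is the value the web service later returns, which the streaming side has to decode again.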

I am guessing that some chunks need bytes prepended or appended to be proper UTF-8. Maybe I need something like that, but I am just unsure. Or I am dividing the information into chunks at the wrong places, where the information actually continues.

I looked into this one: “Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy”, without success.

Maybe useful: I do have control over the file encoding. Currently I am loading the file into memory, Base64-encoding its UTF-8 contents, and sending it upstream as JSON.

Example

Base64-encoded:


Part which is not decodable:


even if I add padding.

LOG output:



Answer

Solution I finally used, with help from Mark and a colleague.

Because I am streaming, I cannot ensure that every chunk I receive ends on a complete multibyte Unicode sequence, so decoding can fail, as @Mark already mentioned. With this in mind, I worked on another solution.

As it turns out, the decoder fails if the received data ends in the middle of a multibyte Unicode sequence, i.e. if the chunk is not complete.


Let’s say this is 5 bytes, but I only receive 4 of them: decoding will fail, so I probably need a byte from the next chunk. I can wrap the decode in a try/except block and try again. This greedy algorithm should work: if the data is not multibyte, it decodes immediately; if it is an incomplete multibyte sequence, it fails and greedily takes more bytes until it can decode.
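Concretely (an illustrative example, not the original data): "ü" is two bytes in UTF-8, and decoding only the first of them fails until the byte from the next chunk arrives.

```python
# "ü" encodes to two bytes in UTF-8: b'\xc3\xbc'
data = "ü".encode("utf-8")

try:
    data[:1].decode("utf-8")      # the chunk boundary split the character
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# once the byte from the next chunk is appended, decoding succeeds
print((data[:1] + data[1:]).decode("utf-8"))
```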


I need to repeat this action until I either get properly decoded UTF-8 or combining prev_chunk and current_chunk byte-wise yields nothing.
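A sketch of that greedy carry-over approach (my reconstruction, not the author’s exact code; it assumes each incoming chunk is individually valid Base64, as with 1020-character chunks):

```python
import base64

def stream_decode(b64_chunks):
    """Decode Base64 chunks to text, carrying undecodable tail bytes forward."""
    pending = b""                          # bytes left over from the previous chunk
    for chunk in b64_chunks:
        pending += base64.b64decode(chunk)
        try:
            yield pending.decode("utf-8")
            pending = b""
        except UnicodeDecodeError:
            # The chunk ended mid-character: decode what we can and keep
            # up to 3 trailing bytes (a UTF-8 character is at most 4 bytes).
            for cut in range(1, 4):
                try:
                    yield pending[:-cut].decode("utf-8")
                    pending = pending[-cut:]
                    break
                except UnicodeDecodeError:
                    continue
    if pending:
        pending.decode("utf-8")            # raise if trailing bytes never completed

# usage: text = "".join(stream_decode(chunks_from_iter_content))
```

For what it’s worth, Python’s `codecs.getincrementaldecoder("utf-8")` implements exactly this carry-over logic and avoids the manual retry loop.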

User contributions licensed under: CC BY-SA