
Python Stream Decode Base64 to valid utf8

I’m currently stuck on a problem and just cannot seem to find the proper solution. Maybe you can help me, thanks!

What I am trying to do

  • A web service returns JSON (the value is valid Base64, encoded from UTF-8 text)
    • requests.get(url, stream=True)
  • stream the response from requests (chunk_size=1020)
    • iter_content(chunk_size=1020, decode_unicode=False)
  • do some per-chunk work (strip everything that is not Base64 with a regex)
  • add padding if chunk % 4 != 0
  • decode each Base64 chunk
    • base64.b64decode(lines_prepared_after_stream).decode('utf-8')
  • write the decoded UTF-8 into a file

But this does not seem to work. The decoding runs, but it does not produce correct UTF-8 chunk-wise, so I cannot write it into a file properly.

Any ideas where my thinking went wrong?
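To illustrate where this can break (a minimal sketch, not the original code): if the stream is cut at a position that is not a multiple of 4 Base64 characters, padding each piece independently shifts the 6-bit alignment, and everything after the cut decodes to garbage:

```python
import base64

original = "nam\ntom\nsara".encode("utf-8")        # 12 bytes
blob = base64.b64encode(original).decode("ascii")  # 16 chars, no '=' padding

# Cut at index 6 -- not a multiple of 4 -- then pad each piece on its own,
# which is roughly what per-chunk padding does.
parts = [blob[:6], blob[6:]]
naive = b"".join(base64.b64decode(p + "=" * (-len(p) % 4)) for p in parts)

print(naive == original)  # False: everything after the cut is corrupted
```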

Example data (my real case uses the same data, just at a multiple of this txt file’s size). JSON with Base64:

{
"blob":"bmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXM="
}

Txt file:

nam
tom
sarah
tim
nim
bim
sven
monika
chris
bla
blub
bom
sdfsdfsddasdasdas

Code (shortened and abridged, but along these lines):

def stream_decode_write(document_id, client: requests.Session, chunk_size=1024):
    # document_url and constants are defined elsewhere (shortened here)
    f = open(os.path.join(constants.DEFAULT_DATA_UPLOAD_DIR, document_id),
             "w", encoding="utf-8")
    r = client.get(document_url, stream=True, verify=False)
    for lines in r.iter_content(chunk_size, decode_unicode=True):
        # filter out keep-alive chunks
        if lines:
            lines = replace_json_input(lines)
            missing_padding = len(lines) % 4
            if missing_padding:
                lines += '=' * (4 - missing_padding)
            decoded = base64.b64decode(lines.strip()).decode('utf-8')
            f.write(decoded)
    f.close()
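One way around the mis-aligned padding (a sketch of an alternative, not the code above): buffer the incoming Base64 text and decode only the longest prefix whose length is a multiple of 4, carrying the remainder into the next chunk. This assumes the JSON wrapper has already been stripped (as replace_json_input is meant to do):

```python
import base64

def stream_b64_to_bytes(chunks):
    """Decode a stream of Base64 text pieces of arbitrary length.

    Only the longest prefix whose length is a multiple of 4 is decoded
    each round; leftover characters are carried into the next round.
    """
    buf = ""
    for chunk in chunks:
        buf += "".join(chunk.split())  # drop whitespace/newlines
        usable = len(buf) - len(buf) % 4
        if usable:
            yield base64.b64decode(buf[:usable])
            buf = buf[usable:]
    if buf:
        # a leftover remainder means the stream itself was truncated;
        # pad defensively as a last resort
        yield base64.b64decode(buf + "=" * (-len(buf) % 4))
```

The decoded bytes can then be written straight to a file opened in binary mode; decoding them as UTF-8 is still a separate concern, since a chunk boundary can also split a multibyte character.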

Code I use to encode:

def encode_to_base64(file_to_encode) -> str:
    byte_coding = file_to_encode.encode('utf-8')
    data = base64.b64encode(byte_coding)
    return data.decode('ascii')

def read_from_file(file_path):
    with open(file_path, "r", encoding='utf-8') as f:
        return f.read()
  • first read, then encode, then upload as JSON
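For reference, this encode side round-trips cleanly as long as everything stays in memory; the problem only appears once the data is chunked. A minimal in-memory check, using sample data in place of a real file:

```python
import base64

text = "nam\ntom\nsarah\n"  # stands in for the file content
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
assert base64.b64decode(encoded).decode("utf-8") == text
```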

I am guessing that some chunks need bytes prefixed or appended to be proper UTF-8. Maybe I need something like that, but I am unsure. Or I am splitting the information into chunks at the wrong places, where a continuation is actually needed.

I looked into this question without success: Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy.

Maybe useful: I do have control over the file encoding. Currently I load the file into memory, encode it as UTF-8, Base64-encode that, and send it upstream as JSON.

Example

Base64 encoded:

"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2R
kYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbml
rYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh"

The part which is not decodable:

ib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh

even if I add padding.

LOG output:

2019-04-03 10:51:26,549 - chunk_size: 1024
2019-04-03 10:51:26,549 - before replace: {"blob":"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ
2019-04-03 10:51:26,549 - replaced chunk : aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ
2019-04-03 10:51:26,549 - chunk_size after change: 1015


Answer

Solution I finally used, with help from Mark and a colleague:

Because I am streaming, I cannot ensure that every chunk I receive ends on a complete multibyte UTF-8 sequence, so decoding can fail, as @Mark already mentioned. With this in mind, I worked out another solution.

As it turns out, the decoder fails if the received data ends in the middle of a multibyte sequence, i.e. if the chunk is not complete.
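This failure mode can be seen directly (a minimal sketch): decoding a byte string that is cut in the middle of a multibyte character raises UnicodeDecodeError:

```python
data = '我'.encode('utf-8')  # a single character, 3 bytes in UTF-8
try:
    data[:2].decode('utf-8')  # only 2 of its 3 bytes arrived
except UnicodeDecodeError as e:
    print("decode failed:", e.reason)
```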

data = '我去过上海。'

Let’s say this is 5 bytes, but I only receive 4 bytes.

decoded = base64.b64decode(data).decode('utf-8')

will fail, so I probably need a byte (or more) from the next chunk. I can wrap the decode in a try/except block and try again. This greedy algorithm should work: if the chunk does not end mid-sequence, it decodes right away; if it does, decoding fails and I greedily pull bytes until it succeeds.

prev_chunk.append(current_chunk.pop(0))

I repeat this action until I either get properly decoded UTF-8 or prev_chunk and current_chunk combined yield nothing byte-wise.
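The same effect is also available from the standard library: codecs offers an incremental UTF-8 decoder that holds back the dangling bytes of a split character by itself, as an alternative to the manual greedy retry. A sketch, assuming the chunks are already Base64-decoded bytes:

```python
import codecs

def decode_utf8_stream(byte_chunks):
    """Yield text from byte chunks that may split multibyte characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in byte_chunks:
        # incomplete trailing sequences are buffered until more bytes arrive
        yield decoder.decode(chunk)
    # flush; raises UnicodeDecodeError if truncated bytes remain
    yield decoder.decode(b"", final=True)

data = '我去过上海。'.encode('utf-8')        # 18 bytes, 3 per character
pieces = [data[:4], data[4:]]               # the cut splits a character in half
print("".join(decode_utf8_stream(pieces)))  # → 我去过上海。
```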

User contributions licensed under: CC BY-SA