I’m currently working on a problem for which I just cannot seem to find a proper solution. Maybe you can help me, thanks!
What I am trying to do
- The web service returns JSON whose value is valid Base64 (the content was UTF-8 before it was encoded): requests.get(url, stream=True)
- Stream the response from requests in chunks (chunk size 1020): iter_content(chunk_size=1020, decode_unicode=False)
- Do some work on each chunk (use a regex to remove everything that is not Base64)
- Add padding if len(chunk) % 4 != 0
- Decode each Base64 chunk: base64.b64decode(lines_prepared_after_stream).decode('utf-8')
- Write the decoded UTF-8 into a file
But this does not seem to work. The decoding runs, but it does not deliver correct UTF-8 for each chunk, so I cannot write the result into a file properly.
Any ideas where my reasoning went wrong?
Example data (my real data looks like this, only many times the size of this txt file). JSON with Base64:
{ "blob":"bmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXM=" }
Txt file:
nam
tom
sarah
tim
nim
bim
sven
monika
chris
bla
blub
bom
sdfsdfsddasdasdas
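Code (shortened, with parts skipped, but something along these lines):

    import base64
    import os

    import requests

    def stream_decode_write(document_ids, client: requests.Session, chunk_size=1024):
        # document_id, document_url, constants and replace_json_input
        # come from code that is skipped here.
        # Opened in text mode because decoded (a str) is written below.
        f = open(os.path.join(constants.DEFAULT_DATA_UPLOAD_DIR, document_id), "w", encoding="utf-8")
        r = client.get(document_url, stream=True, verify=False)
        for lines in r.iter_content(chunk_size, decode_unicode=True):
            # filter keep-alive
            if lines:
                lines = replace_json_input(lines)
                missing_padding = len(lines) % 4
                if missing_padding:
                    lines += '=' * (4 - missing_padding)
                # each chunk is decoded on its own, independent of its neighbours
                decoded = base64.b64decode(lines.strip()).decode('utf-8')
                f.write(decoded)
        f.close()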
Code I use to encode:

    import base64

    def encode_to_base64(file_to_encode) -> str:
        # str -> UTF-8 bytes -> Base64 bytes -> ASCII str
        byte_coding = file_to_encode.encode()
        data = base64.b64encode(byte_coding)
        return data.decode('ascii')

    def read_from_file(file_path):
        with open(file_path, "r", encoding='utf-8') as f:
            return f.read()

- first read, then encode, then upload as JSON
I am guessing that some chunks need a few extra bytes before or after them to be proper UTF-8. Maybe I need something like that, but I am unsure.
Or I am dividing the information into chunks in the wrong places, where the data actually continues across the chunk boundary.
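The second guess can be checked with a small experiment. The following is only a sketch under assumptions: the payload is made up, and `decode_piece` and `decode_in_pieces` are hypothetical helper names. It pads each fragment the same way as the code above and compares a cut on a 4-character boundary with a cut one character earlier. Note that the log output further down shows a chunk of 1015 characters after the replace step, and 1015 is not a multiple of 4.

```python
import base64
import binascii


def decode_piece(piece: str):
    """Pad a Base64 fragment as in the question, then try to decode it."""
    missing = len(piece) % 4
    if missing:
        piece += "=" * (4 - missing)
    try:
        return base64.b64decode(piece)
    except binascii.Error:
        return None  # the fragment cannot be decoded on its own


# Made-up payload: 150 bytes of UTF-8 text, i.e. 200 Base64 characters.
original = ("nam\ntom\nsarah\n" * 10).encode("utf-8")
encoded = base64.b64encode(original).decode("ascii")


def decode_in_pieces(cut: int):
    """Decode the two fragments separately and join the results."""
    pieces = [decode_piece(encoded[:cut]), decode_piece(encoded[cut:])]
    if None in pieces:
        return None
    return b"".join(pieces)


aligned = decode_in_pieces(40)     # cut on a 4-character boundary
misaligned = decode_in_pieces(39)  # cut inside a 4-character group

print(aligned == original)     # True
print(misaligned == original)  # False
```

Every 4 Base64 characters encode 3 bytes, so a fragment that starts or ends inside such a group cannot be repaired with padding alone; the characters that are missing sit in the neighbouring chunk.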
I looked into this one: "Validate that a stream of bytes is valid UTF-8 (or other encoding) without copy", without success.
Maybe useful: I control how the file is encoded. Currently I load the whole file into memory, encode it (UTF-8, then Base64), and send it upstream as JSON.
Example
Base64 encoded:
"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2R
kYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbml
rYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh"
Part which is not decodable:
ib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRh
Even if I add padding.
LOG output:
2019-04-03 10:51:26,549 - chunk_size: 1024 2019-04-03 10:51:26,549 - before replace: {"blob":"aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ 2019-04-03 10:51:26,549 - replaced chunk : 
aGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJpcwpibGEKYmx1Ygpib20Kc2Rmc2Rmc2RkYXNkYXNkYXMKZGFzZGFzCmRhc2Rhc2QKYXNkYWRhaGFsbG8KbmFtCnRvbQpzYXJhaAp0aW0KbmltCmJpbQpzdmVuCm1vbmlrYQpjaHJ 2019-04-03 10:51:26,549 - chunk_size after change: 1015
Answer
The solution I finally used, with help from Mark and a colleague:
Because I am streaming, I cannot ensure that every chunk I receive ends on a complete multibyte Unicode character, so decoding can fail, as @mark already mentioned. With this in mind, I worked on another solution.
It seems the decoder fails whenever the received data contains a multibyte character that is not complete within the chunk.
data = '我去过上海。'
Let’s say this is 5 bytes, but I only receive 4 bytes.
decoded = base64.b64decode(data).decode('utf-8')
will fail, so I probably need a byte from the next chunk. I can wrap this in a try/except block and try again. This greedy algorithm should hopefully work: if the data is not multibyte, it can be decoded right away; if it is multibyte, decoding fails and the algorithm greedily takes more bytes until it can decode.
prev_chunk.append(current_chunk.pop(0))
I need to repeat this until I either get properly decoded UTF-8 or there are no bytes left to move between prev_chunk and current_chunk.
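The retry loop described above could look roughly like the following. This is a minimal sketch, not the exact code I ended up with. It assumes the chunks are already `str` fragments of pure Base64 (JSON wrapper stripped), and `try_decode` and `decode_stream` are hypothetical names chosen for illustration.

```python
import base64
import binascii


def try_decode(b64_text: str):
    """Return the decoded text, or None if the fragment is incomplete."""
    try:
        return base64.b64decode(b64_text, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None


def decode_stream(chunks):
    """Yield decoded text from an iterable of Base64 text chunks."""
    prev_chunk = ""
    for current_chunk in chunks:
        if not prev_chunk:
            prev_chunk = current_chunk
            continue
        # Greedy step: move characters from the front of current_chunk to
        # the end of prev_chunk until prev_chunk decodes or nothing is left.
        moved = 0
        decoded = try_decode(prev_chunk)
        while decoded is None and moved < len(current_chunk):
            prev_chunk += current_chunk[moved]
            moved += 1
            decoded = try_decode(prev_chunk)
        if decoded is None:
            # current_chunk is used up; wait for the next chunk.
            continue
        yield decoded
        prev_chunk = current_chunk[moved:]
    if prev_chunk:
        # End of stream: whatever is left must decode as a whole.
        decoded = try_decode(prev_chunk)
        if decoded is None:
            raise ValueError("stream ended with incomplete Base64 or UTF-8 data")
        yield decoded


# Usage with made-up data and a deliberately awkward chunk size of 7.
text = "我去过上海。\n" * 20
encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
chunks = [encoded[i:i + 7] for i in range(0, len(encoded), 7)]
result = "".join(decode_stream(chunks))
print(result == text)  # True
```

The loop re-decodes prev_chunk on every retry, which is simple but wasteful for large chunks. The standard library's `codecs.getincrementaldecoder('utf-8')` buffers an incomplete trailing character by itself and may be an alternative for the UTF-8 part, as long as the Base64 text is cut on 4-character boundaries first.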