Error reading file: 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte

I've got these two scripts, send.py and receive.py. send.py acts as the host: it opens a connection and waits for receive.py to connect. Once the connection is successful, in theory, I could send any file from one device (running send.py) to another (running receive.py). One small problem: I tried reading a random music file I found on my computer, to make sure it works with any type of file, and encountered the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 45: invalid start byte

What causes this error?

send.py:

from socket import *

port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.bind(('0.0.0.0', port))
s.listen(1)

c, addr = s.accept()

buffersize = 128

fname = '✵ТГК -Гелик 2022✵ Gelik✵-160 (mp3cut.net).mp3' #input('File Path: ')

with open(fname, 'rb') as file:
    readfc = file.read()

c.send(fname.encode())

if len(readfc) > buffersize:
    for packet in range(len(readfc) % buffersize):
        c.send(readfc[0:buffersize])

and receive.py:

from socket import *

port = 42069
s = socket(AF_INET, SOCK_STREAM)
s.connect(('192.168.0.171', port))

index = 0
while True:
    data = s.recv(1024)
    if not data:
        pass
    else:
        index += 1
        if index == 1:
            filename = data.decode()
        else:
            with open(filename, 'ab') as file:
                file.write(data.decode())

And here are the first lines from the music file:

ID3     #TSSE     Lavf59.16.100           яыа                                 Info     #R ђ.3 

!$&)+.0369:=@CEGJMORUVY_acfiknqsux{}Ђ‚…‡ЉЌЏ‘”—љњћЎЈ¦©«­°і¶ёєЅАВЕЗКМПТФЦЩЬЮбгжилортхшъэ    Lavc59.18            $@     ђ.3ЮЬмf        яыаD р  i

Answer

This code assumes that a single send in the sender matches a single recv in the recipient. That assumption is wrong for TCP: TCP is an unstructured byte stream, not a structured message transport that preserves message boundaries across send/recv.
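To make the missing message boundaries concrete, here is a minimal sketch. It uses socket.socketpair() as a local stand-in for the TCP connection (an assumption made so that the example runs on one machine; it is still a stream socket), and the filename and payload bytes are invented for illustration. How many chunks recv() returns is up to the operating system; only the total byte sequence is guaranteed.

```python
import socket

def recv_until_closed(sock):
    """Call recv() until the peer closes; return the chunks exactly as they arrived."""
    chunks = []
    while True:
        data = sock.recv(1024)
        if not data:  # empty bytes means the peer closed the connection
            break
        chunks.append(data)
    return chunks

# socketpair() gives two connected stream sockets (stand-in for sender/receiver)
a, b = socket.socketpair()
a.sendall(b"song.mp3")           # first send: a hypothetical filename
a.sendall(b"\xff\xfb\xe0\x44")   # second send: made-up binary payload
a.close()

chunks = recv_until_closed(b)
b.close()

received = b"".join(chunks)
# Two sends went in; the number of chunks coming out is not guaranteed to be two.
print(len(chunks), received)
```

If both sends arrive in one chunk, the receiver has no way to tell from the data alone where the filename stops.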

This means that the initial data = s.recv(1024) in the recipient might include not only the filename but also the first part of the music file. The buffer is then a mix of the UTF-8 encoded filename (multi-byte characters) followed by the binary music data (raw bytes). Calling filename = data.decode() on it decodes the filename part successfully, but the decoder then continues past the end of the filename and tries to interpret the binary music data as UTF-8 encoded characters too. This leads to the observed decoding error.
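The failure can be reproduced without any network at all. The sketch below glues a hypothetical filename to a few made-up binary bytes, the way a single recv() might deliver them, and decodes the result. The byte 0xFF can never occur in valid UTF-8, so the decoder stops exactly where the binary part begins.

```python
filename = "song.mp3"                       # hypothetical file name
audio = bytes([0xFF, 0xFB, 0xE0, 0x44])     # made-up binary data starting with 0xFF

# What one recv() call may hand back: filename and file content in one buffer
data = filename.encode() + audio

try:
    data.decode()
    error_position = None
except UnicodeDecodeError as exc:
    error_position = exc.start              # index of the first undecodable byte
    print(exc)

# Decoding only the filename part works fine
name_part = data[:len(filename.encode())]
print(name_part.decode())
```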

The fix should be to clearly mark where the filename ends and the binary data starts, and then decode only the filename as text and treat the rest as bytes. A common approach is to prefix the filename with its length so that it is clear where it ends. Another approach might be to add a \0 (NUL byte) at the end of the filename and split the incoming data on this delimiter. This works because a zero byte is never part of a valid UTF-8 encoded character other than NUL itself, which is invalid in filenames.
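A minimal sketch of the length-prefix approach follows. The helper names (recv_exact, send_named_payload, recv_named_payload) and the 4-byte big-endian length field are assumptions chosen for illustration, not part of the original scripts, and socket.socketpair() again stands in for the real connection. The file content is taken to be everything that follows the filename until the sender closes the connection.

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes, looping because recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before %d bytes arrived" % n)
        buf.extend(chunk)
    return bytes(buf)

def send_named_payload(sock, filename, payload):
    """Send: 4-byte length of the encoded filename, the filename, then the raw bytes."""
    name_bytes = filename.encode("utf-8")
    sock.sendall(struct.pack("!I", len(name_bytes)))
    sock.sendall(name_bytes)
    sock.sendall(payload)

def recv_named_payload(sock):
    """Receive the counterpart; only the filename is decoded as text."""
    (name_len,) = struct.unpack("!I", recv_exact(sock, 4))
    filename = recv_exact(sock, name_len).decode("utf-8")
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:            # sender closed: end of file content
            break
        chunks.append(chunk)     # kept as bytes, never decoded
    return filename, b"".join(chunks)

# Local demo with a small, made-up payload
sender, receiver = socket.socketpair()
send_named_payload(sender, "✵ song.mp3", b"ID3\x04\x00\xff\xfb\xe0")
sender.close()
name, payload = recv_named_payload(receiver)
receiver.close()
print(name, payload)
```

The length counts encoded bytes rather than characters, which matters for filenames with multi-byte characters like the one in the question.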

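Apart from that, the later data.decode() when writing the music data is plain wrong, since there is no matching encode() on the sender side. And there should not be one, since this is binary data, i.e. already bytes. The file is also opened in binary mode ('ab'), so write() expects bytes and would reject the str that decode() returns.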
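As a small illustration of that last point, the sketch below writes received chunks to a file in binary mode without any decode(). The chunks are made-up stand-ins for what recv() would return, and the temporary path is only there to keep the example self-contained.

```python
import os
import tempfile

# Made-up stand-ins for the bytes objects that successive recv() calls return
chunks = [b"ID3\x04\x00", b"\xff\xfb\xe0\x44"]

path = os.path.join(tempfile.mkdtemp(), "received.bin")
with open(path, "ab") as f:
    for data in chunks:
        f.write(data)            # bytes go in unchanged: no decode()

with open(path, "rb") as f:
    written = f.read()

# For contrast: a file opened in binary mode refuses a str
try:
    with open(path, "ab") as f:
        f.write("text")
    rejected = False
except TypeError:
    rejected = True

print(written, rejected)
```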
