Codec error while reading a file in Python – 'charmap' codec can't decode byte 0x81 in position 3124: character maps to <undefined>

Question

I am working on a machine learning project which filters spam/phishing emails out of all emails. For this, I am using the SpamAssassin dataset. The dataset contains different mails in this format: For identifying phishing emails, the first thing I have to do is find out how many web links the email has. For do…

Accepted Answer

You have to open and read the file using the same encoding that was used to write the file. In this case, that might be a bit difficult, since you are dealing with e-mails and they can be in any encoding, depending on the sender. In the example file you showed, the message is encoded using 'iso-8859-1' encoding.

However, e-mails are a bit strange, since they consist of a header (which is in ASCII format as far as I know), followed by an empty line and the body. The body is encoded in the encoding that was specified in the header. So two different encodings could be used in the same file!

If you're sure that all the e-mails use iso-8859-1 encoding and you're looking for a quick-and-dirty solution, then you could also just open the file using 'iso-8859-1' encoding, since e-mail headers are compatible with iso-8859-1. However, be prepared to deal with other e-mail formatting/encoding/escaping issues as well, or your script might not work completely as expected.

I think the best solution would be to look for a Python module that can handle e-mails, so that it deals with all the decoding and you don't have to worry about it. It will also solve other problems such as escape characters and line breaks.

I don't have experience with this myself, but it seems that Python has built-in support for parsing e-mails through the email package. I recommend taking a look at that.
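As a rough illustration of the email-package approach, the sketch below reads a message as raw bytes (so no codec is applied while reading), lets the standard-library parser decode each text part using the charset declared in its headers, and counts the links it finds. The names `count_links`, `count_links_in_file`, and `URL_RE` are made up for this example, and the URL pattern is a deliberately simple assumption rather than a complete link matcher.

```python
import re
from email import message_from_bytes, policy

# Simplistic pattern for http/https links (an assumption for illustration only).
URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def count_links(raw: bytes) -> int:
    """Count http(s) links in all text parts of a raw e-mail message."""
    msg = message_from_bytes(raw, policy=policy.default)
    total = 0
    for part in msg.walk():
        # Skip multipart containers, attachments, images, and so on.
        if part.get_content_maintype() != "text":
            continue
        try:
            # Decodes the transfer encoding and the declared charset.
            text = part.get_content()
        except (LookupError, UnicodeError):
            # Unknown or broken charset: fall back to iso-8859-1,
            # which can decode any byte sequence.
            payload = part.get_payload(decode=True) or b""
            text = payload.decode("iso-8859-1")
        total += len(URL_RE.findall(text))
    return total


def count_links_in_file(path: str) -> int:
    """Read the file in binary mode so that reading never raises a decode error."""
    with open(path, "rb") as handle:
        return count_links(handle.read())
```

Calling `count_links_in_file` on each file of the dataset would give one link count per message. The quick-and-dirty alternative mentioned in the answer would instead be `open(path, encoding="iso-8859-1")`, which avoids the 'charmap' error but leaves quoted-printable or base64 bodies undecoded.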

Answer