How to completely sanitize a string of illegal characters in Python?

My program has a feature where the user can upload a CSV file, which the program reads and uses as input. One user is complaining that his input throws an error. The error is caused by an illegal character that is encoded incorrectly. The character is below:

Sometimes it appears as a diamond with a “?” in the middle, sometimes as a double diamond with “?” in the middle, sometimes as “\xa0”, and sometimes as “\xa0\xa0”.

In my program if I do:

The string shows up in my terminal with the diamond “?” in place of the weird character. If I copy+paste that string into IPython, it exits with this message:

Notice how the diamond “?” is doubled now. For some reason copy+paste doubles it…

In the Django traceback page, it looks like this:

What trips me up is that I can’t do anything with this string without it throwing an exception. I tried unicode(), str(), .encode(), and .encode("utf-8"); no matter what, it throws an error.

What can I do to get this thing to be a working string?

Answer

You can pass "ignore" as the error-handling argument to .encode/.decode to skip invalid characters, like "ILLEGAL".decode("utf8", "ignore")
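
The call above is Python 2 style, where str has a .decode method. As a rough sketch of the same idea in Python 3, where the uploaded data would arrive as bytes: the sample bytes and the helper name `sanitize` below are made up for illustration, and the Latin-1 guess at the end is only an assumption about where the stray 0xA0 bytes might come from.

```python
# Hypothetical sample: bytes from an uploaded CSV containing two stray 0xA0 bytes,
# which are not valid on their own in UTF-8.
raw = b"price:\xa0\xa042"


def sanitize(data: bytes, errors: str = "ignore") -> str:
    """Decode bytes as UTF-8, handling invalid bytes according to `errors`."""
    return data.decode("utf-8", errors)


# "ignore" silently drops every byte that cannot be decoded.
dropped = sanitize(raw)

# "replace" keeps a visible marker (U+FFFD, the diamond with "?") per bad byte.
marked = sanitize(raw, "replace")

# Assumption: if the file was actually saved as Latin-1, 0xA0 is a
# non-breaking space, so decoding as Latin-1 keeps it and it can be
# turned into a normal space instead of being thrown away.
as_spaces = raw.decode("latin-1").replace("\u00a0", " ")

print(repr(dropped))
print(repr(marked))
print(repr(as_spaces))
```

"ignore" is the quickest fix but loses data without a trace; "replace" at least leaves a marker showing where something was dropped.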

User contributions licensed under: CC BY-SA