unicode_escape without deprecation warning

In Python 3.8, the following:

import codecs

codecs.decode("hello\\,world", "unicode_escape")

…produces a deprecation warning:

<ipython-input-7-28f185d30178>:1: DeprecationWarning: invalid escape sequence '\,'
  codecs.decode("hello\\,", "unicode_escape", errors="strict")

  1. Is this going to become an error in a future version of Python? Where is the reference for this?
  2. If not, is there a way to suppress this warning? If so, what can I use instead?

Edit: I am (obviously) not interested in fixing this for a string I would have written literally in my Python script. This is purely an example; in my use case the strings come from external files, I cannot change them, and some contain invalid escape sequences such as this one.

Answer

The problem is that you’ve got two layers of de-escaping occurring here. The first one is the regular string literal escaping, converting "hello\\,world" to a string with raw characters of hello\,world. Then codecs.decode tries to decode it with unicode_escape, which sees it as an attempt to escape , with \, which is an invalid escape.

The fix for your code as written is to use a raw string so the first level of escaping doesn’t happen:

codecs.decode(r"hello\\,world", "unicode_escape")
            # ^ Now it's a raw string, and both backslashes are in fact backslashes

If your data comes from elsewhere with invalid escapes, you can suppress the warning for now (see the warnings module for details), but it will eventually cause an exception in a future version of Python, so the long term solution is “Don’t provide invalid data”. Sorry that’s not super-helpful.
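As a rough sketch of the suppression option, the warning can be silenced just around the decode call with the standard warnings module. The helper name decode_quietly is made up for illustration, and the filter assumes the warning is still a DeprecationWarning, as it is in the question:

```python
import codecs
import warnings


def decode_quietly(text):
    """Decode unicode_escape sequences, hiding the invalid-escape warning."""
    with warnings.catch_warnings():
        # Ignore DeprecationWarning only inside this block; the previous
        # warning filters are restored when the block exits.
        warnings.simplefilter("ignore", DeprecationWarning)
        return codecs.decode(text, "unicode_escape")


# The invalid escape keeps its backslash; valid escapes are still decoded.
print(decode_quietly(r"hello\,world"))
```

Scoping the filter to one call keeps other deprecation warnings in the program visible. It only hides the message, so it stops helping if a later Python version turns the warning into an exception.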

The reason you get this warning is that, historically, Python has been kind of lax about spurious escapes. So if you wrote, say, a Windows path of "C:\yes" it said “hey, \y doesn’t mean anything, so we’ll just assume they wanted a literal backslash”, while an equivalent path of "C:\no" saw \n and thought “Yup, they want a newline there”.

This taught bad habits (not using raw strings when you should, because it usually worked without them) and created confusion when those habits bit you (why isn’t it working this time!?!). So in the future, escapes like \y will be treated as errors, so that you end up writing r'C:\yes' and are used to doing so so you don’t get bit by r'C:\no'. The warning is reminding you that this is bad code, and will eventually stop working (for good reason; as your own comment notes, you’re okay with it ending up with either no backslashes or two backslashes, starting with just one, which is an insane set of options to accept without knowing the single, correct, desired result).
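The asymmetry can be seen without writing an invalid literal, by feeding raw strings to the unicode_escape codec, which as far as I know treats a backslash-letter pair the same way a string literal does. A small sketch, with variable names chosen only for illustration:

```python
import codecs
import warnings

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is filtered out
    known = codecs.decode(r"C:\no", "unicode_escape")     # \n is a real escape
    unknown = codecs.decode(r"C:\yes", "unicode_escape")  # \y is not

print(repr(known))    # the backslash and n were turned into a newline
print(repr(unknown))  # the backslash was kept as a literal character
for w in caught:
    print(w.category.__name__, w.message)  # whatever was warned about
```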


Alternative solution

If your goal is to “fix” bad strings, the best solution is probably to just write your own simple stripping regex, e.g.:

import re

bad_escape_re = re.compile(r'\\(?=[^\n\\\'"abfnrtv0-7xNuU])')

and then use it to strip unrecognized escapes a la:

good_string = bad_escape_re.sub('', bad_string)

which when run like so:

bad_escape_re.sub('', r'\a\b\c\d\e\f\g\,\.\n\t\x12')

produces a string with the repr '\\a\\bcde\\fg,.\\n\\t\\x12'. Note that it’s not perfect, and if you need it to validate the extended escapes to distinguish valid uses from invalid ones ([0-7], x, N, u and U), it gets more complicated, but those are also cases where it’s invariably heuristic and there is no good solution; without a human to interpret, \xab is legal and \xag is not, but it’s entirely likely the former wasn’t intended as an escape either.
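Putting the two steps together might look something like the sketch below. The pattern is written out again so the block runs on its own, and clean_and_decode is a hypothetical helper name. The warning filter is set to "error" only to demonstrate that the cleaned string no longer triggers the deprecation warning:

```python
import codecs
import re
import warnings

# A backslash followed by a character that does not start a recognized escape.
bad_escape_re = re.compile(r'\\(?=[^\n\\\'"abfnrtv0-7xNuU])')


def clean_and_decode(text):
    """Strip unrecognized escapes, then decode the ones that remain."""
    return codecs.decode(bad_escape_re.sub('', text), "unicode_escape")


with warnings.catch_warnings():
    # Turn the deprecation warning into an exception, so a leftover
    # invalid escape would fail loudly instead of passing silently.
    warnings.simplefilter("error", DeprecationWarning)
    result = clean_and_decode(r"hello\,world\tdone")

print(repr(result))
```

Here the stray backslash before the comma is dropped, while \t is still decoded to a tab. The same caveats as above apply: the pattern does not check that x, N, u, U or octal escapes are well formed.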



Source: stackoverflow