In a text file, there is a string “I don’t like this”.
However, when I read it into a string, it becomes “I don\xe2\x80\x98t like this”. I understand that \u2018 is the Unicode representation of “‘”. I use
f1 = open(file1, "r")
text = f1.read()
command to do the reading.
Now, is it possible to read the file in such a way that the resulting string is “I don’t like this”, instead of “I don\xe2\x80\x98t like this”?
Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to Unicode (and vice versa) conversion?
Answer
Ref: http://docs.python.org/howto/unicode
Reading Unicode from a file is therefore simple:
import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)
It’s also possible to open files in update mode, allowing both reading and writing:
with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])
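The snippets above are Python 2. As a minimal sketch of the same idea in Python 3 (an assumption, since the question does not state a version), the built-in open() accepts an encoding argument, so codecs is not needed. The helper name read_text and the temporary demo file are hypothetical and exist only for illustration.

```python
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Read a whole file and decode its bytes with the given encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo only: write a sample file, then read it back both ways.
sample = "I don\u2018t like this"
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write(sample)

text = read_text(demo_path)      # decoded text: one character for the quote
with open(demo_path, "rb") as f:
    raw = f.read()               # undecoded bytes: three bytes for the quote
os.remove(demo_path)

print(repr(raw))    # the \xe2\x80\x98 bytes from the question show up here
print(ascii(text))  # the same content as a single \u2018 code point
```

The three bytes \xe2\x80\x98 are simply the UTF-8 encoding of \u2018, so seeing them in a string usually means the file was read without being decoded as UTF-8.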
EDIT: I’m assuming that your intended goal is just to be able to read the file properly into a string in Python. If you’re trying to convert to an ASCII string from Unicode, then there’s really no direct way to do so, since the Unicode characters won’t necessarily exist in ASCII.
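To illustrate why there is no direct conversion, here is a small Python 3 sketch (the version is an assumption) showing what happens when a string containing \u2018 is encoded to ASCII with the standard error handlers. The sample string is taken from the question.

```python
text = "I don\u2018t like this"

# Strict encoding (the default) fails: \u2018 has no ASCII counterpart.
error = None
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    print("strict encoding failed:", exc.reason)

# The 'replace' and 'ignore' handlers succeed, but both lose information.
replaced = text.encode("ascii", "replace")
ignored = text.encode("ascii", "ignore")
print(replaced)
print(ignored)
```

Neither handler yields an apostrophe, which is why an explicit replacement or normalization step is needed.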
If you’re trying to convert to an ASCII string, try one of the following:
1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example
2. Use the unicodedata module’s normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):
>>> teststr
u'I don\xe2\x80\x98t like this'
>>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
'I donat like this'
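The two options can be combined. The following is a rough Python 3 sketch, not a definitive implementation: PUNCTUATION_MAP and to_ascii are hypothetical names, and the mapping covers only a few typographic quotes chosen for illustration. It assumes the text was decoded correctly first, so the quote arrives as \u2018 rather than as three stray bytes.

```python
import unicodedata

# Hypothetical mapping of a few typographic characters to ASCII stand-ins.
PUNCTUATION_MAP = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii(text):
    """Best-effort ASCII conversion: map known punctuation, drop the rest."""
    mapped = text.translate(PUNCTUATION_MAP)
    # NFKD splits accented letters into a base letter plus combining marks,
    # so encoding with 'ignore' keeps the base letter and drops the marks.
    decomposed = unicodedata.normalize("NFKD", mapped)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("I don\u2018t like this"))
print(to_ascii("caf\u00e9"))
```

Mapping the quotes before normalizing matters here: NFKD does not decompose \u2018 into an ASCII apostrophe, so without the mapping the character would simply be dropped.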