How can I convince Python 3 to treat my text file as UTF-8?

I need to search for and replace a particular character in a few .php files in a local directory, on Windows.

I tried one of the examples given as an answer to the question "Do a search-and-replace across all files in a folder through python?", which in my case means this:

While running it, I get this error:

Indeed, the 8-bit Microsoft code page CP1250 leaves 0x83 undefined, and the byte at absolute position 0x0571 (i.e. 1393 decimal) of the file where this error occurs is 0x83. In this case that byte is actually the second byte of the UTF-8 encoding of the character ă, whose complete UTF-8 byte sequence is 0xC4 0x83.
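The mismatch described above can be reproduced without any file at all. The following is only a small sketch: it encodes ă as UTF-8 and then tries to decode the same two bytes as CP1250, which is presumably what happens when the file is opened with the Windows default encoding. The variable names are made up for illustration.

```python
# "\u0103" is the character ă (LATIN SMALL LETTER A WITH BREVE).
data = "\u0103".encode("utf-8")  # the two bytes 0xC4 0x83

try:
    # CP1250 maps 0xC4 to a character but has no mapping for 0x83,
    # so decoding fails on the second byte.
    data.decode("cp1250")
    error = None
except UnicodeDecodeError as exc:
    error = exc

if error is not None:
    print("undecodable byte:", hex(data[error.start]), "at offset", error.start)

# Decoding the same bytes as UTF-8 gives the original character back.
print(data.decode("utf-8"))
```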

Questions:

● Why does Python 3 try to read a text file in some 8-bit code page instead of reading it directly as Unicode?

● What can I do to force the file to be read as UTF-8?

Answer

Add the encoding option to the open function.
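As a rough sketch of what that could look like for the search-and-replace task in the question: the function name `replace_in_files`, its parameters, and the `*.php` pattern are assumptions made for illustration, not the code from the question. The `newline=""` argument is an extra choice made here so that the files' existing line endings are left untouched.

```python
from pathlib import Path


def replace_in_files(directory, old, new, pattern="*.php"):
    """Replace `old` with `new` in every matching file; return the changed paths."""
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        # encoding="utf-8" overrides the locale default (e.g. cp1250 on Windows);
        # newline="" disables newline translation on both read and write.
        with open(path, "r", encoding="utf-8", newline="") as f:
            text = f.read()
        if old in text:
            with open(path, "w", encoding="utf-8", newline="") as f:
                f.write(text.replace(old, new))
            changed.append(path)
    return changed
```

Calling, say, `replace_in_files(".", "ă", "a")` would rewrite only the files that actually contain the character. A file that is not valid UTF-8 would still raise UnicodeDecodeError, which is arguably the right outcome, since silently rewriting it could corrupt it.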

Further details about the open builtin: https://docs.python.org/3/library/functions.html?highlight=open#open

Currently, Python opens text files with the local system encoding (on Windows, an 8-bit code page such as CP1250) unless UTF-8 mode is enabled. A PEP has been proposed (not accepted as of this writing) to change the default to UTF-8, but even if it is accepted it is a few Python versions away, so it is best to be explicit in the code.
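To see which default applies on a given machine, something along these lines can help. It is only a diagnostic sketch, and the variable names are illustrative; `sys.flags.utf8_mode` requires Python 3.7 or later.

```python
import locale
import sys

# The encoding open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)

# True when Python was started in UTF-8 mode (-X utf8 or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", utf8_mode)
```

UTF-8 mode can be switched on by starting the interpreter with `-X utf8` or by setting the environment variable `PYTHONUTF8=1`, but passing `encoding="utf-8"` explicitly keeps the script independent of how it is launched.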

User contributions licensed under: CC BY-SA