How can I convince Python 3 to treat my text file as UTF-8?

I need to search and replace a particular character in a few .php files in a local directory, on Windows.

I tried one of the examples given as an answer to the question "Do a search-and-replace across all files in a folder through python?", which in my case looks like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import re

_replace_re = re.compile("ş")

for dirpath, dirnames, filenames in os.walk("./something/"):

    for file in filenames:

        if file.endswith(".php"):
            file = os.path.join(dirpath, file)
            tempfile = file + ".temp"
            with open(tempfile, "w") as target:
                with open(file) as source:

                    for line in source:
                        line = _replace_re.sub("ș", line)
                        target.write(line)

            os.remove(file)
            os.rename(tempfile, file)

While running it, I get this error:

Traceback (most recent call last):
  File "[unimportant_path]replace.py", line 19, in <module>
    for line in source:
  File "C:\python31.064\lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 1393: character maps to <undefined>

Indeed, the 8-bit Microsoft code page CP1250 is undefined at 0x83, and the file contains the byte 0x83 at absolute position 0x0571 (1393 decimal), which is where the error occurs. In this case that byte is part of the UTF-8 encoding of the character ă (whose complete UTF-8 byte sequence is 0xC4 0x83).
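
The mismatch described above can be reproduced without any file at all. This is only a small sketch: it decodes the two bytes mentioned in the question once as UTF-8 and once as CP1250, and the variable names are mine, chosen for illustration.

```python
# The two bytes that encode "ă" (U+0103) in UTF-8, as found in the file.
raw = b"\xc4\x83"

# Decoding with the encoding the file was actually written in works.
as_utf8 = raw.decode("utf-8")

# Decoding with CP1250 fails, because 0x83 has no character assigned there.
try:
    raw.decode("cp1250")
    cp1250_failed = False
except UnicodeDecodeError as exc:
    cp1250_failed = True
    print("cp1250 could not decode the bytes:", exc)

print(repr(as_utf8), cp1250_failed)
```

The exception raised here is the same UnicodeDecodeError as in the traceback, which suggests the file itself is fine and only the codec chosen by open is wrong.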

Questions:

● Why does Python 3 try to read a text file in some 8-bit code page instead of reading it directly as Unicode?

● What can I do to force the file to be read as UTF-8?

Answer

Add the encoding argument to both open calls, the one that reads the source file and the one that writes the temporary file:

        with open(tempfile, "w", encoding="utf-8") as target:
            with open(file, encoding="utf-8") as source:
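
For context, here is how the whole script from the question might look with that change applied. Treat it as a sketch rather than a definitive version: the function name replace_in_php_files and its parameters are my own, newline="" is an extra I added so that existing line endings are written back unchanged, and os.replace requires Python 3.3 or later (on older versions, keep the os.remove plus os.rename pair from the question).

```python
import os

OLD = "ş"  # U+015F, s with cedilla
NEW = "ș"  # U+0219, s with comma below


def replace_in_php_files(root, old=OLD, new=NEW, encoding="utf-8"):
    """Replace `old` with `new` in every .php file under `root`.

    Returns the list of files that were rewritten.
    """
    processed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".php"):
                continue
            path = os.path.join(dirpath, name)
            temp_path = path + ".temp"
            # Explicit encoding on BOTH handles; newline="" disables
            # newline translation so \r\n stays \r\n.
            with open(path, encoding=encoding, newline="") as source, \
                 open(temp_path, "w", encoding=encoding, newline="") as target:
                for line in source:
                    target.write(line.replace(old, new))
            # Swap the rewritten copy into place (Python 3.3+).
            os.replace(temp_path, path)
            processed.append(path)
    return processed


if __name__ == "__main__":
    for rewritten in replace_in_php_files("./something/"):
        print("rewritten:", rewritten)
```

A plain str.replace is used instead of a compiled regular expression because the pattern is a single literal character; the re version from the question would work just as well.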

Further details about the open builtin: https://docs.python.org/3/library/functions.html?highlight=open#open

Currently, Python uses the local system encoding unless UTF-8 mode is enabled. A PEP has been proposed (not accepted as of this writing) to change the default to UTF-8, but even if it is accepted the change is a few Python versions away, so it is best to be explicit in the code.
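
To see which encoding open would pick on a given machine when none is specified, something along these lines should work. It is a rough check, assuming Python 3.7 or later (where sys.flags.utf8_mode exists); the variable names are illustrative.

```python
import locale
import sys

# Encoding used by open() for text files when no encoding is given.
default_encoding = locale.getpreferredencoding(False)

# True when UTF-8 mode is on (e.g. via `python -X utf8` or PYTHONUTF8=1).
utf8_mode = bool(sys.flags.utf8_mode)

print("default text encoding:", default_encoding)
print("UTF-8 mode enabled:", utf8_mode)
```

On the Windows setup from the question this would presumably report cp1250, which is why the UTF-8 bytes were fed to the wrong codec.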
