I wanted to convert my text file into a CSV file; however, my output is very different from what I expected. Below are the examples:
text.txt (Encoding is “UTF-8”)
text =
-0.00010712468871868001 gram_0:Coll:0::ん -0.00010712468871868001 gram-1:Coll:-1::止まる -0.00010712468871868001 gram-3:Coll:-3::帰る -0.00010712468871868001 gram1:Coll:0::ん -0.00010712468871868001 gram2:Coll:2::いく -0.00010712468871868001 gram3:Coll:3::く
My code:
import csv

with open('text.txt', 'r', encoding="utf-8") as in_file:
    stripped = (line.strip() for line in in_file)
    lines = (line.split(",") for line in stripped if line)
    with open('log.csv', 'w', encoding="utf-8") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(('title', 'intro'))
        writer.writerows(lines)
Output:
My expected output:
It seems like I am getting a lot of "……" in place of the Japanese characters. Could anyone please assist me with this?
Answer
Windows uses the BOM to determine the encoding of a text file, but Python does not write a BOM automatically, so Windows may treat the output file as ANSI. Try adding out_file.write('\ufeff') immediately after the inner with.
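Equivalently, you can let Python write the BOM for you by opening the output file with encoding="utf-8-sig". A minimal sketch (the sample rows here are made up for illustration; substitute your own parsed lines):

```python
import csv

# encoding="utf-8-sig" prepends the UTF-8 BOM (EF BB BF) when writing,
# so Excel/Notepad on Windows detect the file as UTF-8 and render
# Japanese text correctly. newline='' is the csv module's recommended
# setting to avoid blank rows on Windows.
rows = [('title', 'intro'),
        ('gram_0:Coll:0::ん', '-0.00010712468871868001')]

with open('log.csv', 'w', encoding='utf-8-sig', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerows(rows)

# Verify the file starts with the UTF-8 BOM:
with open('log.csv', 'rb') as f:
    assert f.read(3) == b'\xef\xbb\xbf'
```

Note that utf-8-sig also strips the BOM transparently when reading the file back, so round-tripping in Python is unaffected.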
Source: Adding BOM (unicode signature) while saving file in python