I have a data set of German customer reviews that I would like to run a topic model on. I am using Jupyter Notebook with Python 3.7.10 on Windows 10. However, I am struggling to find the proper encoding.
import sys
print(sys.getdefaultencoding())
prints utf-8
I have tried:
with open('C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv') as f: print(f)
which yields
<_io.TextIOWrapper name='C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv' mode='r' encoding='cp1252'>
To demonstrate the problem,
s = df['text'].iloc[0]
print(str(s.encode('cp1252', "ignore"), 'utf-8'))
produces
*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*
and
print(s.encode('cp1252').decode('utf8'))
produces
*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*
while
print(s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8', 'ignore'))
produces
UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 34: character maps to <undefined>
I also tried
%pip install ftfy
import ftfy
print(ftfy.fix_text(s))
which got me
*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*
In this case,
s.encode('ISO-8859-1')
yields
UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 34: ordinal not in range(256)
A sample of the data set can be found here: https://docs.google.com/spreadsheets/d/1yR-cDo0asRetgcdlGrCtWHv2xQQwFVe9sk0avrByBvU/edit?usp=sharing
Answer
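\ufffd (U+FFFD) is the Unicode Replacement Character. It is substituted for a character that could not be decoded or converted, so it marks a spot where the original character has already been lost.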
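As a rough illustration of how such a character can end up in the text, here is a minimal sketch. It assumes the reviews were originally stored as cp1252 and were at some point decoded as UTF-8 with invalid bytes replaced; the question does not say how the file was actually produced, and `damage` and the word "Gösten" are made up for the demonstration.

```python
def damage(text: str) -> str:
    """Encode as cp1252, then decode as UTF-8, replacing undecodable bytes.

    Assumption: this mimics what happened upstream; the real cause is unknown.
    """
    return text.encode("cp1252").decode("utf-8", errors="replace")


damaged_a = damage("Gästen")
damaged_o = damage("Gösten")  # made-up word, used only for comparison

# Both umlauts turn into the same U+FFFD placeholder.
print(damaged_a == "G\ufffdsten")  # True
print(damaged_a == damaged_o)      # True
```

Since two different inputs collapse to the same output, no sequence of encode and decode calls can tell which letter was there originally.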
Your data is already FUBAR. There is nothing you can do during decoding to fix it. Fix your data first.
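One way to check where the damage sits is to look at the raw bytes of the file rather than the decoded text. The sketch below is only a heuristic under the assumption that the CSV is stored as UTF-8: if the byte sequence for U+FFFD is already in the file, the replacement happened before the file was written. The function name `count_baked_in_replacements` is hypothetical.

```python
# UTF-8 encoding of U+FFFD (the bytes EF BF BD).
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")


def count_baked_in_replacements(raw: bytes) -> int:
    """Count replacement characters that are stored in the bytes themselves."""
    return raw.count(REPLACEMENT_BYTES)


# Stand-ins for file contents; a real file would be read in binary mode.
damaged_file = "G\ufffdsten".encode("utf-8")  # damage is part of the data
intact_file = "Gästen".encode("cp1252")       # umlaut is still present

print(count_baked_in_replacements(damaged_file))  # 1
print(count_baked_in_replacements(intact_file))   # 0
```

For the actual CSV, the bytes would come from opening the file with mode 'rb' and calling read(). A count above zero suggests going back to whatever produced the file; a count of zero suggests the file itself may still be intact and only the way it was read needs to change.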
(Well, you can take a guess by comparing the words you have with words from a German dictionary. But I wouldn’t recommend it, because it’s a lot of work and not guaranteed to yield correct results.)
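For completeness, the dictionary guess could look roughly like the sketch below. It assumes that only German umlauts and ß were lost and that a word list is available; `guess_word`, `CANDIDATES`, and the tiny `vocabulary` set are hypothetical placeholders, not part of any library.

```python
from itertools import product

REPLACEMENT = "\ufffd"
CANDIDATES = "äöüÄÖÜß"  # assumption: only these characters were lost


def guess_word(damaged: str, vocabulary: set) -> list:
    """Return every vocabulary word that fits the damaged word."""
    parts = damaged.split(REPLACEMENT)
    gaps = len(parts) - 1
    matches = []
    # Try every combination of candidate letters for the gaps.
    for fill in product(CANDIDATES, repeat=gaps):
        word = parts[0] + "".join(c + p for c, p in zip(fill, parts[1:]))
        if word in vocabulary:
            matches.append(word)
    return matches


vocabulary = {"Gästen", "fällen", "füllen"}  # hypothetical tiny word list

print(guess_word("G\ufffdsten", vocabulary))   # ['Gästen']
print(guess_word("f\ufffdllen", vocabulary))   # ['fällen', 'füllen']
```

The second call shows the weakness of this approach: more than one word can fit, and the code has no way of knowing which one the reviewer wrote.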