
Encoding German Characters in Jupyter Notebook [closed]

I have a data set of German customer reviews that I would like to run a topic model on. I am using Jupyter Notebook with Python 3.7.10 on Windows 10. However, I am struggling to find the proper encoding.

import sys
print(sys.getdefaultencoding())

prints utf-8

I have tried:

with open('C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv') as f:
    print(f)

which yields

<_io.TextIOWrapper name='C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv' mode='r' encoding='cp1252'>

To demonstrate the problem,

s = df['text'].iloc[0]
print(str(s.encode('cp1252',"ignore"),'utf-8'))

produces

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

and

print(s.encode('cp1252').decode('utf8'))

produces

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

while

print(s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8', 'ignore'))

produces

UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 34: character maps to <undefined>

I also tried

%pip install ftfy
import ftfy
print(ftfy.fix_text(s))

which got me

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

In this case,

s.encode('ISO-8859-1')

yields

UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 34: ordinal not in range(256)

A sample of the data set can be found here: https://docs.google.com/spreadsheets/d/1yR-cDo0asRetgcdlGrCtWHv2xQQwFVe9sk0avrByBvU/edit?usp=sharing


Answer

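\ufffd (U+FFFD) is the Unicode Replacement Character. A decoder inserts it in place of bytes that are not valid in the encoding it was told to use, so the original byte is already gone by the time you see it in your string.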
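To illustrate how such a character can end up in a string, here is a minimal sketch. It assumes the text was originally stored as cp1252 and at some point decoded as UTF-8 with replacement enabled; that is only one plausible history for this data set, not something the question confirms. The variable names are made up for the example.

```python
# 'ä' in cp1252 is the single byte 0xE4.
raw = "Gästen".encode("cp1252")

# 0xE4 followed by 's' is not a valid UTF-8 sequence, so a lenient
# decoder substitutes U+FFFD for the offending byte.
damaged = raw.decode("utf-8", errors="replace")
print(damaged)

# A different original letter ends up as exactly the same damaged string,
# which is why the original cannot be recovered from the result.
same_damage = "Gösten".encode("cp1252").decode("utf-8", errors="replace")
print(damaged == same_damage)

# Encoding the damaged string back to cp1252 fails, as in the question,
# because U+FFFD has no cp1252 representation.
try:
    damaged.encode("cp1252")
    reencode_failed = False
except UnicodeEncodeError:
    reencode_failed = True
print(reencode_failed)
```

Because both "ä" and "ö" collapse to the same replacement character, no chain of encode/decode calls on the damaged string can tell them apart.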

Your data is already FUBAR. There is nothing you can do during decoding to fix it. Fix your data first.
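If an undamaged copy of the source file still exists, "fixing the data" could mean re-reading that copy with an explicit encoding instead of the platform default. The following is a rough sketch under that assumption; `read_text_strict`, the candidate encoding list, and the temporary sample file are all hypothetical stand-ins, not part of the original question.

```python
import tempfile
from pathlib import Path

def read_text_strict(path, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes without errors.

    UTF-8 is tried first because cp1252 accepts almost any byte sequence
    and would otherwise mask a UTF-8 file.
    """
    data = Path(path).read_bytes()
    for enc in encodings:
        try:
            return data.decode(enc), enc  # strict decoding, no replacement
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")

# Demo with a throwaway file standing in for the real CSV.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("text\nUmgang mit den Gästen\n".encode("cp1252"))
    text, used = read_text_strict(sample)

print(used)
print(text)
```

Once an encoding decodes cleanly, the same name can be passed to `open(path, encoding=...)` or to the `encoding` argument of `pandas.read_csv`. If the file on disk already contains the replacement character itself, this will not help, and the data has to be exported again from wherever it came from.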

(Well, you can take a guess by comparing the words you have with words from a German dictionary. But I wouldn’t recommend it, because it’s a lot of work and not guaranteed to yield correct results.)
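For completeness, the dictionary guess could look roughly like the sketch below. The function name, the candidate letters, and the three-word vocabulary are assumptions made for illustration; a real attempt would need a full German word list and would still hit ambiguous cases.

```python
from itertools import product

# Letters that a lost cp1252 byte most plausibly stood for in German text.
UMLAUTS = "äöüÄÖÜß"

def candidates(word, vocabulary):
    """Return every vocabulary word obtainable by filling each U+FFFD in `word`."""
    slots = word.count("\ufffd")
    if slots == 0:
        return [word]  # nothing to repair
    found = []
    for combo in product(UMLAUTS, repeat=slots):
        guess = word
        for ch in combo:
            # Fill the replacement characters left to right, one at a time.
            guess = guess.replace("\ufffd", ch, 1)
        if guess in vocabulary:
            found.append(guess)
    return found

# Tiny hypothetical word list; a real one would be far larger.
vocabulary = {"Gästen", "fällen", "füllen"}

unique = candidates("G\ufffdsten", vocabulary)
ambiguous = candidates("f\ufffdllen", vocabulary)
print(unique)
print(ambiguous)
```

The second call returns two valid words, which is the case where the guess cannot be trusted without looking at the surrounding sentence.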

User contributions licensed under: CC BY-SA