Encoding German Character in Jupyter Notebook [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.

Closed 8 months ago.

I have a data set of German customer reviews that I would like to fit a topic model on. I am using Jupyter Notebook with Python 3.7.10 on Windows 10. However, I am struggling to find the proper encoding.

import sys
print(sys.getdefaultencoding())

prints utf-8

I have tried:

with open('C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv') as f:
    print(f)

which yields

<_io.TextIOWrapper name='C:/Users/TinnerF/Dropbox/ResearchProjects/RawData/LIWC_Output.csv' mode='r' encoding='cp1252'>

To demonstrate the problem,

s = df['text'].iloc[0]
print(str(s.encode('cp1252',"ignore"),'utf-8'))

produces

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

and

print(s.encode('cp1252').decode('utf8'))

produces

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

while

print(s.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8', 'ignore'))

produces

UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 34: character maps to <undefined>

I also tried

%pip install ftfy
import ftfy
print(ftfy.fix_text(s))

which got me

*Sehr freundlicher Umgang mit den G�sten, kompetentes Personal, alle Reklamationen sofort zufriedenstellend erledigt, hervorragendes Essen*

In this case,

s.encode('ISO-8859-1')

yields

UnicodeEncodeError: 'latin-1' codec can't encode character '\ufffd' in position 34: ordinal not in range(256)

A sample of the data set can be found here: https://docs.google.com/spreadsheets/d/1yR-cDo0asRetgcdlGrCtWHv2xQQwFVe9sk0avrByBvU/edit?usp=sharing

Answer

\ufffd is U+FFFD, the Unicode Replacement Character. It is inserted in place of input that could not be decoded into a valid character.
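To see why the character cannot be undone, here is a minimal sketch. It assumes the damaged word was originally "Gästen" and that the damage came from decoding cp1252 bytes as UTF-8 with errors="replace"; neither is confirmed by the question. The second word, "Güsten", is only there for contrast.

```python
REPLACEMENT = "\ufffd"

# Two different words, both written out as cp1252 bytes (assumed scenario).
raw_a = "Gästen".encode("cp1252")   # b'G\xe4sten'
raw_u = "Güsten".encode("cp1252")   # b'G\xfcsten'

# Decoding as UTF-8 with errors="replace" swaps each invalid byte for U+FFFD.
damaged_a = raw_a.decode("utf-8", errors="replace")
damaged_u = raw_u.decode("utf-8", errors="replace")

# Both words collapse to the same string, so the original letter is gone.
print(damaged_a == damaged_u)

# cp1252 has no byte for U+FFFD, which is why re-encoding fails.
try:
    damaged_a.encode("cp1252")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)
```

Because two distinct inputs map to the same damaged string, no chain of encode/decode calls can tell them apart afterwards.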

Your data is already FUBAR. There is nothing you can do during decoding to fix it. Fix your data first.
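In practice, fixing the data usually means going back to the file as it was originally exported and reading it with the encoding it was actually written in. The sketch below assumes such an undamaged file still exists and that it is cp1252; the helper name read_checked and the temporary sample file are hypothetical stand-ins, not part of the question. It decodes strictly and rejects text that already contains U+FFFD.

```python
import os
import tempfile

REPLACEMENT = "\ufffd"

def read_checked(path, encoding):
    """Read a text file strictly and fail if it already holds U+FFFD."""
    # errors="strict" raises UnicodeDecodeError instead of silently inserting U+FFFD.
    with open(path, encoding=encoding, errors="strict", newline="") as f:
        text = f.read()
    if REPLACEMENT in text:
        raise ValueError(f"{path} already contains U+FFFD; it was damaged upstream")
    return text

# Stand-in for an undamaged export: a header and one review written as cp1252 bytes.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("text\nSehr freundlicher Umgang mit den Gästen\n".encode("cp1252"))
    sample_path = tmp.name

good = read_checked(sample_path, "cp1252")
os.remove(sample_path)
print(good)
```

If read_checked raises ValueError on the real file, the damage happened before the file was written, and the source it was exported from is the place to look.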

(Well, you can take a guess by comparing the words you have with words from a German dictionary. But I wouldn’t recommend it, because it’s a lot of work and not guaranteed to yield correct results.)
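For completeness, a rough sketch of what such a dictionary guess could look like. The candidate letters, the function name guess_word, and the tiny vocabulary are all assumptions made for illustration; a real attempt would need a full German word list.

```python
import itertools

REPLACEMENT = "\ufffd"
CANDIDATES = "äöüÄÖÜß"  # assumed: the non-ASCII letters German text most likely lost

def guess_word(word, vocabulary):
    """Return every vocabulary word reachable by filling each U+FFFD in `word`."""
    parts = word.split(REPLACEMENT)
    matches = []
    # One candidate letter per gap; with no gaps this tries the word unchanged.
    for fill in itertools.product(CANDIDATES, repeat=len(parts) - 1):
        candidate = parts[0] + "".join(f + p for f, p in zip(fill, parts[1:]))
        if candidate in vocabulary:
            matches.append(candidate)
    return matches

# Hypothetical toy vocabulary.
vocabulary = {"Gästen", "Essen", "Personal", "fällen", "füllen"}

print(guess_word("G\ufffdsten", vocabulary))   # one match
print(guess_word("f\ufffdllen", vocabulary))   # two matches: the guess is ambiguous
```

The second call shows the weakness: when more than one word fits, the dictionary alone cannot say which one the reviewer wrote.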
