
Encoding German Characters in Jupyter Notebook [closed]

I have a data set of German customer reviews that I would like to run a topic model on. I am using Jupyter Notebook with Python 3.7.10 on Windows 10. However, I am struggling to find the proper encoding.

JavaScript

prints utf-8

I have tried:

JavaScript

which yields

JavaScript

To demonstrate the problem,

JavaScript

produces

JavaScript

and

JavaScript

produces

JavaScript

while

JavaScript

produces

JavaScript

I also tried

JavaScript

which got me

JavaScript

In this case,

JavaScript

yields

JavaScript

A sample of the data set can be found here: https://docs.google.com/spreadsheets/d/1yR-cDo0asRetgcdlGrCtWHv2xQQwFVe9sk0avrByBvU/edit?usp=sharing


Answer

\ufffd (U+FFFD) is the Unicode Replacement Character. A decoder substitutes it for bytes it cannot interpret, so seeing it means something went wrong in an earlier encoding or decoding step.
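To illustrate how the replacement character can end up in text, here is a minimal sketch. It assumes one plausible scenario, which the question does not confirm: text saved as Windows-1252 and later decoded as UTF-8 with undecodable bytes replaced. The helper name `lossy_decode` and the sample names are made up for the demonstration.

```python
def lossy_decode(text: str) -> str:
    """Encode as Windows-1252, then decode as UTF-8, replacing bad bytes.

    This mimics an assumed conversion mistake; it is not necessarily what
    happened to the data set in the question.
    """
    raw = text.encode("cp1252")
    # errors="replace" makes the decoder emit U+FFFD for undecodable bytes
    return raw.decode("utf-8", errors="replace")


damaged_u = lossy_decode("Müller")
damaged_o = lossy_decode("Möller")

print(ascii(damaged_u))
# Two different names collapse into the same damaged string,
# so the original letter cannot be recovered from the result alone.
print(damaged_u == damaged_o)
```

Once the replacement has happened, the byte that distinguished ü from ö is gone, which is why no choice of encoding at read time can bring it back.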

Your data is already FUBAR. There is nothing you can do during decoding to fix it. Fix your data first.
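One way to check whether the damage is really in the file itself, rather than introduced while reading it, is to look at the raw bytes. The sketch below assumes the file is UTF-8 (as the detection result in the question suggests) and counts occurrences of the UTF-8 byte sequence for U+FFFD. The function name and the file name in the comment are hypothetical.

```python
# UTF-8 encoding of U+FFFD, i.e. the bytes EF BF BD
REPLACEMENT_UTF8 = "\ufffd".encode("utf-8")


def count_replacement_bytes(raw: bytes) -> int:
    """Count how often the encoded replacement character occurs in raw bytes."""
    return raw.count(REPLACEMENT_UTF8)


# In-memory stand-in for file contents that were damaged before being saved
sample = "Gr\ufffd\ufffde".encode("utf-8")
print(count_replacement_bytes(sample))

# With a real file (hypothetical name), read it in binary mode first:
# with open("reviews.csv", "rb") as fh:
#     print(count_replacement_bytes(fh.read()))
```

A nonzero count means the replacement characters are stored in the file, so the fix has to happen upstream, for example by re-exporting the data from its original source.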

(Well, you can take a guess by comparing the words you have with words from a German dictionary. But I wouldn’t recommend it, because it’s a lot of work and not guaranteed to yield correct results.)
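The dictionary-based guess could be sketched roughly as follows. This is only an illustration under strong assumptions: each U+FFFD is taken to stand for exactly one German special letter, and `vocabulary` is a hypothetical word set that you would have to supply from a real German word list.

```python
from itertools import product

# Letters assumed to be the only possible originals behind a U+FFFD
CANDIDATES = "äöüßÄÖÜ"


def guess_repairs(word, vocabulary):
    """Return all vocabulary words obtainable by filling in each U+FFFD."""
    slots = word.count("\ufffd")
    if slots == 0:
        return [word] if word in vocabulary else []
    parts = word.split("\ufffd")
    matches = []
    # Try every combination of candidate letters for the damaged positions
    for combo in product(CANDIDATES, repeat=slots):
        candidate = parts[0] + "".join(c + p for c, p in zip(combo, parts[1:]))
        if candidate in vocabulary:
            matches.append(candidate)
    return matches


vocabulary = {"Größe", "Müller", "Möller"}  # hypothetical toy word list
print(guess_repairs("Gr\ufffd\ufffde", vocabulary))
print(guess_repairs("M\ufffdller", vocabulary))
```

The second call returns more than one match, which shows why this approach cannot guarantee correct results: some damaged words are genuinely ambiguous.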

User contributions licensed under: CC BY-SA