Character encoding error when reading a PDF

Question

I need to read this PDF, and I am extracting its text with PyPDF2. However, the encoding is incorrect. It prints:

    Resultado da Prova de Sele“‰o do

But I expected:

    Resultado da Prova de Seleção do...

How can I solve this? I'm using Python 3.

Accepted Answer

The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, explicitly encoding the Unicode into UTF-8:

    # -*- coding: utf-8 -*-
    correct = u'Resultado da Prova de Seleção do...'
    print(correct.encode(encoding='utf-8'))

You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.

    # Show the locale aliases Python knows about
    import locale
    from pprint import pprint
    pprint(locale.locale_alias)

If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.

For example, a quick and dirty comparison of the good and bad strings you posted:

    # -*- coding: utf-8 -*-
    # Python 3.4
    incorrect = 'Resultado da Prova de Sele“‰o do'
    correct = 'Resultado da Prova de Seleção do...'

    print("Incorrect String")
    print("CHAR{}UNI".format(' ' * 20))
    print("-" * 50)
    for char in incorrect:
        print(
            '{}{}{}'.format(
                char.encode(encoding='utf-8'),
                ' ' * 20,  # Hack; bytes objects don't support format specs
                ord(char)
            )
        )

    print("\n" * 2)

    print("Correct String")
    print("CHAR{}UNI".format(' ' * 20))
    print("-" * 50)
    for char in correct:
        print(
            '{}{}{}'.format(
                char.encode(encoding='utf-8'),
                ' ' * 20,  # Hack; bytes objects don't support format specs
                ord(char)
            )
        )

Relevant output:

    b'\xe2\x80\x9c'                    8220
    b'\xe2\x80\xb0'                    8240
    b'\xc3\xa7'                    231
    b'\xc3\xa3'                    227

The correct "ç" is code point 231 (hex(231) returns '0xe7'). If you're getting code point 8220 in that position instead, then you're getting bad data back from PyPDF.
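The code point comparison can also be condensed into a small helper that reports only the positions where the two strings disagree. This is a minimal sketch, not part of PyPDF2: the function name `diff_code_points` is made up for illustration, and the strings are the ones from the question, written with escapes so the source file's encoding does not matter.

```python
import unicodedata


def diff_code_points(got, expected):
    """Return (index, got_char, expected_char) wherever the strings differ.

    Hypothetical helper for illustration; compares only up to the
    length of the shorter string.
    """
    return [
        (i, g, e)
        for i, (g, e) in enumerate(zip(got, expected))
        if g != e
    ]


# The bad and good strings from the question.
incorrect = 'Resultado da Prova de Sele\u201c\u2030o do'
correct = 'Resultado da Prova de Sele\u00e7\u00e3o do'

for index, got_char, expected_char in diff_code_points(incorrect, correct):
    print(
        index,
        hex(ord(got_char)), unicodedata.name(got_char),
        '->',
        hex(ord(expected_char)), unicodedata.name(expected_char),
    )
```

Seeing the Unicode names next to the code points makes it easier to tell whether the extracted text is a consistent character substitution (which points to a font encoding problem in the PDF) or random damage.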

Answer