Character encoding error when reading a PDF

Question

I need to read this PDF. The code I am using extracts the text, but the encoding is incorrect: the printed output contains the wrong characters instead of the text I expected. How can I solve this? I'm using Python 3.

Accepted Answer

The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, explicitly encoding the Unicode into UTF-8:

    # -*- coding: utf-8 -*-
    correct = u'Resultado da Prova de Seleção do...'
    print(correct.encode(encoding='utf-8'))

You're on Python 3, so you have Unicode under the hood, and Python 3 defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.

    # Show the locale aliases Python knows about
    import locale
    from pprint import pprint
    pprint(locale.locale_alias)

If that's not the quick fix, since you're getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It's possible that PyPDF wasn't able to determine the correct encoding and gave you the wrong characters.

For example, a quick and dirty comparison of the good and bad strings you posted:

    # -*- coding: utf-8 -*-
    # Python 3.4
    incorrect = 'Resultado da Prova de Sele“‰o do'
    correct = 'Resultado da Prova de Seleção do...'

    print("Incorrect String")
    print("CHAR{}UNI".format(' ' * 20))
    print("-" * 50)
    for char in incorrect:
        print(
            '{}{}{}'.format(
                char.encode(encoding='utf-8'),
                ' ' * 20,  # Hack; Byte objects don't have __format__
                ord(char)
            )
        )

    print("\n" * 2)

    print("Correct String")
    print("CHAR{}UNI".format(' ' * 20))
    print("-" * 50)
    for char in correct:
        print(
            '{}{}{}'.format(
                char.encode(encoding='utf-8'),
                ' ' * 20,  # Hack; Byte objects don't have __format__
                ord(char)
            )
        )

Relevant Output:

    b'\xe2\x80\x9c'                    8220
    b'\xe2\x80\xb0'                    8240
    b'\xc3\xa7'                    231
    b'\xc3\xa3'                    227

The correct "ç" is code point 231 (hex(231) gives '0xe7') and the correct "ã" is code point 227. If you're getting code points 8220 and 8240 in their place, then you're getting bad data back from PyPDF.
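The code-point comparison described in this answer can also be sketched with the standard-library unicodedata module, which adds the official Unicode name of each character. This is only an illustrative sketch, not part of the original answer: the helper name describe_non_ascii is hypothetical, and the two strings are the good and bad strings quoted in the answer, written with escapes so the code points are visible.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, UTF-8 bytes, Unicode name) for each non-ASCII char.

    Hypothetical helper for illustration; it only inspects the string,
    it does not repair it.
    """
    rows = []
    for char in text:
        if ord(char) > 127:
            rows.append(
                (
                    char,
                    ord(char),
                    char.encode("utf-8"),
                    unicodedata.name(char, "UNKNOWN"),
                )
            )
    return rows


# The bad and good strings quoted in the answer.
incorrect = "Resultado da Prova de Sele\u201c\u2030o do"      # Sele“‰o
correct = "Resultado da Prova de Sele\u00e7\u00e3o do..."     # Seleção

for label, text in (("Incorrect", incorrect), ("Correct", correct)):
    print(label)
    for char, code_point, utf8_bytes, name in describe_non_ascii(text):
        print("  U+{:04X} ({}) {!r} {}".format(code_point, code_point, utf8_bytes, name))
```

Skipping the ASCII characters keeps the output down to the characters that actually differ. The names make the mismatch easy to see: the extracted text contains a left double quotation mark and a per mille sign where the expected text has "c with cedilla" and "a with tilde".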

Answer