I need to read this PDF.
I am using the following code:
from PyPDF2 import PdfFileReader

f = open('myfile.pdf', 'rb')
reader = PdfFileReader(f)
content = reader.getPage(0).extractText()
f.close()

# Replace non-breaking spaces and collapse runs of whitespace
content = ' '.join(content.replace('\xa0', ' ').strip().split())
print(content)
However, the encoding is incorrect. It prints:
Resultado da Prova de Sele“‰o do...
But I expected
Resultado da Prova de Seleção do...
How can I solve it?
I’m using Python 3.
Answer
The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, explicitly encoding the Unicode string as UTF-8:
# -*- coding: utf-8 -*-
correct = u'Resultado da Prova de Seleção do...'
print(correct.encode(encoding='utf-8'))
You’re on Python 3, so strings are Unicode under the hood, and Python 3 defaults to UTF-8 for source code. But I wonder if you need to specify a different encoding based on your locale.
# Show the locale aliases Python knows about
import locale
from pprint import pprint

pprint(locale.locale_alias)
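To check whether the locale is really involved, it can help to look at the encodings Python has picked for the current session. This is a minimal sketch using only the standard library; the helper name `report_encodings` is made up for illustration.

```python
import locale
import sys


def report_encodings():
    """Return the encodings Python is using in this session."""
    return {
        # Locale-derived encoding, normally the default for open() without encoding=
        "preferred": locale.getpreferredencoding(False),
        # Encoding used when print() writes str to the terminal (may be None)
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding used by str.encode() / bytes.decode() when none is given
        "default": sys.getdefaultencoding(),
    }


for name, value in report_encodings().items():
    print("{:<10}{}".format(name, value))
```

If `stdout` reports something other than UTF-8, the terminal could be part of the problem. If it already reports UTF-8, the wrong characters are more likely coming from the extraction itself.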
If that’s not the quick fix, since you’re getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It’s possible that PyPDF wasn’t able to determine the correct encoding and gave you the wrong characters.
For example, a quick and dirty comparison of the good and bad strings you posted:
# -*- coding: utf-8 -*-
# Python 3.4
incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

print("Incorrect String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in incorrect:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )

print("\n" * 2)

print("Correct String")
print("CHAR{}UNI".format(' ' * 20))
print("-" * 50)
for char in correct:
    print(
        '{}{}{}'.format(
            char.encode(encoding='utf-8'),
            ' ' * 20,  # Hack; Byte objects don't have __format__
            ord(char)
        )
    )
Relevant Output:
b'\xe2\x80\x9c'                    8220
b'\xe2\x80\xb0'                    8240
b'\xc3\xa7'                    231
b'\xc3\xa3'                    227
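The same comparison can be made shorter by listing only the non-ASCII characters together with their Unicode names. This is a sketch using the standard `unicodedata` module; `describe_non_ascii` is a hypothetical helper name, not part of PyPDF2.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, code point, Unicode name) for each non-ASCII character."""
    return [
        (char, ord(char), unicodedata.name(char, "<unnamed>"))
        for char in text
        if ord(char) > 127
    ]


incorrect = 'Resultado da Prova de Sele“‰o do'
correct = 'Resultado da Prova de Seleção do...'

for label, text in (("incorrect", incorrect), ("correct", correct)):
    print(label)
    for char, code_point, name in describe_non_ascii(text):
        print("  {!r:<6}U+{:04X}  {}".format(char, code_point, name))
```

Restricting the listing to non-ASCII characters keeps the output to the handful of characters that actually differ between the two strings.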
If you expect code point 231 for ç (>>> hex(231)  # '0xe7') but get something else, such as 8220, then you’re getting bad data back from PyPDF.
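One hypothesis that fits these code points, and it is only an assumption since the question does not confirm it: the PDF's font stores the text as Mac OS Roman bytes, where 0x8D is ç and 0x8B is ã, but those bytes were decoded with a table in which 0x8D and 0x8B are “ and ‰, as they are in PDFDocEncoding. Under that assumption, a rough workaround is to map the known-bad characters back to bytes and decode them with Python's `mac_roman` codec. The table `PDFDOC_TO_BYTE` and the function `reinterpret_as_mac_roman` are hypothetical names, and the table covers only the two characters from the question.

```python
# Hypothetical repair table: the two wrong characters from the question,
# mapped to the byte values they are assumed to have come from.
PDFDOC_TO_BYTE = {
    '\u201c': 0x8D,  # LEFT DOUBLE QUOTATION MARK
    '\u2030': 0x8B,  # PER MILLE SIGN
}


def reinterpret_as_mac_roman(text):
    """Re-decode the known-bad characters as Mac OS Roman bytes."""
    fixed = []
    for char in text:
        byte = PDFDOC_TO_BYTE.get(char)
        if byte is None:
            fixed.append(char)  # leave every other character untouched
        else:
            fixed.append(bytes([byte]).decode('mac_roman'))
    return ''.join(fixed)


print(reinterpret_as_mac_roman('Resultado da Prova de Sele“‰o do...'))
```

This only patches the symptom for characters listed in the table. If the hypothesis holds for your file, other accented letters would need their own entries, and genuine “ or ‰ characters in the text would be converted by mistake.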