Character encoding error when reading a PDF

I need to read this PDF.

I am using the following code:

JavaScript

However, the encoding is incorrect. It prints:

JavaScript

But I expected:

JavaScript

How can I solve it?

I’m using Python 3.


Answer

The PyPDF2 extractText method returns Unicode, so you may just need to encode it explicitly. For example, encode the Unicode string into UTF-8:

JavaScript
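A minimal sketch of the explicit encoding step. The `text` variable is a hypothetical stand-in for whatever `extractText()` returned, so no PDF is actually read here, and the sample word is an assumption chosen only because it contains non-ASCII letters.

```python
# Stand-in for the str returned by PyPDF2's extractText().
# The sample word is an assumption, not taken from the PDF in the question.
text = "a\u00e7\u00e3o"  # "ação"

# str.encode turns the Unicode string into bytes in the named encoding.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)  # b'a\xc3\xa7\xc3\xa3o'

# Decoding with the same encoding gives back the original string.
round_tripped = utf8_bytes.decode("utf-8")
print(round_tripped == text)  # True
```

Note that encoding produces a `bytes` object, so printing it shows escape sequences rather than readable characters. If the goal is only to display the text, printing the `str` itself is usually enough.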

You’re on Python 3, so strings are Unicode under the hood, and str.encode() defaults to UTF-8. But I wonder if you need to specify a different encoding based on your locale.

JavaScript
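One way to try a locale-based encoding is to ask the standard library which encoding the current locale prefers. This is only a sketch: `text` is again a hypothetical stand-in for the extracted string, and the encoding reported will differ from machine to machine.

```python
import locale

# Encoding preferred by the current locale, e.g. "UTF-8" or "cp1252".
# Passing False avoids changing the process-wide locale settings.
preferred = locale.getpreferredencoding(False)

# Stand-in for the extracted text (an assumption, not the real PDF content).
text = "a\u00e7\u00e3o"  # "ação"

# errors="replace" substitutes "?" for characters the target encoding
# cannot represent, instead of raising UnicodeEncodeError.
encoded = text.encode(preferred, errors="replace")
print(preferred, encoded)
```

If the encoded bytes contain `?` where accented letters should be, the locale's encoding cannot represent those characters and a different encoding would be needed.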

If that’s not the quick fix, since you’re getting Unicode back from PyPDF, you could take a look at the code points for those two characters. It’s possible that PyPDF wasn’t able to determine the correct encoding and gave you the wrong characters.

For example, a quick and dirty comparison of the good and bad strings you posted:

JavaScript
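A rough sketch of such a comparison. The two sample strings are assumptions: they contain only the four characters whose code points appear in the output, whereas the strings in the question were presumably longer. The helper name `describe` is hypothetical.

```python
def describe(chars):
    """Return (UTF-8 bytes, code point) pairs, one per character."""
    return [(ch.encode("utf-8"), ord(ch)) for ch in chars]


# First pair: left double quotation mark and per mille sign.
# Second pair: c with cedilla and a with tilde.
samples = ("\u201c\u2030", "\u00e7\u00e3")

for sample in samples:
    for utf8_bytes, code_point in describe(sample):
        print(utf8_bytes, code_point)
    print()  # blank line between the two groups
```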

Relevant Output:

b'\xe2\x80\x9c' 8220
b'\xe2\x80\xb0' 8240

b'\xc3\xa7' 231
b'\xc3\xa3' 227

If you’re getting code point 231 (>>> hex(231)  # '0xe7'), then you’re getting bad data back from PyPDF.
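To see which characters those code points stand for, the standard-library `unicodedata` module can name them. This is a small sketch; the list of code points is simply copied from the output above.

```python
import unicodedata

# Code points taken from the comparison output.
code_points = (8220, 8240, 231, 227)

# Map each code point to its official Unicode character name.
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

for cp, name in names.items():
    print(cp, hex(cp), name)
# 8220 0x201c LEFT DOUBLE QUOTATION MARK
# 8240 0x2030 PER MILLE SIGN
# 231 0xe7 LATIN SMALL LETTER C WITH CEDILLA
# 227 0xe3 LATIN SMALL LETTER A WITH TILDE
```

If the names reported for the extracted text do not match the characters visible in the PDF, the string already holds the wrong characters, and re-encoding it is unlikely to help.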
