Use PyPDF2 to detect non-embedded fonts in a PDF file generated by Google Docs

Question

I was hoping someone could help me write a Python function to detect any fonts in a PDF file that are not embedded in it. I've attempted to use the script linked here, and it can detect the document's fonts, but it does not detect fonts which are embedded. I've pasted the script below for convenience: For example, I've downloaded

Accepted Answer

The issue is that this script does not handle lists. For example, in the PDF generated by Google Docs, the font object has this structure:

    {'/Encoding': '/Identity-H', '/Type': '/Font', '/BaseFont': '/Pacifico-Regular', '/ToUnicode': IndirectObject(9, 0), '/DescendantFonts': [IndirectObject(16, 0)], '/Subtype': '/Type0'}

The key /DescendantFonts maps to a list of values, and it is only by recursing into the elements of that list that you reach the keys for font files. You have to modify the script to test for arrays as well, for example:

    if isinstance(obj, PyPDF2.generic.ArrayObject):  # You can also do duck typing here
        for i in obj:
            if hasattr(i, 'keys'):
                walk(i, all_fonts, embedded_fonts)
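To show how the array check might fit into a complete walker, here is a minimal duck-typed sketch. It is not the script from the question: the names FONT_FILE_KEYS and resolve, the seen guard against reference cycles, and the step that resolves list elements through get_object (or getObject on older PyPDF2 releases) are assumptions added for illustration. It works on anything that behaves like nested dictionaries and lists, which is why it does not import PyPDF2 itself.

```python
# Keys whose presence in a font descriptor indicates an embedded font program.
FONT_FILE_KEYS = {"/FontFile", "/FontFile2", "/FontFile3"}


def resolve(obj):
    """Follow an indirect reference if the object looks like one (assumption:
    it exposes get_object or getObject); otherwise return it unchanged."""
    for name in ("get_object", "getObject"):
        getter = getattr(obj, name, None)
        if callable(getter):
            return getter()
    return obj


def walk(obj, all_fonts, embedded_fonts, seen=None):
    """Recursively collect font names from dictionaries AND lists."""
    seen = {} if seen is None else seen
    obj = resolve(obj)
    if id(obj) in seen:          # hypothetical guard against cyclic references
        return all_fonts, embedded_fonts
    seen[id(obj)] = obj          # keep a reference so the id stays unique

    if hasattr(obj, "keys"):     # dictionary-like object
        if "/BaseFont" in obj:
            all_fonts.add(str(obj["/BaseFont"]))
        if "/FontName" in obj and FONT_FILE_KEYS.intersection(obj.keys()):
            embedded_fonts.add(str(obj["/FontName"]))
        for key in obj.keys():
            walk(obj[key], all_fonts, embedded_fonts, seen)
    elif isinstance(obj, (list, tuple)):   # array-like, e.g. /DescendantFonts
        for item in obj:
            walk(item, all_fonts, embedded_fonts, seen)
    return all_fonts, embedded_fonts
```

With a real file, one would presumably call walk on each page's /Resources dictionary, sharing the same two sets across pages, and then treat the set difference all_fonts - embedded_fonts as the fonts that are not embedded.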

Answer