Use PyPDF2 to detect non-embedded fonts in PDF file generated by Google Docs

I was hoping someone could help me write a Python function that detects any fonts used in a PDF file which are not embedded in it. I've tried the script linked here. It detects the document's fonts, but it does not recognize the ones that are embedded. I've pasted the script below for convenience:

import sys
from PyPDF2 import PdfFileReader

fontkeys = set(["/FontFile", "/FontFile2", "/FontFile3"])


def walk(obj, fnt, emb):
    if "/BaseFont" in obj:
        fnt.add(obj["/BaseFont"])

    elif "/FontName" in obj and fontkeys.intersection(set(obj)):
        emb.add(obj["/FontName"])

    for k in obj:
        if hasattr(obj[k], "keys"):
            walk(obj[k], fnt, emb)

    return fnt, emb


if __name__ == "__main__":
    file_name = sys.argv[1]
    reader = PdfFileReader(file_name)
    fonts = set()
    embedded = set()

    for page in reader.pages:
        obj = page.getObject()
        f, e = walk(obj["/Resources"], fonts, embedded)
        fonts = fonts.union(f)
        embedded = embedded.union(e)

    unembedded = fonts - embedded
    print("Font List")
    print(sorted(list(fonts)))
    if unembedded:
        print("\nUnembedded Fonts")
        print(unembedded)

For example, I've downloaded a PDF from Google Docs (type some stuff, save as PDF) that uses the Arial font, and Adobe Reader confirms that the font is embedded. However, the script returns ['/ArialMT'] as the font list and an empty set for embedded fonts. Additionally, none of the objects reached by the recursion seem to have any of the keys {'/FontFile', '/FontFile2', '/FontFile3'}. The script works on other PDFs I've tried, so it must be something unusual about the Google Docs PDFs. Let me know what other debug information I can provide for this PDF file.

[Screenshot: Adobe confirming that the font is embedded]

One thought was that Google Docs might only embed fonts that are not among the 14 standard PDF fonts. However, I tried it with an unusual font (Pacifico), and the script also reported that this font was not embedded, while Adobe says it is.

[Screenshot: Pacifico being embedded, script saying it's not]

I tried it with this PDF, and the script correctly stated that these 14 fonts were embedded.

Answer

The issue is that the script does not handle lists (PDF arrays). In the Google Docs PDF, for example, the font object has this structure:

{'/Encoding': '/Identity-H', '/Type': '/Font', '/BaseFont': '/Pacifico-Regular', '/ToUnicode': IndirectObject(9, 0), '/DescendantFonts': [IndirectObject(16, 0)], '/Subtype': '/Type0'}

The key /DescendantFonts maps to a list of values. If you recurse into the entries of that list, you eventually reach the dictionaries that contain the font file keys. You have to modify the script to walk arrays as well, for example:

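# Inside walk(); requires "import PyPDF2" at the top of the script.
if isinstance(obj, PyPDF2.generic.ArrayObject):  # duck typing also works here
    for i in obj:
        item = i.getObject()  # resolve IndirectObject entries to the real object
        if hasattr(item, "keys"):
            walk(item, fnt, emb)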
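
Putting the pieces together, here is a minimal sketch of a walker that descends into both dictionaries and arrays. It is duck-typed, so it is shown on plain Python dicts and lists that mimic the Pacifico structure above. The names FONT_FILE_KEYS, resolve and seen are hypothetical helpers introduced for illustration, and the get_object/getObject lookup assumes the objects are PyPDF2-style objects that expose one of those methods. Treat it as a starting point, not a tested replacement for the original script.

```python
FONT_FILE_KEYS = {"/FontFile", "/FontFile2", "/FontFile3"}


def resolve(obj):
    """Follow an indirect reference if the object offers a resolver method."""
    # Assumption: newer PyPDF2 spells it get_object, older releases getObject.
    for name in ("get_object", "getObject"):
        method = getattr(obj, name, None)
        if callable(method):
            return method()
    return obj  # plain Python values are returned unchanged


def walk(obj, fonts, embedded, seen=None):
    """Collect used font names and the names of fonts that carry a font file."""
    seen = {} if seen is None else seen
    obj = resolve(obj)
    is_dict = hasattr(obj, "keys")
    is_array = isinstance(obj, (list, tuple))
    if not (is_dict or is_array) or id(obj) in seen:
        return fonts, embedded
    seen[id(obj)] = obj  # holding a reference keeps the id unique; stops cycles

    if is_dict:
        if "/BaseFont" in obj:
            fonts.add(str(obj["/BaseFont"]))
        if "/FontName" in obj and FONT_FILE_KEYS.intersection(obj.keys()):
            embedded.add(str(obj["/FontName"]))
        children = [obj[key] for key in obj.keys()]
    else:
        children = list(obj)  # array entries, e.g. the items of /DescendantFonts

    for child in children:
        walk(child, fonts, embedded, seen)
    return fonts, embedded


# Made-up sample data shaped like the Google Docs font object shown above.
descriptor = {"/Type": "/FontDescriptor", "/FontName": "/Pacifico-Regular",
              "/FontFile2": {"/Length": 10}}
cid_font = {"/Type": "/Font", "/Subtype": "/CIDFontType2",
            "/BaseFont": "/Pacifico-Regular", "/FontDescriptor": descriptor}
type0_font = {"/Type": "/Font", "/Subtype": "/Type0",
              "/BaseFont": "/Pacifico-Regular", "/DescendantFonts": [cid_font]}
sample_resources = {"/Font": {"/F1": type0_font,
                              "/F2": {"/Type": "/Font", "/BaseFont": "/Helvetica"}}}

fonts, embedded = walk(sample_resources, set(), set())
print(sorted(fonts - embedded))  # prints ['/Helvetica']
```

With a real file you would pass each page's "/Resources" entry to walk() in place of sample_resources, as the loop in the question already does.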
User contributions licensed under: CC BY-SA