PDFminer get font size from headers per each page (iteration)

Question

I am quite new to Python and PDFMiner, which is a bit complex for me. What I am trying to achieve is to extract the title of each page from a PDF file or set of slides. My approach is to get a list of the text lines and their font sizes per page, and then pick the text with the largest font size as the slide heading.

Accepted Answer

Full disclosure: I'm one of the maintainers of pdfminer.six.

A Pythonic way of doing this would be the following.

    import os

    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer, LTChar


    def get_font_sizes(paragraph: LTTextContainer):
        """Get the font sizes for every LTChar element in this LTTextContainer"""
        return [
            char.size
            for line in paragraph
            for char in line
            if isinstance(char, LTChar)
        ]


    def list_sized_paragraphs(page):
        """List all the paragraphs and their maximum font size on this page"""
        return [
            (max(get_font_sizes(paragraph)), paragraph.get_text())
            for paragraph in page
            if isinstance(paragraph, LTTextContainer)
        ]


    file_path = '~/Downloads/covid_19_training_tool_v3_01.05.2021_508.pdf'
    for page in extract_pages(os.path.expanduser(file_path)):
        _, text = max(list_sized_paragraphs(page))
        print('---')
        print(text.strip())

For page 8 this prints:

    Pandemic declaration

Note: this does not work for all pages, because sometimes a caution or note has a bigger font size than the header.
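One possible way to soften that limitation is sketched below. It is a variation on the code above, not part of the original answer. It assumes that slide headings sit in the upper half of the page, which may not hold for every deck. The names `Candidate`, `pick_heading`, `page_candidates` and `print_headings` are hypothetical helpers introduced for illustration; only `extract_pages`, `LTTextContainer`, `LTChar` and the layout attributes `y0`, `y1` and `height` come from pdfminer.six. The sketch also skips text boxes without any `LTChar` and pages without text, cases where calling `max()` on an empty list would raise a `ValueError`.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical helper type: (max font size, top y-coordinate, text) of one text box.
Candidate = Tuple[float, float, str]


def pick_heading(candidates: Iterable[Candidate],
                 min_top: Optional[float] = None) -> Optional[str]:
    """Pick the text with the largest font size; ties go to the box nearest the top.

    If min_top is given, boxes whose top edge is below it are ignored,
    unless that would leave nothing to choose from.
    """
    pool = list(candidates)
    if min_top is not None:
        upper = [c for c in pool if c[1] >= min_top]
        pool = upper or pool  # fall back to every box if none is high enough
    if not pool:
        return None  # page without any text
    # PDF coordinates start at the bottom-left, so a larger y means higher up.
    return max(pool, key=lambda c: (c[0], c[1]))[2].strip()


def page_candidates(page) -> List[Candidate]:
    """Collect one candidate per text box on a pdfminer.six page."""
    # Imported here so pick_heading can be used without pdfminer.six installed.
    from pdfminer.layout import LTChar, LTTextContainer

    result = []
    for box in page:
        if not isinstance(box, LTTextContainer):
            continue
        sizes = [
            char.size
            for line in box
            for char in line
            if isinstance(char, LTChar)
        ]
        if sizes:  # skip boxes that contain no LTChar
            result.append((max(sizes), box.y1, box.get_text()))
    return result


def print_headings(pdf_path: str) -> None:
    """Print a guessed heading for every page of the PDF."""
    from pdfminer.high_level import extract_pages

    for page_number, page in enumerate(extract_pages(pdf_path), start=1):
        # Assumption: headings are in the upper half of the page.
        min_top = page.y0 + 0.5 * page.height
        heading = pick_heading(page_candidates(page), min_top=min_top)
        print(page_number, heading if heading is not None else '(no text found)')
```

Calling `print_headings(os.path.expanduser(file_path))` would replace the loop in the answer. The 0.5 threshold is a guess and would likely need tuning per document.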

Answer