I am quite new to python and PDFminer which is a bit complex for me, what I am trying to achieve is extract the title each page from a pdf file or slides.
My approach is getting a list of the text lines and the font size per page, then I will pick the highest number as the slide heading usually written in a higher font size.
This is what I did so far:
Suppose I want to get the page #8 title from this pdf file. File sample
This is how page #8 content looks like:
This is the code to get all pages font size per line:
from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer, LTChar,LTLine,LAParams import os path=r'cov.pdf' Extract_Data=[] for page_layout in extract_pages(path): for element in page_layout: if isinstance(element, LTTextContainer): for text_line in element: for character in text_line: if isinstance(character, LTChar): Font_size=character.size Extract_Data.append([Font_size,(element.get_text())])
The generated list Extract_Data
is for all pages of the pdf document. My question is how can I get this list for each page (iteration) of the document?
expected output for page number 8 only and so on for each page / then if I want to pick the page title, it will be the item(line) with the highest value in font size:
[[32.039999999999964, 'Pandemic declaration n'], [24.0, ' n'], [24.0, ' n'], [24.0, '• On March 11, 2020, the World Health Organization n(WHO) characterized COVID-19 as a pandemic. n n• It has caused severe illness and death. It features n nsustained person-to-person spread worldwide. n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, '• It poses an especially high risk for the elderly (60 or n n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, 'older), people with preexisting health conditions such nas high blood pressure, heart disease, lung disease, n ndiabetes, autoimmune disorders, and certain workers. n n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [24.0, ' n'], [14.04, '8 n']]
Advertisement
Answer
Full disclosure, I’m one of the maintainers of pdfminer.six.
A pythonic way of doing this would be the following.
import os from pdfminer.high_level import extract_pages from pdfminer.layout import LTTextContainer, LTChar def get_font_sizes(paragraph: LTTextContainer): """Get the font sizes for every LTChar element in this LTTextContainer""" return [ char.size for line in paragraph for char in line if isinstance(char, LTChar) ] def list_sized_paragraphs(page): """List all the paragraphs and their maximum font size on this page""" return [ (max(get_font_sizes(paragraph)), paragraph.get_text()) for paragraph in page if isinstance(paragraph, LTTextContainer) ] file_path = '~/Downloads/covid_19_training_tool_v3_01.05.2021_508.pdf' for page in extract_pages(os.path.expanduser(file_path)): _, text = max(list_sized_paragraphs(page)) print('---') print(text.strip())
For page 8 this prints:
Pandemic declaration
Note: this does not work for all pages because sometimes a caution or note has a bigger font size than then header.