
PDFminer get font size from headers per each page (iteration)

I am quite new to Python and PDFMiner, which is a bit complex for me. What I am trying to achieve is to extract the title of each page from a PDF file or slide deck.

My approach is to get a list of the text lines and their font sizes for each page, and then pick the line with the largest font size as the slide heading, since headings are usually written in a larger font.
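The selection step of this approach can be sketched in plain Python. This is only an illustration: the `pick_title` helper and the sample lines are hypothetical, and it assumes each page has already been reduced to a list of `(text, font_size)` pairs.

```python
def pick_title(lines):
    """Return the (text, font_size) pair with the largest font size, or None."""
    # max() keeps the first of several equally large lines, i.e. the one
    # that appears earliest in the list.
    return max(lines, key=lambda item: item[1], default=None)


# Hypothetical lines for a single page, in reading order.
page_lines = [
    ("Section heading", 24.0),
    ("First line of body text", 12.0),
    ("Second line of body text", 12.0),
]

print(pick_title(page_lines))  # ('Section heading', 24.0)
```

Returning `None` for an empty list keeps pages without any text from raising an error.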

This is what I did so far:

Suppose I want to get the title of page #8 from this PDF file (file sample).

This is what the content of page #8 looks like:

[Image: screenshot of page #8]

This is the code to get the font size of each line for all pages:


The generated list Extract_Data covers all pages of the PDF document. My question is: how can I get this list separately for each page of the document as I iterate over it?

Expected output for page 8 only (and likewise for every other page). To pick the page title, I would then take the item (line) with the largest font size:



Answer

Full disclosure, I’m one of the maintainers of pdfminer.six.

A Pythonic way of doing this would be the following.


For page 8 this prints:


Note: this does not work for all pages, because sometimes a caution or note has a bigger font size than the header.
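For reference, per-page iteration with pdfminer.six can be sketched roughly as below. This is a minimal sketch and not necessarily the code from the answer: the function names `iter_page_lines` and `print_titles` are hypothetical, it treats the largest character size on a line as that line's font size, and it only looks at text boxes directly on the page (text nested inside figures is ignored).

```python
def iter_page_lines(pdf_path):
    """Yield (page_number, lines) per page; lines is a list of (text, font_size)."""
    # Imported inside the function so this module still loads where
    # pdfminer.six is not installed (pip install pdfminer.six).
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextBox, LTTextLine

    # extract_pages() yields one layout object per page, in document order.
    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        lines = []
        for box in page_layout:
            if not isinstance(box, LTTextBox):
                continue  # skip images, rectangles, figures, ...
            for line in box:
                if not isinstance(line, LTTextLine):
                    continue
                # Assumption: the biggest glyph represents the line's font size.
                sizes = [ch.size for ch in line if isinstance(ch, LTChar)]
                text = line.get_text().strip()
                if text and sizes:
                    lines.append((text, round(max(sizes), 2)))
        yield page_number, lines


def print_titles(pdf_path):
    """Print, for each page, the line with the largest font size."""
    for page_number, lines in iter_page_lines(pdf_path):
        if lines:
            text, size = max(lines, key=lambda item: item[1])
            print(f"page {page_number}: {text!r} (size {size})")
```

Calling `print_titles` with the path to a PDF prints one candidate title per page. As the note above says, the largest line is not always the real header.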

User contributions licensed under: CC BY-SA