
pdfplumber | Extract text from dynamic column layouts

Attempted Solution at bottom of post.

I have near-working code that extracts the sentence containing a phrase, across multiple lines.

However, some pages have columns, so the output for those pages is incorrect: text from separate columns is wrongly merged into a single malformed sentence.

This problem has been addressed in the following posts:


Question:

How do I check, with an if-condition, whether a page has columns?

  • Pages may not have columns,
  • Pages may have more than 2 columns.
  • Pages may also have headers and footers (that can be left out).
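One way to test for columns, sketched here under assumptions rather than taken from the post, is to look for vertical strips that no word overlaps. The functions below work on plain dicts with `x0`, `x1`, `top`, and `bottom` keys, which is the shape pdfplumber's `page.extract_words()` returns. The names `find_column_gutters`, `drop_header_footer`, and `count_columns` are hypothetical, and the `min_gap` and `margin` values (in PDF points) are guesses that would need tuning per document.

```python
def find_column_gutters(words, min_gap=15.0):
    """Return (left, right) x-ranges that no word overlaps: candidate gutters.

    `min_gap` is an assumed threshold in PDF points, not a pdfplumber setting.
    """
    intervals = sorted((float(w["x0"]), float(w["x1"])) for w in words)
    gutters = []
    if not intervals:
        return gutters
    covered_until = intervals[0][1]  # right edge of the x-range covered so far
    for x0, x1 in intervals[1:]:
        if x0 - covered_until >= min_gap:
            gutters.append((covered_until, x0))
        covered_until = max(covered_until, x1)
    return gutters


def drop_header_footer(words, page_height, margin=50.0):
    """Keep only words that lie fully inside the page body.

    Full-width headers and footers would otherwise cover the gutter.
    """
    return [
        w for w in words
        if float(w["top"]) >= margin and float(w["bottom"]) <= page_height - margin
    ]


def count_columns(words, page_height, min_gap=15.0, margin=50.0):
    """Estimate the number of text columns: one more than the gutters found."""
    body = drop_header_footer(words, page_height, margin)
    return len(find_column_gutters(body, min_gap)) + 1
```

With pdfplumber, the condition could then read `if count_columns(page.extract_words(), page.height) > 1:`. This handles pages with no columns and pages with more than 2, but a full-width figure caption or table inside the body would hide a gutter, so it is a heuristic and not a guarantee.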

Example .pdf with dynamic text layout: PDF (pg. 2).

Jupyter Notebook:


Example Incorrect Output:


Attempted Minimal Solution: This separates the text into 2 columns, regardless of whether the page actually has 2.

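A fixed two-way split can be generalized by cropping at a list of column boundaries, so that an empty list leaves the page whole. The following is a rough sketch of that idea, not the code from this post. `column_bboxes` and `extract_text_by_column` are hypothetical helpers, and the `header` and `footer` heights are assumed values to be supplied by the caller. It also assumes the page's coordinate origin is at (0, 0), which may not hold for every PDF.

```python
def column_bboxes(page_width, page_height, boundaries, header=0.0, footer=0.0):
    """Return one (x0, top, x1, bottom) box per column, left to right.

    `boundaries` lists the x-positions that separate columns; an empty list
    yields a single box covering the page body.
    """
    edges = [0.0] + sorted(float(b) for b in boundaries) + [float(page_width)]
    top = float(header)
    bottom = float(page_height) - float(footer)
    return [(left, top, right, bottom) for left, right in zip(edges, edges[1:])]


def extract_text_by_column(pdf_path, page_index, boundaries, header=0.0, footer=0.0):
    """Crop each column with pdfplumber and return the texts in reading order."""
    # Third-party dependency, imported lazily so column_bboxes stays stdlib-only.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page = pdf.pages[page_index]
        boxes = column_bboxes(page.width, page.height, boundaries, header, footer)
        # extract_text() can return None for an empty crop, hence the fallback.
        return [page.crop(box).extract_text() or "" for box in boxes]
```

Passing `boundaries=[page.width / 2]` reproduces the unconditional two-column split, while `boundaries=[]` extracts the page as a single block.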

Please let me know if there is anything else I should clarify.


Answer

This answer lets you scrape text in the intended reading order.

Towards Data Science article PDF Text Extraction in Python:

Compared with PyPDF2, PDFMiner’s scope is much more limited, it really focuses only on extracting the text from the source information of a pdf file.


Cleansing can be applied thereafter.
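As a minimal sketch of extraction followed by cleansing, assuming the pdfminer.six package and its `extract_text` helper (not necessarily the call this answer used): the cleansing step rejoins words hyphenated across line breaks, collapses whitespace, and then picks out the sentences containing a phrase. The helper names `extract_page_text`, `clean_text`, and `sentences_containing` are hypothetical, and the sentence splitter is deliberately naive.

```python
import re


def extract_page_text(pdf_path, page_index):
    """Return the text of one page (zero-indexed) using pdfminer.six."""
    # Third-party dependency, imported lazily so the cleansing helpers
    # below can be used without it.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path, page_numbers=[page_index])


def clean_text(raw):
    """Undo line wrapping so sentences can be matched across lines."""
    # Rejoin words split by a hyphen at a line end, e.g. "sec-\nond".
    # Naive: this also merges genuine compounds that happen to break there.
    text = re.sub(r"-\n(?=[a-z])", "", raw)
    # Collapse newlines and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def sentences_containing(text, phrase):
    """Return the sentences of `text` that contain `phrase` (case-insensitive)."""
    # Split after ., ! or ? followed by whitespace; abbreviations will
    # cause false splits.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if phrase.lower() in s.lower()]
```

A typical call chain would be `sentences_containing(clean_text(extract_page_text(path, 1)), "some phrase")`, where `path` and the phrase are placeholders.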
