Extract Number of pages from a text column

I have a text column which contains comments like:

  1. 6 pages, LaTeX, no figures
  2. 19 pages, latex, 4 figures as uuencoded postscript files
  3. Invited Talk at the “VII Marcel Grossman Meeting on General Relativity” – Stanford, July 1994. 14 pages, latex, five figures, which will be available upon request.
  4. 15 pp. Phyzzx

I want to extract the number of pages from each comment. Some rows have no comment at all, or have a comment with no page information; those should probably be NA.

Answer

This works as long as each comment contains at most one page count.

import re

comments = [
    "6 pages, LaTeX, no figures",
    "112 cucumber",
    "19 pages, latex, 4 figures as uuencoded postscript files",
    # Adjacent string literals are joined into a single comment.
    "Invited Talk at the ``VII Marcel Grossman Meeting on General "
    "Relativity'' - Stanford, July 1994. 14 pages, latex, five figures, "
    "which will be available upon request.",
    "15 pp. Phyzzx",
]

def page_num_extract(text: list) -> list:
    out = []
    for line in text:
        # Find "<digits> pages" or "<digits> pp." (raw strings keep \d intact).
        pages = re.findall(r"\d+ pages|\d+ pp\.", line)
        # Keep only the leading digits; str() with no arguments is "".
        pages = re.findall(r"\d*", str(*pages))[0]
        if not pages:
            pages = "NA"
        out.append(pages)
    return out

page_num_extract(comments)

['6', 'NA', '19', '14', '15']
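
If a comment can be missing entirely, or can mention pages more than once, a variant built on `re.search` may be more forgiving. This is only a sketch under a few assumptions that are not in the question: missing comments arrive as `None` or an empty string, the first page count in a comment is the one wanted, and "page", "pages" and "pp." (in any letter case) are the only spellings. The names `PAGE_PATTERN` and `extract_pages` are made up for illustration.

```python
import re
from typing import Optional

# Assumed pattern: digits, optional whitespace, then "page", "pages" or "pp."
PAGE_PATTERN = re.compile(r"(\d+)\s*(?:pages?\b|pp\.)", re.IGNORECASE)

def extract_pages(comment: Optional[str]) -> Optional[int]:
    """Return the first page count found in a comment, or None if there is none."""
    if not comment:  # covers both None and the empty string
        return None
    match = PAGE_PATTERN.search(comment)
    return int(match.group(1)) if match else None

samples = [
    "6 pages, LaTeX, no figures",
    "112 cucumber",
    None,
    "15 pp. Phyzzx",
    "Part I: 10 pages; Part II: 12 pages",
]
print([extract_pages(s) for s in samples])
```

Returning `None` rather than the string "NA" keeps the result numeric where a count exists, so it can be turned into a proper missing value later; swap in "NA" if a string column is preferred.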

User contributions licensed under: CC BY-SA