I have a text column which contains comments like:
- 6 pages, LaTeX, no figures
- 19 pages, latex, 4 figures as uuencoded postscript files
- Invited Talk at the “VII Marcel Grossman Meeting on General Relativity” – Stanford, July 1994. 14 pages, latex, five figures, which will be available upon request.
- 15 pp. Phyzzx
I want to extract the number of pages from each comment. Some rows have no comment at all, or have no page information; those should probably be NA.
Answer
This works as long as each comment contains at most one page count.
import re

comments = [
    "6 pages, LaTeX, no figures",
    "112 cucumber",
    "19 pages, latex, 4 figures as uuencoded postscript files",
    "Invited Talk at the ``VII Marcel Grossman Meeting on General Relativity'' - Stanford, July 1994. 14 pages, latex, five figures, which will be available upon request.",
    "15 pp. Phyzzx",
]

def page_num_extract(text: list) -> list:
    out = []
    for line in text:
        # Find a number followed by " pages" or " pp."
        pages = re.findall(r"\d+ pages|\d+ pp\.", line)
        # Keep only the leading digits; an empty match list gives ""
        pages = re.findall(r"\d*", str(*pages))[0]
        if not pages:
            pages = "NA"
        out.append(pages)
    return out

page_num_extract(comments)
['6', 'NA', '19', '14', '15']
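If a comment may mention pages more than once, or a row may hold no text at all, a variant with a single capturing group might be more forgiving. This is only a sketch, not the method from the answer above: the names `PAGE_RE` and `extract_pages` are hypothetical, and it assumes that the first page count is the one wanted and that "page", "pages", and "pp" are the only spellings that occur.

```python
import re

# Assumed pattern: digits, optional whitespace, then "page", "pages" or "pp"
# as a whole word (so "112 cucumber" or "5 ppm" do not match).
PAGE_RE = re.compile(r"(\d+)\s*(?:pages?|pp)\b", re.IGNORECASE)

def extract_pages(comment):
    """Return the first page count in `comment` as an int, or None if absent."""
    # Missing rows (None, NaN, etc.) are not strings, so they map to None.
    if not isinstance(comment, str):
        return None
    match = PAGE_RE.search(comment)
    return int(match.group(1)) if match else None

comments = [
    "6 pages, LaTeX, no figures",
    "112 cucumber",
    "15 pp. Phyzzx",
    None,
]
results = [extract_pages(c) for c in comments]
```

Returning `None` rather than the string "NA" keeps the result numeric where a count exists. For a pandas column, the same pattern string could probably be passed to `Series.str.extract`, which leaves rows without a match as missing values.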