I am using Python and have the following regular expression to extract text from text files:
pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE|Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'
pd_00['important_text'] = pd_00['text'].str.extract(pattern, re.IGNORECASE, expand=False)
My issue is specifically with the last term, “Pursuant to the requirements of the Securities Exchange Act of 1934”. In the text files, this sentence is sometimes spaced irregularly, with different parts of it starting on new lines. How do I account for this? Right now the pattern only picks it up when it is written with single, regular spaces.
Answer
First, note that your pattern is more verbose than it needs to be; some parts can be shortened:
Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07 => Item\s+(?:[89]\.01|5\.0[37])
SIGNATURES|SIGNATURE => SIGNATURES?
SIGNATURES? matches SIGNATURES or SIGNATURE, as S? matches one or zero S chars.
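As a quick sanity check of the two simplifications, the sketch below compares the verbose and compact alternations on a handful of strings. The sample strings and the names `verbose`, `compact` and `agreement` are made up for illustration; only the standard `re` module is used.

```python
import re

# Verbose alternation from the question vs. the shortened form.
verbose = re.compile(r'Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE')
compact = re.compile(r'Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?')

# Illustrative inputs: six that should match and two that should not.
samples = ["Item 8.01", "Item 9.01", "Item  5.03", "Item 5.07",
           "SIGNATURE", "SIGNATURES", "Item 5.02", "Item 7.01"]

# For every sample, record whether each pattern matches the whole string.
agreement = {
    s: (bool(verbose.fullmatch(s)), bool(compact.fullmatch(s)))
    for s in samples
}

for s, (v, c) in agreement.items():
    print(f"{s!r}: verbose={v}, compact={c}")
```

On these inputs both patterns should give the same answer for every string, which is what makes the shorter form a safe replacement.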
So, now, revamp the pattern as indicated and replace the spaces in the final phrase with \s+, which matches one or more whitespace characters, including line breaks:
pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|Pursuant\s+to\s+the\s+requirements\s+of\s+the\s+Securities\s+Exchange\s+Act\s+of\s+1934)\b)'
See the regex demo.
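Typing \s+ between every word by hand is error-prone, so one option is to build that part of the pattern from the plain phrase. The following is a minimal sketch, not part of the original answer: the helper `flexible_phrase`, the constant `END_PHRASE` and the sample text are hypothetical, and it uses plain `re.search` on a single string rather than a DataFrame.

```python
import re

def flexible_phrase(phrase):
    """Hypothetical helper: join the words of a literal phrase with \\s+
    so any run of spaces, tabs or newlines is accepted between them."""
    return r'\s+'.join(re.escape(word) for word in phrase.split())

END_PHRASE = "Pursuant to the requirements of the Securities Exchange Act of 1934"

pattern = (
    r'\bItem\s+5\.02\s*([\w\W]*?)'
    r'(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|'
    + flexible_phrase(END_PHRASE)
    + r')\b)'
)

# Made-up filing text in which the closing sentence is broken across lines.
sample = (
    "Item 5.02 Departure of Directors.\n"
    "Jane Doe resigned.\n\n"
    "Pursuant   to the\nrequirements of the Securities\n"
    "   Exchange Act of 1934, the registrant..."
)

match = re.search(pattern, sample, flags=re.IGNORECASE)
extracted = match.group(1) if match else None
print(extracted)
```

The resulting `pattern` string should be usable with the pandas call from the question as well, for example `pd_00['text'].str.extract(pattern, flags=re.IGNORECASE, expand=False)`.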