
Regex – How to account for random spacing/line breaks in a term

I am using Python and have the following regular expression to extract text from text files:

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE|Pursuant to the requirements of the Securities Exchange Act of 1934)\b)'

    pd_00['important_text'] = pd_00['text'].str.extract(pattern, re.IGNORECASE, expand=False)

My issue is specifically with the last term, “Pursuant to the requirements of the Securities Exchange Act of 1934”. In the text files, this sentence is sometimes spaced irregularly, with different parts of the sentence starting on new lines. How do I account for this? Right now the pattern only matches when the sentence is written on one line with single spaces.


Answer

First, note that your pattern is more verbose than it needs to be. You can shorten some parts:

    Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07  =>  Item\s+(?:[89]\.01|5\.0[37])
    SIGNATURES|SIGNATURE                       =>  SIGNATURES?

SIGNATURES? matches SIGNATURES or SIGNATURE as S? matches one or zero S chars.
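
As a quick sanity check that the shortened alternatives accept the same strings as the originals, here is a minimal sketch. The sample strings and the names `long_form`, `short_form`, and `samples` are made up for illustration and are not from the question.

```python
import re

# Original, verbose alternatives from the question.
long_form = re.compile(
    r'Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07|SIGNATURES|SIGNATURE'
)
# Shortened alternatives suggested above.
short_form = re.compile(r'Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?')

# Hypothetical sample strings: some should match, some should not.
samples = [
    'Item 8.01', 'Item  9.01', 'Item 5.03', 'Item\n5.07',
    'SIGNATURE', 'SIGNATURES', 'Item 5.02', 'Item 7.01',
]

# For each sample, record whether (long form, short form) match it entirely.
results = {
    s: (bool(long_form.fullmatch(s)), bool(short_form.fullmatch(s)))
    for s in samples
}

for s, (long_ok, short_ok) in results.items():
    print(repr(s), long_ok, short_ok)
```

Both columns should agree for every sample, which is what you would expect if the two forms are equivalent.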

Now revise the pattern as indicated and replace each literal space in the sentence with \s+, which matches one or more whitespace characters, including line breaks:

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|Pursuant\s+to\s+the\s+requirements\s+of\s+the\s+Securities\s+Exchange\s+Act\s+of\s+1934)\b)'
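
If typing \s+ between every word by hand feels error-prone, one option is to build that part of the pattern from the plain sentence. The following is only a sketch: the `sample` text is invented for illustration, and `phrase`, `flexible_phrase`, and `extracted` are names introduced here, not part of the original code.

```python
import re

# The sentence as it would appear with normal spacing.
phrase = 'Pursuant to the requirements of the Securities Exchange Act of 1934'

# Escape each word and join the words with \s+ so that any run of
# spaces, tabs, or line breaks between them is accepted.
flexible_phrase = r'\s+'.join(re.escape(word) for word in phrase.split())

pattern = (
    r'\bItem\s+5\.02\s*([\w\W]*?)'
    r'(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|'
    + flexible_phrase
    + r')\b)'
)

# Hypothetical filing text with irregular spacing and line breaks
# inside the closing sentence.
sample = (
    "Item 5.02  Departure of Directors.\n"
    "Jane Doe resigned from the board.\n\n"
    "Pursuant to   the requirements\nof the Securities\n"
    "   Exchange Act of 1934, the registrant..."
)

match = re.search(pattern, sample, re.IGNORECASE)
extracted = match.group(1) if match else None
print(extracted)
```

The resulting `pattern` string is the same as the hand-written one above, so it can be passed to `str.extract` exactly as in the question.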

