Regex – How to account for random spacing/line breaks in a term

Question

I am using Python and have the following regular expression to extract text from text files: My issue is specifically with the last term, "Pursuant to the requirements of the Securities Exchange Act of 1934". In the text files, this sentence is sometimes spaced randomly and starts different parts of the sentence on new lines. How do I account for

Accepted Answer

First, note that your pattern is more verbose than it needs to be; some parts can be shortened:

    Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07  =>  Item\s+(?:[89]\.01|5\.0[37])
    SIGNATURES|SIGNATURE                       =>  SIGNATURES?

SIGNATURES? matches SIGNATURES or SIGNATURE, because S? matches one or zero S characters.

Now rework the pattern as indicated and replace each literal space in it with \s+, which matches one or more whitespace characters, including line breaks:

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|Pursuant\s+to\s+the\s+requirements\s+of\s+the\s+Securities\s+Exchange\s+Act\s+of\s+1934)\b)'
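To make the whitespace-tolerant approach concrete, here is a minimal Python sketch. It assumes the words of the closing sentence are separated by \s+ rather than literal spaces. The helper name flexible_phrase, the function extract_item_502, and the sample text are made up for illustration and are not part of the original question or answer.

```python
import re


def flexible_phrase(phrase):
    """Hypothetical helper: build a regex fragment that allows any run of
    whitespace (spaces, tabs, newlines) between the words of a phrase."""
    return r'\s+'.join(re.escape(word) for word in phrase.split())


# Closing sentence from the question, made tolerant of odd spacing.
closing = flexible_phrase(
    "Pursuant to the requirements of the Securities Exchange Act of 1934"
)

# Same shape as the pattern in the answer: capture everything after
# "Item 5.02" up to the next stop marker, without consuming the marker.
pattern = re.compile(
    r'\bItem\s+5\.02\s*([\w\W]*?)'
    r'(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|' + closing + r')\b)'
)


def extract_item_502(text):
    """Return the text captured after 'Item 5.02', or None if nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


# Made-up sample: the closing sentence is split across several lines.
sample = (
    "Item 5.02 Departure of Directors.\n"
    "The director resigned effective\n"
    "immediately.\n\n"
    "Pursuant to the   requirements\n"
    "of the Securities\n"
    "Exchange Act of 1934, the registrant has duly caused this report to be signed.\n"
)

extracted = extract_item_502(sample)
print(extracted)
```

Building the fragment from the plain sentence keeps the pattern readable and avoids typing \s+ by hand between every word; the hand-written form in the answer is equivalent.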
