Regex – How to account for random spacing/line breaks in a term

Question

I am using Python and have the following regular expression to extract text from text files: My issue is specifically with the last term, "Pursuant to the requirements of the Securities Exchange Act of 1934". In the text files, this sentence is sometimes spaced randomly and starts different parts …

Accepted Answer

First, note that your pattern is more verbose than it needs to be. You can shrink some parts:

    Item\s+[89]\.01|Item\s+5\.03|Item\s+5\.07  =>  Item\s+(?:[89]\.01|5\.0[37])
    SIGNATURES|SIGNATURE                       =>  SIGNATURES?

SIGNATURES? matches SIGNATURES or SIGNATURE, because S? matches one or zero S characters.

Now rework the pattern as indicated and replace each literal space in it with \s+, which matches one or more whitespace characters, including line breaks:

    pattern = r'\bItem\s+5\.02\s*([\w\W]*?)(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])|SIGNATURES?|Pursuant\s+to\s+the\s+requirements\s+of\s+the\s+Securities\s+Exchange\s+Act\s+of\s+1934)\b)'
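As a rough illustration of how the whitespace-tolerant pattern behaves, here is a minimal sketch. The sample text and the helper name extract_item_502 are made up for this example and are not from the original question; only the pattern itself comes from the answer.

```python
import re

# The pattern from the answer, split across lines for readability.
# \s+ between words lets the closing sentence match even when it is
# broken across lines or padded with extra spaces.
PATTERN = re.compile(
    r'\bItem\s+5\.02\s*([\w\W]*?)'
    r'(?=\s*(?:Item\s+(?:[89]\.01|5\.0[37])'
    r'|SIGNATURES?'
    r'|Pursuant\s+to\s+the\s+requirements\s+of\s+the'
    r'\s+Securities\s+Exchange\s+Act\s+of\s+1934)\b)'
)


def extract_item_502(text):
    """Hypothetical helper: return the Item 5.02 body, or None if absent."""
    match = PATTERN.search(text)
    return match.group(1) if match else None


# Made-up sample in which the closing sentence is irregularly spaced.
sample = (
    "Item 5.02 Departure of Directors.\n"
    "Jane Doe resigned from the board.\n\n"
    "Pursuant to the\n   requirements of the Securities\n"
    "Exchange Act of   1934, the registrant has duly caused this report "
    "to be signed."
)

section = extract_item_502(sample)
print(section)
```

Because the lookahead starts with \s*, the lazy capture group stops before any whitespace that precedes the terminating phrase, so the extracted text carries no trailing blank lines.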

Answer