
How to make my regex match stop after a lookahead?

I have some text from a PDF in one string. I want to break it up into a list where every string starts with a number and a period and stops before the next numbered line.

For example I want to turn this:

'3.1 First liens  15,209,670,396  0  15,209,670,396  14,216,703,858 
3.2 Other than first liens     0  0 
4. Real estate:
4.1 Properties occupied by  the company (less $  43,332,898 
encumbrances)  68,122,291  0  68,122,291  64,237,046 
4.2 Properties held for  the production of income (less 
$    encumbrances)       0  0 
4.3 Properties held for sale (less $  
encumbrances)      0  0 
5. Cash ($  (101,130,138)), cash equivalents 
($ 850,185,973 ) and short-term
 investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055 
6. Contract loans (including $   premium notes)  253,533,676  0  253,533,676  233,680,271 
7. Derivatives  3,194,189,871  0  3,194,189,871  2,390,781,023 
8. Other invested assets  749,074,191  11,899,360  737,174,831  692,916,503' 

Into this:

['3.1 First liens  15,209,670,396  0  15,209,670,396  14,216,703,858 ',
'3.2 Other than first liens     0  0 ',
'4. Real estate:',
'4.1 Properties occupied by  the company (less $  43,332,898 encumbrances)  68,122,291  0  68,122,291  64,237,046',
'4.2 Properties held for  the production of income (less $    encumbrances)       0  0',
'4.3 Properties held for sale (less $  encumbrances)      0  0',
'5. Cash ($  (101,130,138)), cash equivalents ($ 850,185,973 ) and short-term investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055',
'6. Contract loans (including $   premium notes)  253,533,676  0  253,533,676  233,680,271',
'7. Derivatives  3,194,189,871  0  3,194,189,871  2,390,781,023',
'8. Other invested assets  749,074,191  11,899,360  737,174,831  692,916,503']

The issue is that the original string has '\n' scattered in the middle of the titles (for example, in 4.1 there's a \n before the word encumbrances).

(\d+\.[\s\S]*(?!\d+\.))

This is the regex I’ve been trying to use but it matches the whole string instead of each number line. Is there any way for my regex to stop the match right before the next number line?


Answer

Something like:

import re

items = re.findall(r"^\d+\..*?(?=^\d+\.|\Z)", text, re.MULTILINE | re.DOTALL)

Further explanation on request.
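
A short walk-through, in case it helps. The pattern in the question matches everything because `[\s\S]*` is greedy: it runs to the end of the string, and there the negative lookahead `(?!\d+\.)` succeeds trivially since nothing follows. The version above uses a lazy `.*?` (with `re.DOTALL` so `.` also crosses newlines) and a positive lookahead that stops right before the next line that starts with digits and a period (`^` matches at each line start because of `re.MULTILINE`), or at the end of the text (`\Z`). The sketch below runs both on a shortened copy of the sample. The names `text`, `greedy`, `pattern` and `entries` are just illustrative, and removing the embedded newlines afterwards is my assumption about how to get the joined titles shown in the question.

```python
import re

# Shortened copy of the sample from the question (an excerpt, not the full text).
text = (
    "3.2 Other than first liens     0  0 \n"
    "4. Real estate:\n"
    "4.1 Properties occupied by  the company (less $  43,332,898 \n"
    "encumbrances)  68,122,291  0  68,122,291  64,237,046 \n"
    "4.2 Properties held for  the production of income (less \n"
    "$    encumbrances)       0  0 \n"
    "5. Cash ($  (101,130,138)), cash equivalents \n"
    "($ 850,185,973 ) and short-term\n"
    " investments ($ 0 )  749,055,835  0  749,055,835  1,867,997,055 "
)

# Pattern from the question: the greedy [\s\S]* swallows the whole string,
# so findall returns a single match.
greedy = re.findall(r"(\d+\.[\s\S]*(?!\d+\.))", text)

# Lazy body plus a lookahead for the next numbered line (or end of text).
pattern = re.compile(r"^\d+\..*?(?=^\d+\.|\Z)", re.MULTILINE | re.DOTALL)

# Each raw match still contains the stray newlines; drop them and trim the ends.
entries = [m.replace("\n", "").strip() for m in pattern.findall(text)]

for entry in entries:
    print(repr(entry))
```

One caveat: a wrapped continuation line that itself happens to start with digits followed by a period would be treated as the start of a new entry, so this relies on the layout looking like the sample.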

User contributions licensed under: CC BY-SA