I have some text from a PDF in one string. I want to break it up into a list where every string starts with a number and a period, and stops right before the next numbered item.
For example, I want to turn this:
'3.1 First liens 15,209,670,396 0 15,209,670,396 14,216,703,858 3.2 Other than first liens 0 0 4. Real estate: 4.1 Properties occupied by the company (less $ 43,332,898 encumbrances) 68,122,291 0 68,122,291 64,237,046 4.2 Properties held for the production of income (less $ encumbrances) 0 0 4.3 Properties held for sale (less $ encumbrances) 0 0 5. Cash ($ (101,130,138)), cash equivalents ($ 850,185,973 ) and short-term investments ($ 0 ) 749,055,835 0 749,055,835 1,867,997,055 6. Contract loans (including $ premium notes) 253,533,676 0 253,533,676 233,680,271 7. Derivatives 3,194,189,871 0 3,194,189,871 2,390,781,023 8. Other invested assets 749,074,191 11,899,360 737,174,831 692,916,503'
Into this:
['3.1 First liens 15,209,670,396 0 15,209,670,396 14,216,703,858 ', '3.2 Other than first liens 0 0 ', '4. Real estate:', '4.1 Properties occupied by the company (less $ 43,332,898 encumbrances) 68,122,291 0 68,122,291 64,237,046', '4.2 Properties held for the production of income (less $ encumbrances) 0 0', '4.3 Properties held for sale (less $ encumbrances) 0 0', '5. Cash ($ (101,130,138)), cash equivalents ($ 850,185,973 ) and short-term investments ($ 0 ) 749,055,835 0 749,055,835 1,867,997,055', '6. Contract loans (including $ premium notes) 253,533,676 0 253,533,676 233,680,271', '7. Derivatives 3,194,189,871 0 3,194,189,871 2,390,781,023', '8. Other invested assets 749,074,191 11,899,360 737,174,831 692,916,503']
The issue is that the original string has '\n' scattered in the middle of the titles (for example, in 4.1 there is a '\n' before the word "encumbrances").
(\d+\.[\s\S]*(?!\d+\.))
This is the regex I've been trying to use, but it matches the whole string instead of each numbered line. Is there any way for my regex to stop the match right before the next numbered line?
Answer
Something like:
import re
matches = re.findall(r"^\d+\..*?(?=^\d+\.|\Z)", text, re.MULTILINE | re.DOTALL)
Further explanation on request.
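For reference, the pattern in the question matches everything because `[\s\S]*` is greedy and runs to the end of the string, where the negative lookahead succeeds trivially. A lazy quantifier followed by a positive lookahead for the next item start avoids that. Here is a minimal sketch of the idea, assuming every numbered item begins at the start of a line; the sample `text` is a shortened, hypothetical version of the data in the question, and the whitespace cleanup at the end is an optional extra rather than part of the regex itself.

```python
import re

# Hypothetical sample: each numbered item starts on its own line,
# and a stray newline splits the title of item 4.1.
text = (
    "3.1 First liens 15,209,670,396 0\n"
    "3.2 Other than first liens 0 0\n"
    "4. Real estate:\n"
    "4.1 Properties occupied by the company (less $ 43,332,898\n"
    "encumbrances) 68,122,291 0\n"
    "5. Cash 749,055,835 0"
)

# ^\d+\.   an item starts at the beginning of a line with digits and a period
# .*?      lazily take as little as possible (DOTALL lets "." cross newlines)
# (?=...)  stop just before the next item start, or at the end of the string
pattern = re.compile(r"^\d+\..*?(?=^\d+\.|\Z)", re.MULTILINE | re.DOTALL)

# Collapse the embedded newlines and repeated spaces inside each item.
items = [" ".join(match.split()) for match in pattern.findall(text)]

for item in items:
    print(item)
```

If the numbered items do not reliably start on a new line, the `^` anchors will not find them and the lookahead would need a different boundary.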