
Python splitting text with line breaks into a list

I’m trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.

Excerpt from text:

I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt I

Currently I’m using this line to split each word into an item in the list:

text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) 
   for k in content.split(" ")]
print(text_list)

This code leaves spaces in the items and combines several words into a single item, as shown below.

Result:

['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
 'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']

I would like each word to be a separate item in the list, and to remove the string ' lt ' and any numbers from the list items.

Expected result:

['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
 'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']

Please help me resolve this issue.

Thanks


Answer

Since it looks like you're parsing HTML text, it's likely that all entities start with & and end with ;. Removing those first makes matching the rest quite easy.

import re

content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'

# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
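
The question also asks for numbers to be left out, while the pattern above keeps digits. A minimal sketch of a letters-only variant follows, assuming that digits should be discarded entirely; the helper name words_only and the sample string (with a made-up "17" in it) are only illustrative and not from the original text.

```python
import re

def words_only(text):
    """Return runs of ASCII letters, dropping entities, digits and punctuation."""
    # Remove HTML entities such as &lt; (same substitution as in the answer)
    text = re.sub(r'&[^ ]+?;', '', text)
    # Match letters only, so digits are never part of a result item
    return re.findall(r"[a-zA-Z]+", text)

# Hypothetical sample with a number added for illustration
sample = 'on the post-road between St. Petersburgh and Archangel, 17 &lt;the&lt; I'
print(words_only(sample))
```

Note that with this pattern a token that mixes digits and letters, such as "4th", would still contribute its letters ("th"), so it may need extra handling if such tokens occur.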

Note that 'between St ' likely ended up as a single item because the character between 'between' and 'St' probably isn't a regular space but a non-breaking space. If this were plain HTML, I'd expect to see &nbsp; or something similar there, but it's possible that your text contains a Unicode non-breaking space character (U+00A0) instead.

That doesn't matter with the code above, but it does with a solution that uses .split(" "), which only splits on the regular space character and won't treat that character as a separator. (Calling .split() with no argument splits on any whitespace, including the non-breaking space.)
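
To illustrate the difference, here is a small sketch, assuming the text really does contain a non-breaking space (U+00A0) between "between" and "St."; the sample string is made up for the demonstration.

```python
import re

# Assumed input: "\u00a0" (non-breaking space) sits between "between" and "St."
text = "road between\u00a0St. Petersburgh"

# Splitting on a literal " " leaves "between\xa0St." glued together
print(text.split(" "))

# With no argument, split() breaks on any Unicode whitespace, U+00A0 included
print(text.split())

# findall only collects letters and digits, so the separator does not matter
print(re.findall(r"[a-zA-Z0-9]+", text))
```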

In case the &lt without a closing ; is not a typo on your part but is actually present in the original text, this works as a replacement for the re.sub() statement:

content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)

Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
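
A short sketch of that variant in use, assuming the second entity in the input is missing its closing semicolon as in the question; the shortened sample string is only for illustration.

```python
import re

# Assumed input: the second entity lacks its closing semicolon
content = 'and Archangel. &lt;the&lt I'

# Remove "&name;" as well as "&name" when it is followed by a space
cleaned = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
text_list = re.findall(r"[a-zA-Z0-9]+", cleaned)
print(text_list)
```

Two caveats with this pattern: an unterminated entity at the very end of the string is not removed, because the lookahead needs a following space or ;, and an ordinary ampersand followed by letters (as in "Q&A session") is treated like an entity and stripped up to the next space.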

User contributions licensed under: CC BY-SA