I’m trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
Excerpt from the text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I
Currently I’m using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) for k in content.split(" ")]
print(text_list)
This code leaves spaces in the items and combines several words into a single list item, as shown below.
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the', 'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ‘ lt ‘ and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the', 'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks
Answer
Since it looks like you’re parsing HTML text, it’s likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re

content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'

# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)

# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
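The question also asks for numbers to be dropped, while the pattern above keeps digits. Here is a minimal sketch of a letters-only variant, assuming plain ASCII letters are all that is wanted. The function name `words_only` and the sample string are made up for illustration.

```python
import re

def words_only(text):
    """Return runs of ASCII letters in text, skipping HTML entities, digits and punctuation."""
    # Drop entities such as &lt; first, so their names do not show up as words.
    without_entities = re.sub(r'&[^ ]+?;', '', text)
    # Keep runs of letters only; digits, spaces and punctuation all act as separators.
    return re.findall(r'[a-zA-Z]+', without_entities)

# Hypothetical sample: part of the excerpt plus a number and some punctuation.
sample = 'on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I, 17--'
print(words_only(sample))
```

If accented or other non-ASCII letters should count as part of a word, the character class would need widening, which this sketch does not attempt.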
Note that 'between St ' likely ended up as one item because the character between 'between' and 'St' probably isn’t a regular space, but a non-breaking space. If this were just HTML, I’d expect there to be &nbsp; or something of the sort, but it’s possible that in your case there’s a Unicode non-breaking space character (U+00A0) there.
That should not matter with the code above, but a solution that uses .split(" ") won’t see that character as a space.
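To make the non-breaking space point concrete, here is a small sketch. It assumes a `'\u00a0'` character sits between two of the words, which is a guess about the input rather than something the question confirms.

```python
import re

# '\u00a0' is the Unicode non-breaking space, assumed here to sit between the two words.
text = 'between\u00a0St. Petersburgh'

# Splitting on a literal ' ' leaves the non-breaking space inside the first item.
split_items = text.split(' ')

# Matching runs of letters treats every non-letter, including '\u00a0', as a separator.
found_items = re.findall(r'[a-zA-Z]+', text)

print(split_items)
print(found_items)
```

In Python 3, calling `.split()` with no argument splits on any Unicode whitespace, which includes `'\u00a0'`, so only the explicit `.split(' ')` form is affected.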
In case the &lt without a closing ; is not your mistake, but is in the original text, this works as a replacement for the re.sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
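The same expression can be written with `re.VERBOSE` so that each part carries its own comment. This is only a sketch: the names `ENTITY` and `sample` and the sample text (with one unterminated `&lt`) are invented for illustration.

```python
import re

# The pattern from the answer, spelled out piece by piece.
# In verbose mode bare whitespace is ignored, but a space inside [...] still counts.
ENTITY = re.compile(r"""
    &            # literal ampersand that starts the entity
    [^ ;]+?      # one or more characters that are neither a space nor a semicolon, as few as possible
    (?=[ ;])     # only if a space or a semicolon comes next (lookahead, consumes nothing)
    ;?           # consume the semicolon when there is one
""", re.VERBOSE)

# Hypothetical input: the second entity is missing its closing semicolon.
sample = 'Archangel. &lt;the&lt I &amp; more'

# The simpler pattern only removes entities that end in ';', so 'lt' survives as a word.
simple_words = re.findall(r'[a-zA-Z]+', re.sub(r'&[^ ]+?;', '', sample))

# The lookahead version also removes the unterminated entity.
robust_words = re.findall(r'[a-zA-Z]+', ENTITY.sub('', sample))

print(simple_words)
print(robust_words)
```

An unterminated entity at the very end of the text is followed by neither a space nor a `;`, so this pattern would leave it in place.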