Python splitting text with line breaks into a list

Question

I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want a list with each word as an item, without any special characters, numbers, or spaces. Excerpt from text: Currently I'm using this line to split each word…

Accepted Answer

Since it looks like you're parsing HTML text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.

    import re

    content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. &lt;the&lt; I'

    # first, remove entities, the question mark makes sure the expression isn't too greedy
    content = re.sub(r'&[^ ]+?;', '', content)

    # then just match anything that meets your rules
    text_list = re.findall(r"[a-zA-Z0-9]+", content)
    print(text_list)

Note that 'St Petersburg' likely got matched together because the character between the 't' and 'P' probably isn't a space, but a non-breaking space. If this were just HTML, I'd expect there to be &nbsp; or something of the sort, but it's possible that in your case there's a Unicode non-breaking space character (U+00A0) there.

That should not matter with the code above, but a solution that splits on a literal space, such as .split(' '), won't see that character as a space.

In case the &lt is not a typo on your part but is present in the original text, this works as a replacement for the .sub() statement:

    content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)

Clearly a bit more complicated. It substitutes any string that matches these parts in order:

    &          starts with &
    [^ ;]+?    followed by one or more characters that are not a space or ;, taking as few as possible
    (?=[ ;])   but only if they are then followed by a space or a ;
    ;?         and in that case that ; is also matched
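To see how the pieces of this answer fit together, here is a minimal sketch, not the answerer's exact code. The sample string, the names ENTITY and extract_words, and the keep_digits flag are all assumptions made up for illustration. The sample deliberately contains a non-breaking space (U+00A0) and an &lt with no closing ; so that both cases discussed above show up.

```python
import re

# Hypothetical sample: "\u00a0" is a non-breaking space, and the second
# "&lt" is missing its closing ";".
sample = "St.\u00a0Petersburgh and Archangel. &lt;the&lt I have 2 dogs"

# The more lenient entity pattern from the answer: "&", then characters that
# are neither a space nor ";", then an optional ";".
ENTITY = re.compile(r"&[^ ;]+?(?=[ ;]);?")


def extract_words(text, keep_digits=True):
    """Strip entity-like tokens, then collect runs of ASCII letters (and digits)."""
    cleaned = ENTITY.sub("", text)
    pattern = r"[a-zA-Z0-9]+" if keep_digits else r"[a-zA-Z]+"
    return re.findall(pattern, cleaned)


words = extract_words(sample)
letters_only = extract_words(sample, keep_digits=False)

# Splitting on a literal space leaves the non-breaking space inside one item.
by_space = sample.split(" ")

print(words)
print(letters_only)
print(repr(by_space[0]))
```

Because findall only collects the characters it is told to keep, the non-breaking space separates "St" and "Petersburgh" just like any other unwanted character, whereas splitting on a literal space keeps them glued together. Passing keep_digits=False is one way to drop numbers as well, which is what the question asks for. One limitation of this sketch: the lookahead needs a space or ; after the entity, so a bare &lt at the very end of the text would not be removed.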

Answer