I am trying to split inputted text at spaces, and all special characters like punctuation, while keeping the delimiters. My re pattern works exactly the way I want except that it will not split multiple instances of the punctuation.
Here is my re pattern:
wordsWithPunc = re.split(r'([^\w]+)', words)
If I have a word like “hello” with two punctuation marks after it then those punctuation marks are split but they remain as the same element. For example
"hello,-"
will equal "hello",",-"
but I want it to be "hello",",","-"
Another example. My name is mud!!!
would be split into "My","name","is","mud","!!!"
but I want it to be "My","name","is","mud","!","!","!"
Answer
You need to drop the + quantifier so that the pattern matches one non-word character at a time instead of a whole run of them, something like:
import re
words = 'My name is mud!!!'
splitted = re.split(r'([^\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']
This also produces empty strings between adjacent non-word characters (because you're splitting on each of them), but you can remove them by post-processing the result:
splitted = [match for match in re.split(r'([^\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']
You can further filter out whitespace in the list comprehension (i.e. ... if match.strip() ...) if you want to get rid of the space matches as well.
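Putting those pieces together, here is a minimal sketch of the whole thing. The function names split_keep_punct and tokenize are made up for illustration, and the second function uses re.findall instead of re.split, which is a different technique from the one above but should give the same tokens for inputs like these:

```python
import re

def split_keep_punct(text):
    # Split on every single non-word character; the capture group keeps
    # each delimiter in the result list.
    parts = re.split(r'([^\w])', text)
    # match.strip() is falsy for both empty strings and whitespace-only
    # tokens, so this drops the '' entries and the spaces in one pass.
    return [part for part in parts if part.strip()]

def tokenize(text):
    # Alternative (not from the answer above): collect runs of word
    # characters, or any single character that is neither a word
    # character nor whitespace.
    return re.findall(r'\w+|[^\w\s]', text)

print(split_keep_punct('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
print(split_keep_punct('hello,-'))
# ['hello', ',', '-']
```

Note that \w includes the underscore, so something like snake_case stays as one token with either approach.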