Why is Python re not splitting multiple instances of punctuation?

I am trying to split input text at spaces and at all special characters such as punctuation, while keeping the delimiters. My re pattern works exactly the way I want, except that it will not split multiple consecutive punctuation marks. Here is my re pattern: wordsWithPunc = re.split(r'([^-\w]+)', words)

If I have a word like “hello” with two punctuation marks after it then those punctuation marks are split but they remain as the same element. For example "hello,-" will equal "hello",",-" but I want it to be "hello",",","-"

Another example: "My name is mud!!!" would be split into "My","name","is","mud","!!!" but I want it to be "My","name","is","mud","!","!","!"

Answer

You need to remove the + quantifier from your pattern so that it matches one non-word character at a time instead of a whole run of them, something like:

import re

words = 'My name is mud!!!'
splitted = re.split(r'([^-\w])', words)
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '', '!', '', '!', '']

This will also produce empty strings between adjacent non-word characters (because you're splitting on each of them, and nothing lies between two consecutive delimiters), but you can mitigate that by post-processing the result to remove the empty strings:

splitted = [match for match in re.split(r'([^-\w])', words) if match]
# ['My', ' ', 'name', ' ', 'is', ' ', 'mud', '!', '!', '!']

You can further filter on the stripped value in the list comprehension (i.e. ... if match.strip() ...) if you want to get rid of the whitespace-only matches as well.
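
Putting the pieces above together, here is a minimal sketch of the whole approach. The function name split_keep_punctuation is only an illustrative choice, not something from the question, and the sketch assumes that hyphens should stay attached to words, as the [^-\w] character class implies.

```python
import re

def split_keep_punctuation(text):
    # Hypothetical helper name, used for illustration only.
    # Split at every single character that is neither a word character
    # nor a hyphen; the capturing group keeps each delimiter in the result.
    pieces = re.split(r'([^-\w])', text)
    # Drop empty strings and whitespace-only pieces.
    return [piece for piece in pieces if piece.strip()]

print(split_keep_punctuation('My name is mud!!!'))
# ['My', 'name', 'is', 'mud', '!', '!', '!']
```

Because the hyphen is excluded from the delimiter class, a word such as well-known would stay in one piece. If hyphens should be split off too, changing the class to [^\w] is one possible adjustment.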

User contributions licensed under: CC BY-SA