Python regex to remove punctuation except from URLs and decimal numbers

People,

I need a regex to remove punctuation from a string, but keep the accents and URLs. I also have to keep the mentions and hashtags from that string.

I tried the code below, but it strips the accents from the characters, and I want to keep them.

import unicodedata

if __name__ == "__main__":
    text = "Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow http://xyhdhz.com.br" 
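    # NFKD splits each accented letter into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then drops the marks, so the accents are lost.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    print(text)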

The output for the following text “Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow http://xyhdhz.com.br” should be “Apenas um teste com acentuação Para pontuação também #python @stackoverflow http://xyhdhz.com.br”

How could I do that?

Answer

You can use Python’s re module and re.sub() to replace any characters you want to get rid of. You can either use a blacklist and replace every character you don’t want, or use a whitelist of the characters you want to allow and keep only those.

This will remove anything in the bracketed class of characters:

import re

test = r'#test.43&^%à, è, ì, ò, ù, À, È, Ì, Ò, ÙÃz'
out = re.sub(r'[/.,!$%^&*()]', '', test)
print(out)
# Out: #test43à è ì ò ù À È Ì Ò ÙÃz

(tested with Python 3.5)
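
For comparison, here is a minimal sketch of the whitelist approach, assuming Python 3, where \w matches accented letters in a str. The choice to keep '#' and '@' (for hashtags and mentions) is an assumption based on the question, and the NFC normalization step is a precaution of mine, not part of the original answer.

```python
import re
import unicodedata

text = "Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow"

# Normalize to NFC so each accented letter is a single code point that \w matches.
text = unicodedata.normalize('NFC', text)

# Whitelist: keep word characters, whitespace, '#' and '@'; drop everything else.
cleaned = re.sub(r'[^\w\s#@]', '', text)
print(cleaned)
```

Note that \w also keeps digits and the underscore, so this removes punctuation without having to list every symbol.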

To keep URLs you will have to do a little more processing to check for that format (which is pretty varied). What kind of input/output are you looking for in that case?
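
One possible way to do that extra processing is sketched below. It is only a sketch under stated assumptions: a URL is taken to be any run of non-whitespace characters starting with http:// or https://, a decimal number is digits separated by a single '.' or ',', and the function name strip_punctuation is hypothetical. The pattern matches a URL or decimal first and puts it back unchanged; any other punctuation character is replaced with an empty string.

```python
import re

# Group 1 captures things to keep as-is (assumed URL and decimal formats);
# the second alternative matches one punctuation character to drop.
PATTERN = re.compile(r'(https?://\S+|\d+[.,]\d+)|[^\w\s#@]')


def strip_punctuation(text):
    """Remove punctuation but keep URLs, decimals, hashtags and mentions."""
    # If group 1 matched, return it unchanged; otherwise delete the character.
    return PATTERN.sub(lambda m: m.group(1) or '', text)


sample = ("Apenas um teste com acentuação. Para pontuação também! "
          "#python @stackoverflow http://xyhdhz.com.br")
print(strip_punctuation(sample))
```

Because the URL alternative consumes everything up to the next whitespace, punctuation that directly follows a URL (such as a trailing period) is kept with it; real-world input may need a stricter URL pattern.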

edit: based on your added input example:

test = "Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow"
out = re.sub(r'[/.,!$%^&*()]', '', test)
print(out)
# Out: Apenas um teste com acentuação Para pontuação também #python @stackoverflow
User contributions licensed under: CC BY-SA