People,
I need a regex to remove punctuation from a string, but keep the accents and URLs. I also have to keep the mentions and hashtags from that string.
I tried the code below, but it strips the accents from the characters, and I want to keep them.
import unicodedata

if __name__ == "__main__":
    text = "Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow http://xyhdhz.com.br"
    text = unicodedata.normalize('NFKD', text).encode('ascii','ignore')
    print text
The output for the following text “Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow http://xyhdhz.com.br” should be “Apenas um teste com acentuação Para pontuação também #python @stackoverflow http://xyhdhz.com.br”
How could I do that?
Answer
You can use Python’s re module and re.sub() to replace any characters you want to get rid of. You can either use a blacklist and replace all the characters you don’t want, or use a whitelist of all the characters you want to allow and only keep those.
This will remove anything in the bracketed class of characters:
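import re

test = r'#test.43&^%à, è, ì, ò, ù, À, È, Ì, Ò, ÙÃz'
# Remove every character listed in the bracketed class (accented letters are not listed, so they stay)
out = re.sub(r'[/.,!$%^&*()]', '', test)
print(out)
# Out: #test43à è ì ò ù À È Ì Ò ÙÃz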
(tested with Python 3.5)
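For comparison, here is a minimal sketch of the whitelist variant mentioned above. It assumes Python 3, where \w in a str pattern also matches accented letters. The function name keep_allowed and the choice of allowed characters (word characters, whitespace, # and @) are illustrative assumptions, not something from the question.

```python
import re
import unicodedata

def keep_allowed(text):
    """Hypothetical helper: keep word characters, whitespace, '#' and '@'."""
    # Compose accents into single code points first, so that \w matches them
    # (a standalone combining mark is not a word character).
    text = unicodedata.normalize('NFC', text)
    # Delete every character that is NOT in the allowed set.
    return re.sub(r'[^\w\s#@]', '', text)

print(keep_allowed("Para pontuação também! #python @stackoverflow"))
```

Note that this keeps the underscore as well, since \w includes it, and that it would still break URLs because ':', '/' and '.' are not in the allowed set.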
To keep URLs you will have to do a little more processing to check for that format (which is pretty varied). What kind of input/output are you looking for in that case?
edit: based on your added input example:
test = "Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow"
out = re.sub(r'[/.,!$%^&*()]', '', test)
print(out)
# Out: Apenas um teste com acentuação Para pontuação também #python @stackoverflow
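One possible way to also keep URLs, sketched under a few assumptions: a URL is taken to be anything that starts with http:// or https:// and runs up to the next whitespace, and everything else is filtered with a whitelist of word characters, whitespace, # and @. The names URL_OR_PUNCT and strip_punctuation are made up for this example.

```python
import re

# First alternative: a URL (assumed to start with http:// or https:// and to
# end at the next whitespace). Second alternative: any single character that
# is not a word character, whitespace, '#' or '@'.
URL_OR_PUNCT = re.compile(r'(https?://\S+)|[^\w\s#@]')

def strip_punctuation(text):
    """Hypothetical helper: drop punctuation, keep accents, tags and URLs."""
    # If the URL group matched, put it back unchanged; otherwise the match is
    # a single unwanted character, which is replaced with an empty string.
    return URL_OR_PUNCT.sub(lambda m: m.group(1) or '', text)

text = ("Apenas um teste com acentuação. Para pontuação também! "
        "#python @stackoverflow http://xyhdhz.com.br")
print(strip_punctuation(text))
```

Because the URL alternative is tried first at each position, the dots and slashes inside the link are left alone. One limitation of this simple pattern is that punctuation glued to the end of a URL (for example a trailing full stop) is treated as part of the URL and kept.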