
Python regex to remove punctuation except from URLs and decimal numbers

People,

I need a regex to remove punctuation from a string, but keep the accents and URLs. I also have to keep the mentions and hashtags from that string.

I tried the code below, but unfortunately it also replaces the accented characters, which I want to keep.


The output for the following text “Apenas um teste com acentuação. Para pontuação também! #python @stackoverflow http://xyhdhz.com.br” should be “Apenas um teste com acentuação Para pontuação também #python @stackoverflow http://xyhdhz.com.br”

How could I do that?


Answer

You can use Python’s built-in re module and re.sub() to replace any characters you want to get rid of. You can either use a blacklist and replace all the characters you don’t want, or use a whitelist of the characters you want to allow and keep only those.

This will remove anything in the bracketed class of characters:


(tested with Python 3.5)
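As a minimal sketch of the two approaches (the patterns and variable names here are illustrative assumptions, not necessarily the ones used in this answer), a blacklist lists the characters to delete, while a whitelist deletes everything that is not a word character or whitespace. In Python 3, \w is Unicode-aware for str patterns, so accented letters are kept:

```python
import re

text = "Apenas um teste com acentuação. Para pontuação também!"

# Blacklist: delete every character listed in the bracketed class.
blacklisted = re.sub(r"[.,;:!?\"'()\[\]{}]", "", text)

# Whitelist: delete anything that is NOT a word character or whitespace.
# For str patterns in Python 3, \w also matches accented letters.
whitelisted = re.sub(r"[^\w\s]", "", text)

print(blacklisted)
print(whitelisted)
```

Note that neither pattern protects URLs, hashtags or mentions; both would strip the punctuation inside them.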

To keep URLs you will have to do a little more processing to check for that format (which is pretty varied). What kind of input/output are you looking for in that case?

edit: based on your added input example:

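One way to handle that input, sketched under assumptions: match the tokens that should survive (URLs, @mentions, #hashtags, and decimal numbers as mentioned in the title) in a capturing group, and let a second alternative match any other punctuation character. A replacement function then puts protected tokens back unchanged and drops everything else. The pattern, the simplified URL rule (https?:// followed by non-whitespace) and the name strip_punctuation are hypothetical choices for illustration:

```python
import re

# First alternative (group 1): tokens to protect. Second alternative: any
# single character that is neither a word character nor whitespace.
PATTERN = re.compile(r"(https?://\S+|[@#]\w+|\d+[.,]\d+)|[^\w\s]")


def strip_punctuation(text):
    # Group 1 is set only for protected tokens; those are returned as-is.
    # Any other match is a punctuation character and is replaced by "".
    return PATTERN.sub(lambda m: m.group(1) or "", text)


sample = ("Apenas um teste com acentuação. Para pontuação também! "
          "#python @stackoverflow http://xyhdhz.com.br")
print(strip_punctuation(sample))
```

Because the URL rule takes everything up to the next whitespace, punctuation that directly follows a URL (for example a sentence-ending period) is kept as part of it; real-world URL formats vary enough that this rule may need tightening.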
User contributions licensed under: CC BY-SA