For a data science project I am tasked with cleaning up our Twitter data. The tweets contain unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.
I am using the Python module “re” and so far I was successful in removing “simple” unicodes like \u201c (double quotation mark) with something like
text = re.sub(u'\u201c', '', text)
However, when I am trying to remove more complex structures, like for example
text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove french flag emoji
nothing is happening, no matter if I prefix the string with an ‘u’, an ‘r’ or nothing at all. The unicode remains in the string.
EDIT: Thanks to @Shawn Shroyer’s answer I found out that
text = re.sub(u'\\ud83d\\udcf8', '', text)
works fine! I just had to escape the backslashes. Now only my second problem remains (see below).
The second problem is that I don’t want to specify every single emoji individually. Instead I would like to remove them all in a much simpler fashion, but without removing ALL unicode characters, because I need to retain stuff like \u2019 (single quotation mark).
Answer
My suggestion would be to create a list of the values you would like to replace. You need to escape the backslashes by adding another backslash, or put ‘r’ before your string (a raw string in Python 3; the ‘ur’ prefix exists only in Python 2) so the backslashes do not need to be escaped.
import re
# doubled backslashes so the regex matches the literal text \ud83d\udcf8
to_remove_arr = [u"\\ud83d\\udcf8", u"\\ud83c\\uddeb\\ud83c\\uddf7"]
pattern_str = "|".join(to_remove_arr)
text = re.sub(pattern_str, "", text)
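As a variation, the escaping can be left to re.escape. This is only a sketch, assuming (as the question's edit suggests) that the tweets hold the escape sequences as literal text, that is a backslash followed by "ud83d", rather than the decoded emoji. The sample tweet and the variable names are made up for illustration.

```python
import re

# Assumption: the data contains the escapes as literal characters
# (backslash + "ud83d..."), not the decoded emoji itself.
text = r"Nice shot \ud83d\udcf8 from Paris \ud83c\uddeb\ud83c\uddf7 \u201chello\u201d"

# Raw strings hold the sequences exactly as they appear in the data.
to_remove = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]

# re.escape escapes each backslash so it is matched literally by the regex.
pattern = re.compile("|".join(re.escape(s) for s in to_remove))

cleaned = pattern.sub("", text)
print(cleaned)
```

The quotation mark escapes are not in the list, so they stay in the string.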
Edit: the above solution will remove specific unicode characters – to remove all non-ASCII Unicode characters:
text = text.encode("ascii", "ignore").decode()
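Keep in mind that this is all-or-nothing, so it also drops characters the question wants to keep, such as \u2019. A quick check, assuming the string holds decoded characters rather than literal escape text (the sample string is invented):

```python
# Assumption: `text` holds real (decoded) Unicode characters.
text = "It\u2019s a camera \U0001F4F8"

# Encode to ASCII, silently skipping anything that does not fit.
ascii_only = text.encode("ascii", "ignore").decode()

# Both the emoji and the typographic apostrophe (U+2019) are gone.
print(ascii_only)
```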
Edit: to remove only emojis I found:
def strip_emoji(text):
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)
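Two caveats with the ranges above: they do not include the regional indicator symbols (U+1F1E6 to U+1F1FF) that flag emojis are built from, and they only match decoded characters, not literal escape text like \ud83d\udcf8. The following is a rough sketch of both cases, not a complete emoji matcher. The names EMOJI_CHARS and ESCAPED_SURROGATE_PAIR and the sample strings are hypothetical. The second pattern relies on the fact that a high surrogate (d800 to dbff) followed by a low surrogate (dc00 to dfff) always encodes a character outside the Basic Multilingual Plane, so it removes all such characters, not only emojis, and it leaves emojis inside the BMP (for example \u2600) alone.

```python
import re

# Case 1 (assumption): the text holds decoded characters.
# Same ranges as above plus the regional indicators used for flags.
EMOJI_CHARS = re.compile(
    "[\u2600-\u27BF"
    "\U0001F1E6-\U0001F1FF"
    "\U0001F300-\U0001F64F"
    "\U0001F680-\U0001F6FF]+"
)

# Case 2 (assumption): the text holds literal escape sequences.
# Matches an escaped high surrogate followed by an escaped low surrogate.
ESCAPED_SURROGATE_PAIR = re.compile(
    r"\\ud[89ab][0-9a-f]{2}\\ud[c-f][0-9a-f]{2}", re.IGNORECASE
)

decoded = "Bonjour \U0001F1EB\U0001F1F7 it\u2019s sunny \u2600"
decoded_clean = EMOJI_CHARS.sub("", decoded)
print(decoded_clean)

escaped = r"Bonjour \ud83c\uddeb\ud83c\uddf7 it\u2019s nice \ud83d\udcf8"
escaped_clean = ESCAPED_SURROGATE_PAIR.sub("", escaped)
print(escaped_clean)
```

In both cases the single quotation mark (\u2019) survives, which is the behaviour the question asks for.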