Remove unicode encoded emojis from Twitter tweet




For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using the Python package “re”, and so far I have been successful in removing “simple” Unicode characters like \u201c (left double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I am trying to remove more complex structures, like for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove French flag emoji

nothing happens, whether I prefix the string with a ‘u’, an ‘r’ or nothing at all. The Unicode sequence remains in the string.

EDIT: Thanks to @Shawn Shroyer’s answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).

The second problem is that I don’t want to have to specify every single emoji individually. Instead, I would like to remove them all in a much simpler fashion, but without removing ALL Unicode characters, because I need to retain stuff like \u2019 (right single quotation mark).

Answer

My suggestion would be to create a list of the sequences you would like to remove. You need to escape the backslashes by adding another backslash, or put ‘r’ (‘ur’ in Python 2) before your string so the backslashes do not need to be escaped.

import re
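# raw strings, so the regex engine sees the backslashes (Python 3 does not accept the 'ur' prefix)
to_remove_arr = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]
pattern_str = "|".join(to_remove_arr)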
text = re.sub(pattern_str, "", text)
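
Whether a pattern like the one above matches depends on what the string actually holds: the literal text backslash-u-d83d (as in an undecoded dump) or the decoded emoji character itself. Here is a minimal sketch that sidesteps the escaping question by running every target through re.escape, so each one is matched exactly as written. The helper name remove_sequences and the sample tweets are made up for illustration.

```python
import re

def remove_sequences(text, sequences):
    # Longest first, so a long sequence is never shadowed by a shorter prefix
    ordered = sorted(sequences, key=len, reverse=True)
    # re.escape makes each target match literally, backslashes included
    pattern = re.compile("|".join(re.escape(s) for s in ordered))
    return pattern.sub("", text)

# Case 1: the tweet holds literal escape text (raw strings keep the backslashes)
raw_tweet = r"Photo \ud83d\udcf8 from Paris \ud83c\uddeb\ud83c\uddf7"
raw_targets = [r"\ud83d\udcf8", r"\ud83c\uddeb\ud83c\uddf7"]
cleaned_raw = remove_sequences(raw_tweet, raw_targets)

# Case 2: the tweet holds the decoded characters themselves
decoded_tweet = "Photo \U0001F4F8 from Paris \U0001F1EB\U0001F1F7"
decoded_targets = ["\U0001F4F8", "\U0001F1EB\U0001F1F7"]
cleaned_decoded = remove_sequences(decoded_tweet, decoded_targets)
```

Both calls leave "Photo  from Paris " behind. If neither form matches your data, printing repr(text) should show which representation you really have.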

Edit: the solution above removes only the specific sequences listed. To remove all non-ASCII Unicode characters:

text = text.encode("ascii", "ignore").decode()
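
A quick check of what that one-liner does to the cases from the question, as a sketch in Python 3. The function name strip_non_ascii and the sample strings are only for illustration.

```python
def strip_non_ascii(text):
    # Characters that cannot be encoded as ASCII are silently dropped
    return text.encode("ascii", "ignore").decode("ascii")

# Decoded text: the emoji goes, but so does the \u2019 apostrophe
decoded_sample = "It\u2019s a photo \U0001F4F8"
stripped_decoded = strip_non_ascii(decoded_sample)

# Literal escape text is already pure ASCII, so nothing is removed
literal_sample = r"It\u2019s a photo \ud83d\udcf8"
stripped_literal = strip_non_ascii(literal_sample)
```

Here stripped_decoded comes out as "Its a photo ", so this approach does not retain \u2019, and stripped_literal is unchanged.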

Edit: to remove only emojis I found:

def strip_emoji(text):
    # ranges: misc symbols and dingbats, pictographs and emoticons, transport and map symbols
    RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
    return RE_EMOJI.sub(r'', text)
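
The ranges above only match decoded characters, so if the tweets still contain literal escape text, it has to be decoded first. Below is a rough sketch of that two-step approach for Python 3. The names decode_escapes and strip_emoji_from_raw are hypothetical. The regional-indicator range (U+1F1E6 to U+1F1FF, used for flag pairs) is my own addition, since the three ranges above do not cover flags. It assumes every backslash-u sequence in the text is meant as an escape.

```python
import re

ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")
EMOJI = re.compile(
    "[\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"   # regional indicators (flag pairs); added here as an assumption
    "\U0001F300-\U0001F64F"   # pictographs and emoticons
    "\U0001F680-\U0001F6FF]"  # transport and map symbols
)

def decode_escapes(text):
    # Turn each literal \uXXXX into the UTF-16 code unit it names
    units = ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Round-trip through UTF-16 so surrogate pairs merge into single code points
    return units.encode("utf-16", "surrogatepass").decode("utf-16", "surrogatepass")

def strip_emoji_from_raw(text):
    return EMOJI.sub("", decode_escapes(text))

raw = r"It\u2019s a photo \ud83d\udcf8 from \ud83c\uddeb\ud83c\uddf7!"
result = strip_emoji_from_raw(raw)
```

With this input, result keeps the \u2019 apostrophe and loses both the camera and the flag. The ranges are not a complete emoji list, so newer emoji blocks would need to be added in the same way.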


Source: stackoverflow