Remove Unicode-encoded emojis from Twitter tweets

Question

For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-encoded emojis (and other stuff) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example. I am using the Python package "re", and so far I was successful in removing "simple" Unicode escapes like \u201c (double quotation mark) with something

Accepted Answer

My suggestion would be to create an array of the values you would like to replace. You need to escape the \ by adding another backslash, or add 'ur' before your string (a Python 2-only prefix) so backslashes do not need to be escaped.

    import re

    to_remove_arr = [u"\ud83d\udcf8", u"\ud83c\uddeb\ud83c\uddf7"]
    pattern_str = "|".join(to_remove_arr)
    text = re.sub(pattern_str, "", text)

Edit: the above solution will remove specific Unicode characters. To remove all non-ASCII Unicode characters:

    text = text.encode("ascii", "ignore").decode()

Edit: to remove only emojis I found:

    def strip_emoji(text):
        RE_EMOJI = re.compile(u'([\U00002600-\U000027BF])|([\U0001f300-\U0001f64F])|([\U0001f680-\U0001f6FF])')
        return RE_EMOJI.sub(r'', text)
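Which of these approaches applies depends on what the tweet text actually contains. As a minimal sketch for Python 3, the block below separates two cases. The first assumes the text holds the literal characters backslash, "u" and four hex digits, as in the question. The second assumes the text is already decoded into real characters. The names ESCAPED_SURROGATE_PAIR, DECODED_EMOJI, strip_escaped_pairs and strip_decoded_emoji are hypothetical, chosen only for illustration.

```python
import re

# Assumption: the text contains *literal* escape sequences (backslash, "u",
# four hex digits), e.g. because it came from JSON that was never decoded.
# A high surrogate (\ud800-\udbff) followed by a low surrogate (\udc00-\udfff)
# encodes one character outside the Basic Multilingual Plane. That covers
# most emojis, but also other non-emoji symbols.
ESCAPED_SURROGATE_PAIR = re.compile(
    r"\\ud[89ab][0-9a-f]{2}\\ud[c-f][0-9a-f]{2}", re.IGNORECASE
)

# Same code point ranges as strip_emoji, merged into one character class and
# compiled once at import time. Intended for already-decoded text.
DECODED_EMOJI = re.compile(
    "[\U00002600-\U000027BF\U0001F300-\U0001F64F\U0001F680-\U0001F6FF]"
)


def strip_escaped_pairs(text):
    """Remove literal \\uXXXX\\uXXXX surrogate-pair escapes from text."""
    return ESCAPED_SURROGATE_PAIR.sub("", text)


def strip_decoded_emoji(text):
    """Remove decoded characters that fall in the three ranges above."""
    return DECODED_EMOJI.sub("", text)


raw = r"Say cheese \ud83d\udcf8 in \ud83c\uddeb\ud83c\uddf7 \u201cParis\u201d"
print(strip_escaped_pairs(raw))
print(strip_decoded_emoji("Say cheese \U0001F4F8!"))
```

The raw-string pattern matches a literal backslash, which is the "escape the backslash" point made above. Simple escapes such as \u201c are left alone because they are not surrogate pairs. Note that the three decoded ranges do not include the regional indicator symbols used for flags (U+1F1E6 to U+1F1FF), so a decoded French flag would pass through strip_decoded_emoji unchanged unless that range is added.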

Answer