I applied all the preprocessing steps, but I want to delete the rows that contain English words or specific symbols. I only want Arabic words, without the symbols or English words handled in the code below. I ran the code, but when I print the dataset after cleaning, it is still not cleaned. I want to remove those rows, not replace their contents.
import re
import pandas as pd

lexicon = pd.read_csv(r"C:\Users\User\Python Code\data.csv")
lexicon.head(10)
#output
Vocabulary
0 [PAD]
1 [UNK]
2 [CLS]
3 [SEP]
4 [MASK]
5 !
6 #
7 $
8 %
9 &
lexicon['clean_tweet'] = lexicon.Vocabulary.str.replace(r'[^\w\s#@/:%.,_-]', '', regex=True, flags=re.UNICODE) #removes emojis
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'@[_A-Za-z0-9]+', '', regex=True) #removes handles
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[A-Za-z0-9]+', '', regex=True) #removes english
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('#', ' ', regex=False) #removes hashtag symbol only
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'http\S+', '', regex=True).str.replace(r'www\S+', '', regex=True) #removes links
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'\d+', '', regex=True) #removes numbers
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('\n', ' ', regex=False) #removes new line
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('_', '', regex=False) #removes underscore
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\w\s]', '', regex=True) #removes punctuation
lexicon.head(10)
# Vocabulary clean_tweet
0 [PAD]
1 [UNK]
2 [CLS]
3 [SEP]
4 [MASK]
5 !
6 #
7 $
8 %
9 &
I want to remove all rows that contain these symbols or words in any other language; I only need Arabic words. Or is there another simple way to detect only the Arabic words?
Note: if a row contains both Arabic words and symbols, I want to delete only the symbols and keep the Arabic words.
Answer
Going by this SO answer, a Unicode regex range for Arabic letters is:
[\u0627-\u064a]
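We can try using the negated version of this character class along with str.replace (passing regex=True, since newer pandas versions treat the pattern as a literal string by default):
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\u0627-\u064a]', '', regex=True)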
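To see what the negated character class does, here is a minimal sketch on a small made-up frame. The `sample` DataFrame and its values are hypothetical stand-ins for the real `data.csv`; only the column names follow the question.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from data.csv.
sample = pd.DataFrame({"Vocabulary": ["[PAD]", "!", "مرحبا", "hello", "سلام!"]})

# Delete every character outside the Arabic letter range U+0627..U+064A.
# regex=True is explicit so the pattern is not treated as a literal string.
sample["clean_tweet"] = sample["Vocabulary"].str.replace(
    r"[^\u0627-\u064a]", "", regex=True
)

print(sample["clean_tweet"].tolist())
```

Rows with no Arabic letters end up as empty strings, while a mixed row such as the last one keeps only its Arabic letters.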
If you want to spare whitespace characters or other punctuation symbols, then you could try using this regex:
[^\u0627-\u064a\s!?.-]
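Since the question asks to remove rows rather than blank them out, one possible follow-up is to filter out rows whose cleaned text is empty. This is a sketch under assumptions: the `sample` frame is made up, whitespace is spared so multi-word entries survive, and treating missing values as empty is a choice made here, not something stated in the question.

```python
import pandas as pd

# Hypothetical sample rows; the real data would come from data.csv.
sample = pd.DataFrame(
    {"Vocabulary": ["[PAD]", "#", "مرحبا", "hello", "سلام!", "مرحبا بك!"]}
)

# Keep Arabic letters and whitespace, delete everything else, then trim.
# fillna("") is an assumption: missing values are treated as empty text.
cleaned = (
    sample["Vocabulary"]
    .fillna("")
    .str.replace(r"[^\u0627-\u064a\s]", "", regex=True)
    .str.strip()
)

# Drop the rows where nothing Arabic is left after cleaning.
arabic_only = sample.assign(clean_tweet=cleaned)
arabic_only = arabic_only[arabic_only["clean_tweet"] != ""].reset_index(drop=True)

print(arabic_only)
```

Rows that were only symbols or English disappear, and rows that mixed Arabic with symbols are kept with the symbols stripped from `clean_tweet`.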