I applied all the preprocessing steps, but I want to delete the rows that contain English words or specific symbols. I only want words in the Arabic language, without the symbols or English words targeted in the code below. I ran the code, but when I print the dataset after cleaning, it still looks uncleaned. I want to remove these rows, not replace their contents.
import re
import pandas as pd

lexicon = pd.read_csv(r"C:\Users\User\Python Code\data.csv")
lexicon.head(10)

# output
  Vocabulary
0      [PAD]
1      [UNK]
2      [CLS]
3      [SEP]
4     [MASK]
5          !
6          #
7          $
8          %
9          &

# regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default
lexicon['clean_tweet'] = lexicon.Vocabulary.str.replace(r'[^\w\s#@/:%.,_-]', '', flags=re.UNICODE, regex=True)  # removes emojis
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'@[_A-Za-z0-9]+', '', regex=True)  # removes handles
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[A-Za-z0-9]+', '', regex=True)  # removes english
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('#', ' ')  # removes hashtag symbol only
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'http\S+', '', regex=True).str.replace(r'www\S+', '', regex=True)  # removes links
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'\d+', '', regex=True)  # removes numbers
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('\n', ' ')  # removes new line
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace('_', '')  # removes underscore
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\w\s]', '', regex=True)  # removes punctuation
lexicon.head(10)

# output
  Vocabulary clean_tweet
0      [PAD]
1      [UNK]
2      [CLS]
3      [SEP]
4     [MASK]
5          !
6          #
7          $
8          %
9          &
I want to remove all rows that contain these symbols or any other language; I only need Arabic words. Or is there a simpler way to detect Arabic words only?
Note: if a row contains both Arabic words and symbols, I only want to delete the symbols and keep the Arabic words.
Answer
Going by this SO answer, a Unicode regex range for Arabic letters is:
[\u0627-\u064a]
We can try using the negated version of this character class along with str.replace:
lexicon['clean_tweet'] = lexicon.clean_tweet.str.replace(r'[^\u0627-\u064a]', '', regex=True)
If you want to spare whitespace characters or other punctuation symbols, then you could try using this regex:
[^\u0627-\u064a\s!?.-]
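Since the question asks for the non-Arabic rows to be removed rather than blanked, here is a minimal sketch that combines the negated character class with a filter on empty results. The sample DataFrame is made up for illustration and stands in for the real CSV; the column names follow the question, and the whitespace-collapsing step is an extra assumption, not part of the original answer.

```python
import pandas as pd

# Hypothetical sample data standing in for the real CSV from the question.
lexicon = pd.DataFrame(
    {"Vocabulary": ["[PAD]", "!", "hello", "مرحبا", "#سلام!", "abc كتاب 12"]}
)

# Everything that is neither in the Arabic letter range U+0627..U+064A nor whitespace.
# A non-raw string is used so Python itself expands the \u escapes.
NON_ARABIC = "[^\u0627-\u064a\\s]"

lexicon["clean_tweet"] = (
    lexicon["Vocabulary"]
    .str.replace(NON_ARABIC, "", regex=True)  # strip symbols, digits, Latin letters
    .str.replace("\\s+", " ", regex=True)     # collapse leftover runs of whitespace
    .str.strip()                              # trim leading/trailing spaces
)

# Drop the rows in which nothing Arabic survived the cleaning.
lexicon = lexicon[lexicon["clean_tweet"] != ""].reset_index(drop=True)

print(lexicon)
```

Rows that mix Arabic with symbols keep only their Arabic words, and rows with no Arabic at all disappear. Note that the range starts at U+0627, so hamza forms such as أ (U+0623) are stripped as well; starting the range at \u0621 instead should keep them, if that matters for your data.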