
Remove symbols in dataset

I applied all the preprocessing steps, but I want to delete the rows that contain English words or specific symbols. I only want words in the Arabic language, without the symbols or English words listed in the code below. I ran the code, but when I print the dataset afterwards, it is still not cleaned. I want to remove them, not replace them.

JavaScript

I want to remove all rows that contain these symbols or text in any other language; I only need Arabic words. Or is there a simpler way to detect only the Arabic words?

Note: if a row contains both Arabic words and symbols, I want to delete only the symbols and keep the Arabic words.


Answer

Going by this SO answer, a Unicode regex range for Arabic letters is:

JavaScript
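
As a minimal sketch of such a character class, assuming the range meant is the basic Arabic letters U+0621 to U+064A (an assumption; the exact range is not shown on this page), Python's `re` module can pick out the Arabic runs in a string. The helper name `find_arabic_runs` is hypothetical.

```python
import re

# Assumed range: basic Arabic letters, U+0621 (hamza) through U+064A (yeh).
ARABIC_LETTERS = re.compile(r"[\u0621-\u064A]+")

def find_arabic_runs(text):
    """Return every maximal run of Arabic letters found in text."""
    return ARABIC_LETTERS.findall(text)

# Mixed input: English word, Arabic word, digits, Arabic word, punctuation.
print(find_arabic_runs("hello مرحبا 123 بكم!"))
```

Note that this assumed range does not include Arabic diacritics (U+064B onwards), so vocalized text would need a wider class.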

We can try using the negative version of this character class along with str.replace:

JavaScript
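
A sketch of the negated class, again assuming the range U+0621 to U+064A and using the standard `re` module so it runs without a DataFrame. The function name `strip_non_arabic` is hypothetical.

```python
import re

# Negated class: matches any single character that is NOT an Arabic letter
# in the assumed range U+0621..U+064A.
NON_ARABIC = re.compile(r"[^\u0621-\u064A]")

def strip_non_arabic(text):
    """Delete every character outside the assumed Arabic letter range."""
    return NON_ARABIC.sub("", text)

print(strip_non_arabic("hello مرحبا 123 بكم!"))
```

With pandas, the same pattern can be passed to `Series.str.replace` with `regex=True`, for example `df["text"] = df["text"].str.replace(r"[^\u0621-\u064A]", "", regex=True)`, where the column name `text` is an assumption. The result has to be assigned back to the column; `str.replace` returns a new Series and does not modify the DataFrame in place, which may be why the dataset still looks uncleaned after the call.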

If you want to spare whitespace characters or other punctuation symbols, then you could try using this regex:

JavaScript
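
A sketch that keeps whitespace and then drops rows left empty, which matches the note in the question (strip symbols from mixed rows, delete rows with no Arabic at all). It assumes the same U+0621 to U+064A range plus `\s`; the function name `clean_rows` and the sample rows are hypothetical.

```python
import re

# Matches anything that is neither an Arabic letter (assumed range) nor whitespace.
NOT_ARABIC_OR_SPACE = re.compile(r"[^\u0621-\u064A\s]")

def clean_rows(rows):
    """Strip non-Arabic characters from each row and drop rows left empty."""
    cleaned = []
    for row in rows:
        text = NOT_ARABIC_OR_SPACE.sub("", row)
        # Collapse the gaps left behind by removed words and symbols.
        text = " ".join(text.split())
        if text:  # a row with no Arabic letters becomes "" and is skipped
            cleaned.append(text)
    return cleaned

rows = ["hello مرحبا 123 بكم!", "only english #@!", "اللغة العربية"]
for line in clean_rows(rows):
    print(line)
```

In a DataFrame, the equivalent would be to clean the column first and then filter out the rows whose cleaned value is an empty string.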