
Removing non-ASCII and special characters in a PySpark dataframe column

I am reading data from CSV files that have about 50 columns; a few of them (4 to 5) contain text with non-ASCII and special characters.


I am trying to remove all non-ASCII and special characters and keep only English characters. I tried the following:


There are no spaces in my column names. I receive an error:

TypeError: 'Column' object is not callable

Is there an alternative way to accomplish this? I would appreciate any help.


Answer

This should work.

First, create a temporary example dataframe:



Now write a UDF, because the string functions you are using cannot be applied directly to a Column type; calling them on a Column produces the 'Column' object is not callable error.

Solution


User contributions licensed under: CC BY-SA