Removing non-ASCII and special characters in a PySpark DataFrame column

Question

I am reading data from CSV files that have about 50 columns. A few of the columns (4 to 5) contain text data with non-ASCII characters and special characters. I am trying to remove all the non-ASCII and special characters and keep only English characters, and I tried to do it as below. There are no spaces in my column name. I

Accepted Answer

This should work. First, create a temporary example DataFrame:

    df = spark.createDataFrame([
        (0, "This is Spark"),
        (1, "I wish Java could use case classes"),
        (2, "Data science is  cool"),
        (3, "This is ï»¿aSA")
    ], ["id", "words"])
    df.show()

Output:

    +---+--------------------+
    | id|               words|
    +---+--------------------+
    |  0|       This is Spark|
    |  1|I wish Java could...|
    |  2|Data science is  ...|
    |  3|      This is ï»¿aSA|
    +---+--------------------+

Next, write a UDF. The Python string methods you used cannot be called directly on a Column, which is why you get the "'Column' object is not callable" error.

Solution:

    from pyspark.sql.functions import udf

    def ascii_ignore(x):
        # Nulls reach the UDF as None, so pass them through unchanged
        if x is None:
            return None
        # Drop every character that cannot be encoded as ASCII
        return x.encode('ascii', 'ignore').decode('ascii')

    # udf() defaults to a string return type
    ascii_udf = udf(ascii_ignore)
    df.withColumn("foo", ascii_udf('words')).show()

Output:

    +---+--------------------+--------------------+
    | id|               words|                 foo|
    +---+--------------------+--------------------+
    |  0|       This is Spark|       This is Spark|
    |  1|I wish Java could...|I wish Java could...|
    |  2|Data science is  ...|Data science is  ...|
    |  3|      This is ï»¿aSA|         This is aSA|
    +---+--------------------+--------------------+
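The UDF above only drops non-ASCII characters, while the question also asks about special characters. One possible extension is sketched below in plain Python, so it can be tried without a Spark session. The name `keep_english` and the choice to keep only letters, digits, and spaces are assumptions for illustration, not part of the original answer; adjust the pattern to whatever counts as "special" in your data.

```python
import re

# Assumption: "special characters" means anything other than
# ASCII letters, digits, and the space character.
_NON_ALNUM_SPACE = re.compile(r"[^A-Za-z0-9 ]+")

def ascii_ignore(text):
    """Drop every character that cannot be encoded as ASCII."""
    if text is None:
        return None
    return text.encode("ascii", "ignore").decode("ascii")

def keep_english(text):
    """Hypothetical helper: strip non-ASCII first, then ASCII symbols."""
    if text is None:
        return None
    return _NON_ALNUM_SPACE.sub("", ascii_ignore(text))

samples = ["This is ï»¿aSA", "Data science is  cool", "naïve café #1!", None]
cleaned = [keep_english(s) for s in samples]
print(cleaned)
```

In Spark, `keep_english` could be wrapped with `udf(keep_english)` in the same way as `ascii_ignore`. If a Python UDF turns out to be too slow, the built-in `pyspark.sql.functions.regexp_replace` with a similar character class may be worth trying instead, since it runs without calling back into Python.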

Answer