I am reading data from CSV files that have about 50 columns. A few of the columns (4 to 5) contain text data with non-ASCII characters and special characters.
df = spark.read.csv(path, header=True, schema=availSchema)
I am trying to remove all the non-ASCII and special characters and keep only English characters. I tried to do it as below:
df = df['textcolumn'].str.encode('ascii', 'ignore').str.decode('ascii')
There are no spaces in my column name. I receive this error:
TypeError: 'Column' object is not callable
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<command-1486957561378215> in <module>
----> 1 InvFilteredDF = InvFilteredDF['SearchResultDescription'].str.encode('ascii', 'ignore').str.decode('ascii')

TypeError: 'Column' object is not callable
Is there an alternative way to accomplish this? I would appreciate any help.
Answer
This should work.
First, create a temporary example DataFrame:
df = spark.createDataFrame([
    (0, "This is Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Data science is cool"),
    (3, "This is aSA")
], ["id", "words"])

df.show()
Output
+---+--------------------+
| id|               words|
+---+--------------------+
|  0|       This is Spark|
|  1|I wish Java could...|
|  2|Data science is cool|
|  3|         This is aSA|
+---+--------------------+
Next, write a UDF. The .str.encode and .str.decode calls you used are pandas string methods; they cannot be applied to a Spark Column, which is why you get the 'Column' object is not callable error.
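For reference, this is what the encode/decode pair does to an ordinary Python string, outside Spark. The sample text is made up for illustration. Note that ASCII punctuation survives; only characters with no ASCII representation are dropped.

```python
# Hypothetical sample: an e-acute and a euro sign mixed with plain ASCII.
sample = "Caf\u00e9 costs 3\u20ac, ok!"

# 'ignore' silently drops anything that cannot be encoded as ASCII;
# decode() turns the resulting bytes back into a str.
cleaned = sample.encode("ascii", "ignore").decode("ascii")

print(cleaned)  # Caf costs 3, ok!
```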
Solution
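from pyspark.sql.functions import udf

def ascii_ignore(x):
    # Drop every character that cannot be encoded as ASCII; pass nulls through.
    if x is None:
        return None
    return x.encode('ascii', 'ignore').decode('ascii')

# Without an explicit return type, udf() defaults to StringType.
ascii_udf = udf(ascii_ignore)

df.withColumn("foo", ascii_udf('words')).show()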
Output
+---+--------------------+--------------------+
| id|               words|                 foo|
+---+--------------------+--------------------+
|  0|       This is Spark|       This is Spark|
|  1|I wish Java could...|I wish Java could...|
|  2|Data science is cool|Data science is cool|
|  3|         This is aSA|         This is aSA|
+---+--------------------+--------------------+
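A possible alternative that avoids a Python UDF is Spark's built-in regexp_replace, which may also be more convenient when several columns need the same cleanup. The sketch below is an assumption on my part, not part of the answer above: clean_columns, NON_ASCII and NON_ALNUM_SPACE are names invented here, and the sample string is made up. The patterns are checked with Python's re module only; the Spark part is not exercised in this snippet.

```python
import re

# Patterns chosen so they mean the same thing in Python's re module and in
# the Java regex engine that Spark's regexp_replace uses.
NON_ASCII = r"[^\x00-\x7F]"          # anything outside the ASCII range
NON_ALNUM_SPACE = r"[^A-Za-z0-9 ]"   # also strips punctuation and symbols


def clean_columns(df, columns, pattern=NON_ASCII):
    """Return df with every match of pattern removed from the given string columns."""
    # Imported lazily so the pattern check below runs without Spark installed.
    from pyspark.sql.functions import col, regexp_replace

    for name in columns:
        df = df.withColumn(name, regexp_replace(col(name), pattern, ""))
    return df


# Quick check of the two patterns on a hypothetical sample string.
sample = "Caf\u00e9 costs 3\u20ac (approx.)"
print(re.sub(NON_ASCII, "", sample))        # Caf costs 3 (approx.)
print(re.sub(NON_ALNUM_SPACE, "", sample))  # Caf costs 3 approx
```

With a DataFrame like the one in the question, a call such as clean_columns(df, ["textcolumn"], NON_ALNUM_SPACE) would be the intended usage. Which pattern fits depends on whether "special characters" is meant to include ASCII punctuation.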