
String split with the value of another column in PySpark

I have the following DataFrame:

+----+-------+
|item|   path|
+----+-------+
| -b-|  a-b-c|
| -b-|  e-b-f|
| -d-|e-b-d-h|
| -c-|  g-h-c|
+----+-------+

I want to split the path column by the value of the item column in the same row:

+----+--------+
|item|    path|
+----+--------+
| -b-|  [a, c]|
| -b-|  [e, f]|
| -d-|[e-b, h]|
| -c-| [g-h, ]|
+----+--------+

I've used this UDF:

from pyspark.sql import types as T
from pyspark.sql.functions import udf

split_udf = udf(lambda a, b: a.split(b), T.ArrayType(T.StringType()))
org = org.withColumn('crb_url', split_udf('path', 'item')[0])

It worked very well, but I was wondering if there's another way to do it with built-in PySpark functions, because whenever I try to join org with another DataFrame or save it as a Delta table, I get this error:

AttributeError: 'NoneType' object has no attribute 'split'
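
A minimal sketch that reproduces the error (assuming, as the message suggests, that some path values are null):

from pyspark.sql import SparkSession, types as T
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

# a null in path means the lambda receives None for `a`,
# and None.split(b) raises the AttributeError above (wrapped by Spark)
df = spark.createDataFrame([("-b-", "a-b-c"), ("-b-", None)], ["item", "path"])
split_udf = udf(lambda a, b: a.split(b), T.ArrayType(T.StringType()))
df.withColumn("crb_url", split_udf("path", "item")[0]).show()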


Answer

The error means some rows have a null path, so the UDF's lambda ends up calling .split on None. Use .fillna("") to replace the nulls with empty strings before applying the UDF, like this:

org = org.fillna("").withColumn('crb_url', split_udf('path', 'item')[0])
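
If you want to avoid the UDF entirely, the SQL split function can take its delimiter from another column when called through expr, and it returns null for a null input instead of raising. A sketch, assuming the same org DataFrame (note the second argument of split is a regular expression, so item values containing regex metacharacters would need escaping):

from pyspark.sql import functions as F

# split(str, regex) is evaluated per row, with the delimiter read
# from the item column; null paths just produce a null result
org = org.withColumn('crb_url', F.expr("split(path, item)")[0])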

User contributions licensed under: CC BY-SA