I have a PySpark DataFrame with two columns, ID and count, where count is a dict/MapType(StringType, IntegerType) column. I want to create another column holding the total of all the values in count.
I have
ID          count
3004000304  {'A' -> 2, 'B' -> 4, 'C' -> 5, 'D' -> 1, 'E' -> 9}
3004002756  {'B' -> 3, 'A' -> 8, 'D' -> 3, 'C' -> 8, 'E' -> 1}
I want something like this, where total_value is the sum of all the values in count:
ID          count                                                total_value
3004000304  {'A' -> 2, 'B' -> 4, 'C' -> 5, 'D' -> 1, 'E' -> 9}  21
3004002756  {'B' -> 3, 'A' -> 8, 'D' -> 3, 'C' -> 8, 'E' -> 1}  23
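
For reproducibility, a comparable DataFrame can be built like this (a minimal sketch; treating ID as a string is an assumption, since its type is not shown above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above; 'count' is a map<string,int> column.
df = spark.createDataFrame(
    [('3004000304', {'A': 2, 'B': 4, 'C': 5, 'D': 1, 'E': 9}),
     ('3004002756', {'B': 3, 'A': 8, 'D': 3, 'C': 8, 'E': 1})],
    'ID string, count map<string,int>'
)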
My approach
from pyspark.sql import functions as F

df.select(F.explode("count")).groupBy("key").sum("value").rdd.collectAsMap()
But this explodes the maps across all rows and aggregates per key, so on the data above it returns {'A': 10, 'B': 7, 'C': 13, 'D': 4, 'E': 10} instead of a per-row total, which is incorrect.
If it is not possible in PySpark, is it possible to convert to a pandas DataFrame and then do it? Any help is much appreciated.
Answer
Use the aggregate function to accumulate the map_values.
from pyspark.sql import functions as F

# Fold over the map's values per row: start at 0 and add each value.
df = df.withColumn('total_value', F.expr('aggregate(map_values(count), 0, (acc, x) -> acc + int(x))'))
df.show(truncate=False)
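
On Spark 3.1 or later, the same fold can also be written with the Python-side aggregate function rather than a SQL expression string (a minimal sketch, assuming Spark 3.1+):

from pyspark.sql import functions as F

# Equivalent per-row fold using the native API: sum the map's values,
# casting each to int to match the SQL version above.
df = df.withColumn(
    'total_value',
    F.aggregate(F.map_values('count'), F.lit(0), lambda acc, x: acc + x.cast('int'))
)
df.show(truncate=False)  # total_value is 21 and 23 for the sample rows

Either way, the whole computation stays in Spark, so the round trip to pandas mentioned in the question is unnecessary.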