I have a PySpark DataFrame with two columns, ID and count, where count is a dict/Map<str,int> column. I want to create another column containing the total of all the values in count.
I have
ID          count
3004000304  {'A' -> 2, 'B' -> 4, 'C' -> 5, 'D' -> 1, 'E' -> 9}
3004002756  {'B' -> 3, 'A' -> 8, 'D' -> 3, 'C' -> 8, 'E' -> 1}
I want something like this, with the sum of all the values of the count column:
ID          count                                                total_value
3004000304  {'A' -> 2, 'B' -> 4, 'C' -> 5, 'D' -> 1, 'E' -> 9}   21
3004002756  {'B' -> 3, 'A' -> 8, 'D' -> 3, 'C' -> 8, 'E' -> 1}   23
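A minimal sketch to reproduce this data (the schema is assumed from the description above, with the map values as integers):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Two rows with an ID and a map<string,int> "count" column
df = spark.createDataFrame(
    [
        (3004000304, {'A': 2, 'B': 4, 'C': 5, 'D': 1, 'E': 9}),
        (3004002756, {'B': 3, 'A': 8, 'D': 3, 'C': 8, 'E': 1}),
    ],
    'ID long, count map<string,int>',
)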
My approach
from pyspark.sql import functions as F

# explode the map into (key, value) rows, then sum per key
df.select(F.explode("count")).groupBy("key").sum("value").rdd.collectAsMap()
But this groups by each individual key and aggregates across the whole DataFrame, which is incorrect; I need the total per row.
If it is not possible in PySpark, is it possible to convert to a pandas DataFrame and then do it? Any help is much appreciated.
Answer
Use the aggregate function to accumulate the map_values.
df = df.withColumn('total_value', F.expr('aggregate(map_values(count), 0, (acc, x) -> acc + int(x))'))
df.show(truncate=False)
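As a side note, on Spark 3.1+ the same thing can also be written with the aggregate and map_values helpers from pyspark.sql.functions instead of a SQL expression string. This is just a sketch of the equivalent call, not part of the original answer:

from pyspark.sql import functions as F

# aggregate(map_values(count), 0, (acc, x) -> acc + x), expressed with the Python API
df = df.withColumn(
    'total_value',
    F.aggregate(F.map_values('count'), F.lit(0), lambda acc, x: acc + x),
)
df.show(truncate=False)

Either way, the new total_value column should hold 21 and 23 for the two sample rows.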