
PySpark sum all the values of Map column into a new column

I have a DataFrame which looks like this:

ID           col
1            [item1 -> 0.2, Item2 -> 0.3, item3 -> 0.4]
2            [item2 -> 0.1, Item2 -> 0.7, item3 -> 0.2]

I want to sum all of the row-wise decimal values and store the result in a new column:

ID           col                                                total
1            [item1 -> 0.2, Item2 -> 0.3, item3 -> 0.4]          0.9
2            [item2 -> 0.1, Item2 -> 0.7, item3 -> 0.2]          1.0
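
For reference, a DataFrame with this shape can be built as below. This is only a sketch that assumes col is a map<string,double> column, as the example suggests:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# assumed reproduction of the example data; 'col' holds a map<string,double>
df = spark.createDataFrame(
    [
        (1, {'item1': 0.2, 'Item2': 0.3, 'item3': 0.4}),
        (2, {'item2': 0.1, 'Item2': 0.7, 'item3': 0.2}),
    ],
    'ID int, col map<string,double>',
)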

My approach

from pyspark.sql import functions as F
df = df.withColumn('total', F.expr('aggregate(map_values(col), 0, (acc, x) -> acc + x)'))

This does not work; the error says the operation can only be applied to int.


Answer

from pyspark.sql import functions as func
data_sdf = (
    data_sdf
    .withColumn('map_vals', func.map_values('col'))
    .withColumn('sum_of_vals', func.expr('aggregate(map_vals, cast(0 as double), (x, y) -> x + y)'))
)

Since your values are of float type, the initial value passed to aggregate should match the type of the values in the array. So casting the initial 0 to double, instead of passing the integer literal 0, makes it work.
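
On Spark 3.1 and later, the same thing can also be written with the Python-side aggregate helper, passing a double literal as the initial value so the types line up. This is only a sketch, reusing the question's column names:

from pyspark.sql import functions as func

# equivalent sketch using the Python API (Spark 3.1+);
# func.lit(0.0) is a double literal, so it already matches the map's value type
data_sdf = data_sdf.withColumn(
    'total',
    func.aggregate(func.map_values('col'), func.lit(0.0), lambda acc, x: acc + x)
)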

User contributions licensed under: CC BY-SA