Skip to content
Advertisement

Pyspark agg max function showing different result

I was just studying some pyspark code and didnt understand these particular lines. I have a python code such as below:

emp = [("James", "Java", "4"),
("James", "R", "4"),
("James", "Python", "1"),
("Michael", "Java", "2"),
("Michael", "PHP", "4"),
("Michael", "PHP", "2"),
("Robert", "C#", "3"),
("Robert", "Java", "4"),
("Robert", "R", "1"),
("Washington", "Java", "2")
       ]
empColumns = ["name", "booksInterested", "id"]
empDF = spark.createDataFrame(data=emp, schema=empColumns)
empDF.show()
w = Window.partitionBy('id')
empDF = empDF.withColumn('e_array', F.collect_list('booksInterested').over(w))
empDF.show(truncate=False)
empDF = empDF.agg(F.max('e_array').alias('new_array'))
empDF.show(truncate=False)

When showing empDF after

empDF = empDF.agg(F.max('e_array').alias('new_array'))

Isn’t it supposed to show the longest list? It is showing [Python , R] as the output ? I dont understand how is this output coming?

Advertisement

Answer

Pyspark’s max function returns

the maximum value of the expression in a group

When used on string columns (array columns are internally converted to strings), it returns the maximum value according to alphabetical order: therefore, since all rows in e_array start with a square bracket [, the maximum value for that column will be the one that starts with the last letter in alphabetical order, i.e. the P of Python in this case.


In order to retrieve the longest array in your column, you need to use Pyspark’s size function and then filter based on its values:

# calculate size column
empDF = empDF.withColumn('size', F.size('e_array'))
# compute maximum size
max_size = empDF.agg(F.max('size')).collect()[0][0]
print(max_size)
# 4

# show rows with longest arrays
empDF.filter(F.col('size') == max_size).show(truncate=False)

+-------+---------------+---+--------------------+----+
|name   |booksInterested|id |e_array             |size|
+-------+---------------+---+--------------------+----+
|James  |Java           |4  |[Java, R, PHP, Java]|4   |
|James  |R              |4  |[Java, R, PHP, Java]|4   |
|Michael|PHP            |4  |[Java, R, PHP, Java]|4   |
|Robert |Java           |4  |[Java, R, PHP, Java]|4   |
+-------+---------------+---+--------------------+----+
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement