I was just studying some pyspark code and didnt understand these particular lines. I have a python code such as below:
emp = [("James", "Java", "4"), ("James", "R", "4"), ("James", "Python", "1"), ("Michael", "Java", "2"), ("Michael", "PHP", "4"), ("Michael", "PHP", "2"), ("Robert", "C#", "3"), ("Robert", "Java", "4"), ("Robert", "R", "1"), ("Washington", "Java", "2") ] empColumns = ["name", "booksInterested", "id"] empDF = spark.createDataFrame(data=emp, schema=empColumns) empDF.show() w = Window.partitionBy('id') empDF = empDF.withColumn('e_array', F.collect_list('booksInterested').over(w)) empDF.show(truncate=False) empDF = empDF.agg(F.max('e_array').alias('new_array')) empDF.show(truncate=False)
When showing empDF after
empDF = empDF.agg(F.max('e_array').alias('new_array'))
Isn’t it supposed to show the longest list? It is showing [Python , R] as the output ? I dont understand how is this output coming?
Advertisement
Answer
Pyspark’s max
function returns
the maximum value of the expression in a group
When used on string columns (array columns are internally converted to strings), it returns the maximum value according to alphabetical order: therefore, since all rows in e_array
start with a square bracket [
, the maximum value for that column will be the one that starts with the last letter in alphabetical order, i.e. the P
of Python
in this case.
In order to retrieve the longest array in your column, you need to use Pyspark’s size
function and then filter based on its values:
# calculate size column empDF = empDF.withColumn('size', F.size('e_array')) # compute maximum size max_size = empDF.agg(F.max('size')).collect()[0][0] print(max_size) # 4 # show rows with longest arrays empDF.filter(F.col('size') == max_size).show(truncate=False) +-------+---------------+---+--------------------+----+ |name |booksInterested|id |e_array |size| +-------+---------------+---+--------------------+----+ |James |Java |4 |[Java, R, PHP, Java]|4 | |James |R |4 |[Java, R, PHP, Java]|4 | |Michael|PHP |4 |[Java, R, PHP, Java]|4 | |Robert |Java |4 |[Java, R, PHP, Java]|4 | +-------+---------------+---+--------------------+----+