I have a DataFrame in PySpark with data as below:
user_id object_id score
user_1  object_1  3
user_1  object_1  1
user_1  object_2  2
user_2  object_1  5
user_2  object_2  2
user_2  object_2  6
What I expect is to return the 2 records with the highest score from each group of records sharing the same user_id. Consequently, the result should look like the following:
user_id object_id score
user_1  object_1  3
user_1  object_2  2
user_2  object_2  6
user_2  object_1  5
I’m really new to PySpark; could anyone give me a code snippet or a pointer to the relevant documentation for this problem? Many thanks!
Answer
I believe you need to use window functions to compute the rank of each row, partitioned by user_id and ordered by score, and then filter your results to keep only the top two values.
from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col

window = Window.partitionBy(df['user_id']).orderBy(df['score'].desc())

df.select('*', rank().over(window).alias('rank')) \
  .filter(col('rank') <= 2) \
  .show()

#+-------+---------+-----+----+
#|user_id|object_id|score|rank|
#+-------+---------+-----+----+
#| user_1| object_1|    3|   1|
#| user_1| object_2|    2|   2|
#| user_2| object_2|    6|   1|
#| user_2| object_1|    5|   2|
#+-------+---------+-----+----+
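One caveat: rank() assigns the same rank to tied scores, so a group that contains ties can return more than two rows. If you need exactly two rows per user regardless of ties, row_number() is an alternative; a minimal sketch, assuming the same df as above:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, col

window = Window.partitionBy('user_id').orderBy(col('score').desc())

# row_number() numbers rows 1, 2, 3, ... within each partition, breaking
# ties arbitrarily, so the filter keeps exactly two rows per user_id
df.select('*', row_number().over(window).alias('row_num')) \
  .filter(col('row_num') <= 2) \
  .drop('row_num') \
  .show()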
In general, the official programming guide is a good place to start learning Spark.
Data
rdd = sc.parallelize([("user_1", "object_1", 3),
                      ("user_1", "object_2", 2),
                      ("user_2", "object_1", 5),
                      ("user_2", "object_2", 2),
                      ("user_2", "object_2", 6)])
df = sqlContext.createDataFrame(rdd, ["user_id", "object_id", "score"])
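If you're on Spark 2.0 or later, the SparkSession entry point replaces sc and sqlContext, and you can build the DataFrame directly from the tuples; a minimal sketch of the equivalent setup (the app name here is just a placeholder):

from pyspark.sql import SparkSession

# Single entry point in Spark 2.0+; replaces SparkContext/SQLContext
spark = SparkSession.builder.appName("top-n-per-group").getOrCreate()

df = spark.createDataFrame(
    [("user_1", "object_1", 3),
     ("user_1", "object_2", 2),
     ("user_2", "object_1", 5),
     ("user_2", "object_2", 2),
     ("user_2", "object_2", 6)],
    ["user_id", "object_id", "score"])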