I want to groupby id and count the unique grade and return max

I have this data and am trying to solve the following questions.

from pyspark.sql import functions as F

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), ('102','good','2022/06/01'),
 ('103','fail','2022/06/02'), ('104','poor','2022/06/02'),
 ('101','good','2022/06/08'), ('101','excellent','2022/06/14'),
 ('102','poor','2022/06/10'), ('104','good','2022/06/09'),
 ('102','poor','2022/06/13'), ('103','fail','2022/06/14')]

DataFrame_from_Scratch = spark.createDataFrame(values, columns)
DataFrame_from_Scratch.show()

DataFrame_from_Scratch.groupby(F.col('id')).agg(F.countDistinct(F.col('grade'))).show()
  1. group by id and count the unique grades
  2. what is the maximum?
  3. group by id and date, and how many unique dates are there?


Answer

Your implementation for the 1st question is correct, so I'm not sure what exactly that part is seeking as an answer.

But nevertheless, below are the answers for the other sub-parts –

Data Preparation

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), 
 ('102','good','2022/06/01'), ('103','fail','2022/06/02'), 
 ('104','poor','2022/06/02'),('101','good','2022/06/08'), 
 ('101','excellent','2022/06/14'),('102','poor','2022/06/10'), 
 ('104','good','2022/06/09'),('102','poor','2022/06/13'), 
 ('103','fail','2022/06/14')]

from pyspark.sql import functions as F

sparkDF = spark.createDataFrame(values, columns)

sparkDF.show()

+---+---------+----------+
| id|    grade|      date|
+---+---------+----------+
|101|     good|2022/06/01|
|102|     good|2022/06/01|
|103|     fail|2022/06/02|
|104|     poor|2022/06/02|
|101|     good|2022/06/08|
|101|excellent|2022/06/14|
|102|     poor|2022/06/10|
|104|     good|2022/06/09|
|102|     poor|2022/06/13|
|103|     fail|2022/06/14|
+---+---------+----------+

Unique Grade Counts

sparkDF.groupBy(F.col('id')).agg(F.countDistinct(F.col('grade')).alias('Distinct Grade Count')).show()

+---+--------------------+
| id|Distinct Grade Count|
+---+--------------------+
|101|                   2|
|104|                   2|
|102|                   2|
|103|                   1|
+---+--------------------+

Maximum – Assuming Max Date

sparkDF.groupBy(F.col('id')).agg(F.max(F.col('date')).alias('Max Date')).show()

+---+----------+
| id|  Max Date|
+---+----------+
|101|2022/06/14|
|102|2022/06/13|
|103|2022/06/14|
|104|2022/06/09|
+---+----------+

Unique Dates

Not sure about the intention behind this one, as it does not add much for your dataset: each (id, date) pair occurs only once, so the distinct grade count per group is always 1 –

sparkDF.groupBy(['id','Date']).agg(F.countDistinct(F.col('grade')).alias('Distinct Grade Count')).orderBy('id').show()

+---+----------+--------------------+
| id|      Date|Distinct Grade Count|
+---+----------+--------------------+
|101|2022/06/14|                   1|
|101|2022/06/01|                   1|
|101|2022/06/08|                   1|
|102|2022/06/13|                   1|
|102|2022/06/01|                   1|
|102|2022/06/10|                   1|
|103|2022/06/14|                   1|
|103|2022/06/02|                   1|
|104|2022/06/09|                   1|
|104|2022/06/02|                   1|
+---+----------+--------------------+
User contributions licensed under: CC BY-SA