I have this data and am trying to solve the following questions.
from pyspark.sql import functions as F

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), ('102','good','2022/06/01'),
          ('103','fail','2022/06/02'), ('104','poor','2022/06/02'),
          ('101','good','2022/06/08'), ('101','excellent','2022/06/14'),
          ('102','poor','2022/06/10'), ('104','good','2022/06/09'),
          ('102','poor','2022/06/13'), ('103','fail','2022/06/14')]

DataFrame_from_Scratch = spark.createDataFrame(values, columns)
DataFrame_from_Scratch.show()

DataFrame_from_Scratch.groupby(F.col('id')).agg(F.countDistinct(F.col('grade'))).show()
- group by id and count the unique grades
- what is the maximum?
- group by id and date; how many unique dates are there?
Answer
Your implementation for the 1st question is correct. I'm not sure what exactly your "maximum" question is seeking as an answer, but nevertheless, below are the answers for the other sub-parts –
Data Preparation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sql = SparkSession.builder.getOrCreate()  # session handle, named 'sql' in this answer

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), ('102','good','2022/06/01'),
          ('103','fail','2022/06/02'), ('104','poor','2022/06/02'),
          ('101','good','2022/06/08'), ('101','excellent','2022/06/14'),
          ('102','poor','2022/06/10'), ('104','good','2022/06/09'),
          ('102','poor','2022/06/13'), ('103','fail','2022/06/14')]

sparkDF = sql.createDataFrame(values, columns)
sparkDF.show()

+---+---------+----------+
| id|    grade|      date|
+---+---------+----------+
|101|     good|2022/06/01|
|102|     good|2022/06/01|
|103|     fail|2022/06/02|
|104|     poor|2022/06/02|
|101|     good|2022/06/08|
|101|excellent|2022/06/14|
|102|     poor|2022/06/10|
|104|     good|2022/06/09|
|102|     poor|2022/06/13|
|103|     fail|2022/06/14|
+---+---------+----------+
Unique Grade Counts
sparkDF.groupBy(F.col('id')).agg(
    F.countDistinct(F.col('grade')).alias('Distinct Grade Count')
).show()

+---+--------------------+
| id|Distinct Grade Count|
+---+--------------------+
|101|                   2|
|104|                   2|
|102|                   2|
|103|                   1|
+---+--------------------+
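If "what is the maximum" refers to the largest of these distinct grade counts rather than to a date, a minimal sketch of that reading (the alias names are illustrative, not from the original question):

grade_counts = sparkDF.groupBy('id').agg(
    F.countDistinct('grade').alias('Distinct Grade Count')
)

# Collapse the per-id counts into a single overall maximum
grade_counts.agg(
    F.max('Distinct Grade Count').alias('Max Distinct Grade Count')
).show()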
Maximum – Assuming Max Date
# 'date' holds strings, but the 'yyyy/MM/dd' format sorts lexicographically
# in chronological order, so F.max still returns the latest date
sparkDF.groupBy(F.col('id')).agg(
    F.max(F.col('Date')).alias('Max Date')
).show()

+---+----------+
| id|  Max Date|
+---+----------+
|101|2022/06/14|
|102|2022/06/13|
|103|2022/06/14|
|104|2022/06/09|
+---+----------+
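If you would rather aggregate over proper dates than over strings, a sketch that parses the column first (the parsed_date column name is illustrative):

sparkDF.withColumn(
    'parsed_date', F.to_date(F.col('date'), 'yyyy/MM/dd')  # string -> DateType
).groupBy('id').agg(
    F.max('parsed_date').alias('Max Date')
).show()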
Unique Dates
I'm not sure about the intention behind this one, as it does not quite make sense for your dataset: id & date is already the finest granularity level, so grouping by both leaves at most one distinct grade per group.
sparkDF.groupBy(['id','Date']).agg(
    F.countDistinct(F.col('grade')).alias('Distinct Grade Count')
).orderBy('id').show()

+---+----------+--------------------+
| id|      Date|Distinct Grade Count|
+---+----------+--------------------+
|101|2022/06/14|                   1|
|101|2022/06/01|                   1|
|101|2022/06/08|                   1|
|102|2022/06/13|                   1|
|102|2022/06/01|                   1|
|102|2022/06/10|                   1|
|103|2022/06/14|                   1|
|103|2022/06/02|                   1|
|104|2022/06/09|                   1|
|104|2022/06/02|                   1|
+---+----------+--------------------+
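If the intent was instead to count how many unique dates each id appears on, a minimal sketch of that reading:

sparkDF.groupBy('id').agg(
    F.countDistinct('date').alias('Distinct Date Count')
).show()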