I have this data and am trying to solve the following questions.
from pyspark.sql import functions as F

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), ('102','good','2022/06/01'),
          ('103','fail','2022/06/02'), ('104','poor','2022/06/02'),
          ('101','good','2022/06/08'), ('101','excellent','2022/06/14'),
          ('102','poor','2022/06/10'), ('104','good','2022/06/09'),
          ('102','poor','2022/06/13'), ('103','fail','2022/06/14')]

DataFrame_from_Scratch = spark.createDataFrame(values, columns)
DataFrame_from_Scratch.show()

DataFrame_from_Scratch.groupby(F.col('id')).agg(F.countDistinct(F.col('grade'))).show()
- group by id and count the unique grades
- what is the maximum?
- group by id and date; how many unique dates are there?
Answer
Your implementation for the 1st question is correct. I'm not sure what exactly your "maximum" question is seeking as an answer, but nevertheless, below are the answers for the other sub-parts –
Data Preparation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

sql = SparkSession.builder.getOrCreate()  # session handle, named 'sql' in this answer

columns = ['id', 'grade', 'date']
values = [('101','good','2022/06/01'), ('102','good','2022/06/01'),
          ('103','fail','2022/06/02'), ('104','poor','2022/06/02'),
          ('101','good','2022/06/08'), ('101','excellent','2022/06/14'),
          ('102','poor','2022/06/10'), ('104','good','2022/06/09'),
          ('102','poor','2022/06/13'), ('103','fail','2022/06/14')]

sparkDF = sql.createDataFrame(values, columns)
sparkDF.show()

+---+---------+----------+
| id|    grade|      date|
+---+---------+----------+
|101|     good|2022/06/01|
|102|     good|2022/06/01|
|103|     fail|2022/06/02|
|104|     poor|2022/06/02|
|101|     good|2022/06/08|
|101|excellent|2022/06/14|
|102|     poor|2022/06/10|
|104|     good|2022/06/09|
|102|     poor|2022/06/13|
|103|     fail|2022/06/14|
+---+---------+----------+
Unique Grade Counts
sparkDF.groupBy(F.col('id')).agg(
    F.countDistinct(F.col('grade')).alias('Distinct Grade Count')
).show()

+---+--------------------+
| id|Distinct Grade Count|
+---+--------------------+
|101|                   2|
|104|                   2|
|102|                   2|
|103|                   1|
+---+--------------------+
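If "what is the maximum" refers to the largest of these distinct grade counts rather than to a date, a minimal sketch of that reading (the alias names are illustrative, not from the original question):

grade_counts = sparkDF.groupBy('id').agg(
    F.countDistinct('grade').alias('Distinct Grade Count')
)

# Collapse the per-id counts into a single overall maximum
grade_counts.agg(
    F.max('Distinct Grade Count').alias('Max Distinct Grade Count')
).show()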
Maximum – Assuming Max Date
# 'date' holds strings, but the 'yyyy/MM/dd' format sorts lexicographically
# in chronological order, so F.max still returns the latest date
sparkDF.groupBy(F.col('id')).agg(
    F.max(F.col('Date')).alias('Max Date')
).show()

+---+----------+
| id|  Max Date|
+---+----------+
|101|2022/06/14|
|102|2022/06/13|
|103|2022/06/14|
|104|2022/06/09|
+---+----------+
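If you would rather aggregate over proper dates than over strings, a sketch that parses the column first (the parsed_date column name is illustrative):

sparkDF.withColumn(
    'parsed_date', F.to_date(F.col('date'), 'yyyy/MM/dd')  # string -> DateType
).groupBy('id').agg(
    F.max('parsed_date').alias('Max Date')
).show()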
Unique Dates
I'm not sure about the intention behind this one, as it does not quite make sense for your dataset: id & date is already the finest granularity level, so grouping by both leaves at most one distinct grade per group.
sparkDF.groupBy(['id','Date']).agg(
    F.countDistinct(F.col('grade')).alias('Distinct Grade Count')
).orderBy('id').show()

+---+----------+--------------------+
| id|      Date|Distinct Grade Count|
+---+----------+--------------------+
|101|2022/06/14|                   1|
|101|2022/06/01|                   1|
|101|2022/06/08|                   1|
|102|2022/06/13|                   1|
|102|2022/06/01|                   1|
|102|2022/06/10|                   1|
|103|2022/06/14|                   1|
|103|2022/06/02|                   1|
|104|2022/06/09|                   1|
|104|2022/06/02|                   1|
+---+----------+--------------------+
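If the intent was instead to count how many unique dates each id appears on, a minimal sketch of that reading:

sparkDF.groupBy('id').agg(
    F.countDistinct('date').alias('Distinct Date Count')
).show()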