I have a PySpark dataframe containing the information shown in the table. I need to produce a data visualization that plots the number of completed studies in each month of a given year. My idea is that each "Completed" value (taken from the Status column) should be matched against each month of the year and then aggregated per year.
The challenge I am facing is how to aggregate the completed studies by month, then by year, and finally plot the data. The values in the column are strings, and converting them to integers is another challenge.
I understand that I need to keep the rows whose status is "Completed" and aggregate them per year, so that I end up with two columns that can be plotted as x and y.
Any idea on how this can be achieved is appreciated.
Answer
If I understood your question correctly, you are looking for something like the following: first convert the string values to an actual date column with the to_date function, then group by that date column and count the completed studies in each month-year combination.
import pyspark.sql.functions as F

new_df = (
    df
    .filter((F.col('Status') == 'Completed') & (F.col('Completion').isNotNull()))
    .withColumn('date', F.to_date('Completion', 'MMM yyyy'))
    .groupby('date')
    .count()
    .sort('date')
)
Example
# this is sample data similar to yours, with only the relevant columns
df = spark.createDataFrame([
    ('Completed', 'Jul 2020'),
    ('Completed', 'Jan 2018'),
    ('Completed', 'Dec 2018'),
    ('Completed', 'Jul 2020'),
    ('Completed', 'Jan 2018'),
    ('Completed', 'Jan 2020'),
    ('Completed', None)
], ('Status', 'Completion'))

new_df = (
    df
    .filter((F.col('Status') == 'Completed') & (F.col('Completion').isNotNull()))
    .withColumn('date', F.to_date('Completion', 'MMM yyyy'))
    .groupby('date')
    .count()
    .sort('date')
)

new_df.show()

+----------+-----+
|      date|count|
+----------+-----+
|2018-01-01|    2|
|2018-12-01|    1|
|2020-01-01|    1|
|2020-07-01|    2|
+----------+-----+
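For the plotting part of your question, one possible approach (a minimal sketch, assuming matplotlib is available and that the aggregated result is small enough to collect to the driver) is to convert new_df to a pandas DataFrame with toPandas and draw a bar chart of counts per month:

# minimal plotting sketch, assuming matplotlib is installed and the
# aggregated result fits comfortably in driver memory
import matplotlib.pyplot as plt

pdf = new_df.toPandas()  # collect the small (date, count) result locally

plt.figure(figsize=(8, 4))
plt.bar(pdf['date'].astype(str), pdf['count'])  # x = month, y = completed studies
plt.xlabel('Month')
plt.ylabel('Completed studies')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

If you want totals per year instead of per month, you could group by F.year('date') before collecting and plot those two columns the same way.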