I am processing information in several Pandas DataFrames with 10,000+ rows.
I have…
df1, student information
Class Number Student ID 0 13530159 201733468 1 13530159 201736271 2 13530159 201833263 3 13530159 201931506 4 13530159 201933329 ...
df2, student responses
title time stu_id score 0 Unit 12 - Reading Homework 10/30/2020 22:06:53 202031164 100 1 Unit 10 - Vocabulary Homework 11/1/2020 21:07:44 202031674 100 2 Unit 10 - Vocabulary Homework 11/3/2020 17:20:55 202032311 100 3 Unit 12 - Reading Homework 11/6/2020 6:04:37 202031164 95 4 Unit 12 - Reading Homework 11/7/2020 5:49:15 202031164 90 ...
I want…
a DataFrame with columns for the class number, student ID, and unique assignment titles. The assignment columns should contain the students’ highest score for that assignment. There can be 20+ assignments / columns. A student can have many different scores for a single assignment. I only want the highest. I also want to omit scores submitted after a specific date.
df3, highest student grades
Class Number Student ID Unit 12 - Reading Homework Unit 10 - Vocabulary Homework ... 0 13530159 201733468 100 85 ... 1 13530159 201736271 95 70 ... 2 13530159 201833263 75 65 ... 3 13530159 201931506 80 85 ... 4 13530159 201933329 65 75 ... ...
What is the most efficient way? I will do this several dozen times.
PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but this is time consuming. I’m hoping there is an easier, faster way.
PPS, I’ve read similar questions: Pandas: efficient way to combine dataframes, Pandas apply a function of multiple columns, row-wise, Conditionally fill column values based on another columns value in pandas, etc. None of them specifically address my question.
Advertisement
Answer
Of course I don’t have your data, so I have to “fake” some data but this should work:
import numpy import pandas import random # Student info df_1 = pandas.DataFrame( [ {"Class Number": random.randint(13530159, 13530259), "Student ID": student_id} for student_id in range(201733468, 201735468) ] ) # Student responses df_2 = pandas.DataFrame( [ { "title": f"Unit {random.randint(1, 10)} - ...", "time": pandas.Timestamp(random.randint(1577870112, 1606814112), unit="s"), "stu_id": random.randint(201733468, 201735468), "score": random.randint(10, 100), } for _ in range(10000) ] ) # Merge the two dataframes together df = df_1.merge(df_2, left_on="Student ID", right_on="stu_id") # Create a pivot table, using the "max" as an aggregation function result = pandas.pivot_table(df, index=["Class Number", "Student ID"], columns="title", values="score", aggfunc=numpy.max).reset_index()
Output:
title Class Number Student ID Unit 1 - ... Unit 10 - ... Unit 2 - ... 0 13530159 201733485 NaN NaN NaN 1 13530159 201733705 NaN NaN 16.0 2 13530159 201734020 NaN 92.0 67.0 3 13530159 201734028 100.0 42.0 NaN 4 13530159 201734218 NaN 50.0 41.0 ... ... ... ... ... ... 1989 13530259 201734501 NaN 19.0 32.0 1990 13530259 201734760 NaN NaN NaN 1991 13530259 201734954 NaN NaN NaN 1992 13530259 201735137 NaN NaN 83.0 1993 13530259 201735266 NaN 26.0 NaN title Unit 3 - ... Unit 4 - ... Unit 5 - ... Unit 6 - ... 0 45.0 NaN NaN 39.0 1 46.0 NaN NaN NaN 2 NaN 89.0 88.0 NaN 3 NaN NaN NaN NaN 4 100.0 NaN NaN 88.0 ... ... ... ... ... 1989 NaN NaN 48.0 NaN 1990 33.0 NaN NaN NaN 1991 NaN NaN NaN 74.0 1992 NaN NaN NaN 13.0 1993 35.0 62.0 NaN 43.0 title Unit 7 - ... Unit 8 - ... Unit 9 - ... 0 NaN 65.0 65.0 1 NaN NaN NaN 2 90.0 NaN 88.0 3 NaN 16.0 92.0 4 NaN 77.0 NaN ... ... ... ... 1989 35.0 94.0 NaN 1990 34.0 NaN 45.0 1991 NaN 21.0 19.0 1992 NaN 99.0 60.0 1993 83.0 51.0 NaN [1994 rows x 12 columns]
NOTE: The output contains a lot of NaN values but that is because I’m generating data randomly. This means that not all students will have a result for all classes. If there is no result for a class the value will be NaN.