
Most efficient way to combine large Pandas DataFrames based on multiple column values

I am processing information in several Pandas DataFrames with 10,000+ rows.

I have…

df1, student information


df2, student responses


I want…

a DataFrame with columns for the class number, student ID, and unique assignment titles. The assignment columns should contain the students’ highest score for that assignment. There can be 20+ assignments / columns. A student can have many different scores for a single assignment. I only want the highest. I also want to omit scores submitted after a specific date.

df3, highest student grades


What is the most efficient way? I will do this several dozen times.

PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but this is time consuming. I’m hoping there is an easier, faster way.

PPS, I’ve read similar questions: “Pandas: efficient way to combine dataframes”, “Pandas apply a function of multiple columns, row-wise”, “Conditionally fill column values based on another columns value in pandas”, etc. None of them specifically address my question.


Answer

Of course I don’t have your data, so I had to “fake” some, but this should work:


Output:

JavaScript
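As a rough illustration of the kind of approach the question calls for, here is a minimal sketch: drop late submissions, merge in the student information, then pivot with `aggfunc="max"` so each assignment becomes a column holding the highest score. The column names (`student_id`, `class_number`, `assignment`, `score`, `submitted`), the sample rows, and the cutoff date are all assumptions made up for this example, since the real frames are not shown.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (student information) and df2 (student responses).
# Every column name and value here is invented for illustration.
df1 = pd.DataFrame({
    "student_id": [1, 2, 3],
    "class_number": [101, 101, 102],
})
df2 = pd.DataFrame({
    "student_id": [1, 1, 1, 2, 3, 3],
    "assignment": ["HW1", "HW1", "HW2", "HW1", "HW2", "HW2"],
    "score": [70, 85, 90, 60, 75, 95],
    "submitted": pd.to_datetime([
        "2021-01-05", "2021-01-10", "2021-01-12",
        "2021-01-08", "2021-01-09", "2021-02-01",
    ]),
})

# Assumed cutoff: scores submitted after this date are ignored.
cutoff = pd.Timestamp("2021-01-15")

# 1. Keep only submissions on or before the cutoff.
on_time = df2[df2["submitted"] <= cutoff]

# 2. Attach the class number to every remaining response.
merged = on_time.merge(df1, on="student_id", how="left")

# 3. One row per (class, student), one column per assignment, highest score wins.
df3 = merged.pivot_table(
    index=["class_number", "student_id"],
    columns="assignment",
    values="score",
    aggfunc="max",
).reset_index()
df3.columns.name = None  # drop the leftover "assignment" axis label

print(df3)
```

In this sketch a student with no on-time submissions at all does not appear in `df3`; merging `df1` with the pivoted result using `how="left"` would be one way to keep such students as all-NaN rows.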

NOTE: The output contains a lot of NaN values but that is because I’m generating data randomly. This means that not all students will have a result for all classes. If there is no result for a class the value will be NaN.
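To make the NaN behaviour concrete, the small sketch below pivots a few made-up scores and shows that a student/assignment pair with no score comes out as NaN. The column names and data are assumptions for illustration only, and replacing NaN with 0 is just one option, not necessarily what you want for grades.

```python
import pandas as pd

# Made-up scores: student 2 has no score for HW2.
scores = pd.DataFrame({
    "student_id": [1, 1, 2],
    "assignment": ["HW1", "HW2", "HW1"],
    "score": [85, 90, 60],
})

wide = scores.pivot_table(
    index="student_id",
    columns="assignment",
    values="score",
    aggfunc="max",
)

# Boolean mask of the cells that have no score.
missing = wide.isna()

# Optional: treat a missing score as 0 (this changes the meaning of the data).
filled = wide.fillna(0)

print(wide)
print(filled)
```

Keeping the NaN values is usually safer if you need to distinguish "never submitted" from "scored zero".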

User contributions licensed under: CC BY-SA