Most efficient way to combine large Pandas DataFrames based on multiple column values

Tags: , , , ,



I am processing information in several Pandas DataFrames with 10,000+ rows.

I have…

df1, student information

    Class Number Student ID
0   13530159     201733468
1   13530159     201736271
2   13530159     201833263
3   13530159     201931506
4   13530159     201933329
...

df2, student responses

    title                           time                stu_id      score
0   Unit 12 - Reading Homework      10/30/2020 22:06:53 202031164   100
1   Unit 10 - Vocabulary Homework   11/1/2020 21:07:44  202031674   100
2   Unit 10 - Vocabulary Homework   11/3/2020 17:20:55  202032311   100
3   Unit 12 - Reading Homework      11/6/2020 6:04:37   202031164   95
4   Unit 12 - Reading Homework      11/7/2020 5:49:15   202031164   90
...

I want…

a DataFrame with columns for the class number, student ID, and unique assignment titles. The assignment columns should contain the students’ highest score for that assignment. There can be 20+ assignments / columns. A student can have many different scores for a single assignment. I only want the highest. I also want to omit scores submitted after a specific date.

df3, highest student grades

    Class Number Student ID  Unit 12 - Reading Homework   Unit 10 - Vocabulary Homework  ...
0   13530159     201733468   100                          85                             ...              
1   13530159     201736271   95                           70                             ...
2   13530159     201833263   75                           65                             ...
3   13530159     201931506   80                           85                             ...
4   13530159     201933329   65                           75                             ...
...

What is the most efficient way? I will do this several dozen times.

PS, the DataFrames are based on 50+ Google Sheets. I could go back and compile a new DataFrame from the original sheets, but this is time consuming. I’m hoping there is an easier, faster way.

PPS, I’ve read similar questions: Pandas: efficient way to combine dataframes, Pandas apply a function of multiple columns, row-wise, Conditionally fill column values based on another columns value in pandas, etc. None of them specifically address my question.

Answer

Of course I don’t have your data, so I have to “fake” some data but this should work:

import numpy
import pandas
import random

# Student info
df_1 = pandas.DataFrame(
    [
        {"Class Number": random.randint(13530159, 13530259), "Student ID": student_id}
        for student_id in range(201733468, 201735468)
    ]
)

# Student responses
df_2 = pandas.DataFrame(
    [
        {
            "title": f"Unit {random.randint(1, 10)}  - ...",
            "time": pandas.Timestamp(random.randint(1577870112, 1606814112), unit="s"),
            "stu_id": random.randint(201733468, 201735468),
            "score": random.randint(10, 100),
        }
        for _ in range(10000)
    ]
)

# Merge the two dataframes together
df = df_1.merge(df_2, left_on="Student ID", right_on="stu_id")

# Create a pivot table, using the "max" as an aggregation function
result = pandas.pivot_table(df, index=["Class Number", "Student ID"], columns="title", values="score", aggfunc=numpy.max).reset_index()

Output:

title  Class Number  Student ID  Unit 1  - ...  Unit 10  - ...  Unit 2  - ...  
0          13530159   201733485            NaN             NaN            NaN   
1          13530159   201733705            NaN             NaN           16.0   
2          13530159   201734020            NaN            92.0           67.0   
3          13530159   201734028          100.0            42.0            NaN   
4          13530159   201734218            NaN            50.0           41.0   
...             ...         ...            ...             ...            ...   
1989       13530259   201734501            NaN            19.0           32.0   
1990       13530259   201734760            NaN             NaN            NaN   
1991       13530259   201734954            NaN             NaN            NaN   
1992       13530259   201735137            NaN             NaN           83.0   
1993       13530259   201735266            NaN            26.0            NaN   

title  Unit 3  - ...  Unit 4  - ...  Unit 5  - ...  Unit 6  - ...  
0               45.0            NaN            NaN           39.0   
1               46.0            NaN            NaN            NaN   
2                NaN           89.0           88.0            NaN   
3                NaN            NaN            NaN            NaN   
4              100.0            NaN            NaN           88.0   
...              ...            ...            ...            ...   
1989             NaN            NaN           48.0            NaN   
1990            33.0            NaN            NaN            NaN   
1991             NaN            NaN            NaN           74.0   
1992             NaN            NaN            NaN           13.0   
1993            35.0           62.0            NaN           43.0   

title  Unit 7  - ...  Unit 8  - ...  Unit 9  - ...  
0                NaN           65.0           65.0  
1                NaN            NaN            NaN  
2               90.0            NaN           88.0  
3                NaN           16.0           92.0  
4                NaN           77.0            NaN  
...              ...            ...            ...  
1989            35.0           94.0            NaN  
1990            34.0            NaN           45.0  
1991             NaN           21.0           19.0  
1992             NaN           99.0           60.0  
1993            83.0           51.0            NaN  

[1994 rows x 12 columns]

NOTE: The output contains a lot of NaN values but that is because I’m generating data randomly. This means that not all students will have a result for all classes. If there is no result for a class the value will be NaN.



Source: stackoverflow