Most efficient way to combine large Pandas DataFrames based on multiple column values

Question

I am processing information in several Pandas DataFrames with 10,000+ rows.

I have…
df1, student information
df2, student responses

I want… a DataFrame with columns for the class number, student ID, and unique assignment titles. The assignment columns should contain the students' highest sco…

Accepted Answer

Of course I don't have your data, so I have to "fake" some data, but this should work:

    import random

    import pandas

    # Student info
    df_1 = pandas.DataFrame(
        [
            {"Class Number": random.randint(13530159, 13530259), "Student ID": student_id}
            for student_id in range(201733468, 201735468)
        ]
    )

    # Student responses
    df_2 = pandas.DataFrame(
        [
            {
                "title": f"Unit {random.randint(1, 10)}  - ...",
                "time": pandas.Timestamp(random.randint(1577870112, 1606814112), unit="s"),
                "stu_id": random.randint(201733468, 201735468),
                "score": random.randint(10, 100),
            }
            for _ in range(10000)
        ]
    )

    # Merge the two dataframes together
    df = df_1.merge(df_2, left_on="Student ID", right_on="stu_id")

    # Create a pivot table, using "max" as the aggregation function
    result = pandas.pivot_table(
        df,
        index=["Class Number", "Student ID"],
        columns="title",
        values="score",
        aggfunc="max",
    ).reset_index()

Output:

    title  Class Number  Student ID  Unit 1  - ...  Unit 10  - ...  Unit 2  - ...
    0          13530159   201733485            NaN             NaN            NaN
    1          13530159   201733705            NaN             NaN           16.0
    2          13530159   201734020            NaN            92.0           67.0
    3          13530159   201734028          100.0            42.0            NaN
    4          13530159   201734218            NaN            50.0           41.0
    ...             ...         ...            ...             ...            ...
    1989       13530259   201734501            NaN            19.0           32.0
    1990       13530259   201734760            NaN             NaN            NaN
    1991       13530259   201734954            NaN             NaN            NaN
    1992       13530259   201735137            NaN             NaN           83.0
    1993       13530259   201735266            NaN            26.0            NaN

    title  Unit 3  - ...  Unit 4  - ...  Unit 5  - ...  Unit 6  - ...
    0               45.0            NaN            NaN           39.0
    1               46.0            NaN            NaN            NaN
    2                NaN           89.0           88.0            NaN
    3                NaN            NaN            NaN            NaN
    4              100.0            NaN            NaN           88.0
    ...              ...            ...            ...            ...
    1989             NaN            NaN           48.0            NaN
    1990            33.0            NaN            NaN            NaN
    1991             NaN            NaN            NaN           74.0
    1992             NaN            NaN            NaN           13.0
    1993            35.0           62.0            NaN           43.0

    title  Unit 7  - ...  Unit 8  - ...  Unit 9  - ...
    0                NaN           65.0           65.0
    1                NaN            NaN            NaN
    2               90.0            NaN           88.0
    3                NaN           16.0           92.0
    4                NaN           77.0            NaN
    ...              ...            ...            ...
    1989            35.0           94.0            NaN
    1990            34.0            NaN           45.0
    1991             NaN           21.0           19.0
    1992             NaN           99.0           60.0
    1993            83.0           51.0            NaN

    [1994 rows x 12 columns]

NOTE: The output contains a lot of NaN values, but that is because I'm generating the data randomly, so not every student has a score for every assignment title. Where a student has no score for a title, the value is NaN.
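
Because the random data above makes the result hard to check by eye, here is a minimal sketch of the same merge-then-pivot idea on a tiny hand-written dataset. The frame names `students` and `responses` and every value in them are invented for illustration; only the column names ("Class Number", "Student ID", "stu_id", "title", "score") follow the answer. The `groupby`/`unstack` variant at the end is an alternative formulation that should give the same table, not something taken from the original answer.

```python
import pandas as pd

# Hypothetical miniature versions of the two frames (all values invented).
students = pd.DataFrame(
    {
        "Class Number": [101, 101, 102],
        "Student ID": [1, 2, 3],
    }
)
responses = pd.DataFrame(
    {
        "stu_id": [1, 1, 2, 3, 3],
        "title": ["Unit 1", "Unit 1", "Unit 2", "Unit 1", "Unit 2"],
        "score": [70, 85, 60, 90, 40],
    }
)

# Inner merge on the differently named ID columns.
merged = students.merge(responses, left_on="Student ID", right_on="stu_id")

# One row per (class, student), one column per title, keeping the highest score.
best = merged.pivot_table(
    index=["Class Number", "Student ID"],
    columns="title",
    values="score",
    aggfunc="max",
).reset_index()

# Alternative formulation: group, take the max, then spread titles into columns.
alt = (
    merged.groupby(["Class Number", "Student ID", "title"])["score"]
    .max()
    .unstack("title")
    .reset_index()
)

print(best)
```

In this sketch student 1 submitted "Unit 1" twice (70 and 85), so that cell holds 85, while student 2 never submitted "Unit 1", so that cell is NaN. If a placeholder is preferred over NaN, `pivot_table` also accepts a `fill_value` argument.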

Answer