I’m trying to figure out the most efficient way to join two dataframes such as below.
I’ve tried pd.merge and maybe using the rank function but cannot seem to figure a way.
Thanks in advance
df1
| A | B | C | | -------- | -------------- |------------| | TBK1 | 2022-01-01 |2022-04-04 | | TBK1 | 2022-02-02 |2021-01-09 | | TBK3 | 2022-05-07 |2023-02-04 |
What I’m trying to achieve is this
df2
| A | B | C | D | E | | -------- | -------------- |------------|----------------|------------| | TBK1 | 2022-01-01 |2022-04-04 | 2022-02-02 |2021-01-09 | | TBK3 | 2022-05-07 |2023-02-04 |NaN |NaN |
Advertisement
Answer
You might want to use groupby
with unstack
as advised in this answer:
import pandas as pd from string import ascii_uppercase # Reproduce the data df = pd.DataFrame() df['A'] = ['TBK1','TBK1', 'TBK3'] df['B'] = ['2022-01-01' , '2022-02-02', '2022-05-07'] df['C'] = ['2022-04-04', '2021-01-09', '2023-02-04'] # Count how many rows exists per unique entry s = df.groupby(['A']).cumcount() # Unstack df1 = df.set_index(['A', s]).unstack().sort_index(level=1, axis=1) # Rename columns df1.columns = [l for l in ascii_uppercase[1:len(df1.columns)+1]] # Flatten columns names (aesthetics) df1 = df1.reset_index() print(df1) A B C D E 0 TBK1 2022-01-01 2022-04-04 2022-02-02 2021-01-09 1 TBK3 2022-05-07 2023-02-04 NaN NaN