I have two dataframes:
In df1, I have an ordered list of ids assigned to people; each person can have at most two ids:
df1:
id1   id2
2040     0
2041  2050
2042     0
2043     0
2044  2051
2045  2052
In df2, I have a list of payments and ids for these people, but not in any particular order:
df2:
  id  amount
2040      10
2040      10
2053       5
2043       5
2052      10
2045       5
What I'm looking for is a way to create a df3 that organizes the payments in the specific order of df1, taking into account the possibility of a person paying more than once. Example:
df3:
id1   id2  payment
2040     0       20
2041  2050        0
2042     0        0
2043     0        5
2044  2051        0
2045  2052       15
and a df4 with any id in df2 that doesn't exist in df1, for future correction:
df4:
  id  amount
2053       5
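For reference, the example frames above can be reproduced like this (a sketch; column names and values are taken directly from the question):

```python
import pandas as pd

# df1: each row is one person, holding up to two ids (0 = no second id)
df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})

# df2: unordered payments keyed by id; 2040 appears twice (repeat payment)
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})
```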
Answer
UPDATED: DataFrame.replace
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist()))
                            .sum(axis=1))
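Broken into intermediate steps, the chain works like this (a sketch using the question's example data):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# Step 1: total paid per id (the groupby-sum handles repeat payments)
totals = df2.groupby('id')['amount'].sum()

# Step 2: swap each id in df1 for its total, where a total exists
replaced = df1.replace(totals)

# Step 3: keep only cells whose ORIGINAL value was a paying id
# (this masks out untouched ids and the 0 placeholders), then sum
# across id1/id2 to get one payment figure per row
payment = replaced.where(df1.isin(df2['id'].tolist())).sum(axis=1)

df3 = df1.assign(payment=payment)
```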
%%timeit
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist())).sum(axis=1))
4.37 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')['amount']
                                                                   .sum()))
                            .groupby('index')['value'].sum())
8.83 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I like to use DataFrame.melt because it doesn't depend on the number of id columns in df1:
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')
                                                                   .amount
                                                                   .sum()))
                            .groupby('index')['value'].sum())
print(df3)

    id1   id2  payment
0  2040     0     20.0
1  2041  2050      0.0
2  2042     0      0.0
3  2043     0      5.0
4  2044  2051      0.0
5  2045  2052     15.0
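Step by step, the melt version reshapes df1 to long form, maps totals onto the ids, and re-aggregates per original row (a sketch using the question's example data):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# Total paid per id
totals = df2.groupby('id')['amount'].sum()

# Long form: one row per (original row index, id column) pair,
# so it works for any number of id columns in df1
long = df1.reset_index().melt('index')

# Map each id to its total (NaN for ids that never paid),
# then sum the totals back per original row index
payment = (long.assign(value=long['value'].map(totals))
               .groupby('index')['value'].sum())

df3 = df1.assign(payment=payment)
```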
And df4:
df4 = (df2.merge(df1.melt(), left_on='id', right_on='value',
                 indicator=True, how='left')
          .loc[lambda x: x._merge == 'left_only', df2.columns])
print(df4)

     id  amount
2  2053       5
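An equivalent check (a sketch, not from the original answer) filters df2 directly with isin against every id column of df1, avoiding the merge entirely:

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# Collect every id known to df1, across all of its id columns
known = set(df1.to_numpy().ravel())

# Keep only the df2 rows whose id is unknown to df1
df4 = df2[~df2['id'].isin(known)]
```

The merge-with-indicator version generalizes better when you also need columns from df1 in the result; for a pure membership test, the isin filter is simpler.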