I have two dataframes:
In df1, I have an ordered list of IDs assigned to people; each person can have at most two IDs:
df1:
id1   id2
2040     0
2041  2050
2042     0
2043     0
2044  2051
2045  2052
In df2, I have a list of payments with IDs for these people, but not arranged:
df2:
id    amount
2040      10
2040      10
2053       5
2043       5
2052      10
2045       5
What I'm looking for is a way to create a df3 that organizes the payments in the specific order of df1, taking into account the possibility of a person paying more than once. Example:
df3:
id1   id2  payment
2040     0       20
2041  2050        0
2042     0        0
2043     0        5
2044  2051        0
2045  2052       15
and a df4 containing any IDs in df2 that don't exist in df1, for future correction:
df4:
id    amount
2053       5
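For reference, the example frames above can be built like this (a minimal sketch; the columns are assumed to hold plain integers, with 0 marking a missing second ID):

```python
import pandas as pd

# df1: ordered IDs per person; 0 means "no second ID"
df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})

# df2: unordered payments keyed by ID (an ID may repeat)
df2 = pd.DataFrame({'id':     [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})
```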
Answer
UPDATED: DataFrame.replace
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist()))
                            .sum(axis=1))
%%timeit
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist()))
                            .sum(axis=1))
4.37 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
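To make the replace/where chain easier to follow, here is the same approach broken into named intermediate steps (a sketch using the example data from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# Step 1: total paid per ID (repeated payments are summed)
totals = df2.groupby('id')['amount'].sum()

# Step 2: swap each ID in df1 for its total, where one exists
replaced = df1.replace(totals)

# Step 3: mask out cells whose ID has no payment in df2
# (they still hold raw IDs), then add the two columns per row
mask = df1.isin(df2['id'].tolist())
payment = replaced.where(mask).sum(axis=1)

df3 = df1.assign(payment=payment)
```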
%%timeit
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')['amount']
                                                                   .sum()))
                            .groupby('index')['value'].sum())
8.83 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
I like to use DataFrame.melt because it doesn't depend on the number of columns in df1:
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')
                                                                   .amount
                                                                   .sum()))
                            .groupby('index')['value'].sum())
print(df3)
id1 id2 payment
0 2040 0 20.0
1 2041 2050 0.0
2 2042 0 0.0
3 2043 0 5.0
4 2044 2051 0.0
5 2045 2052 15.0
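To see why the melt version generalizes to any number of ID columns, here is the intermediate long frame it builds, written out step by step (a sketch with the same example data; `melted` is a name introduced here for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# One row per (person, ID) pair, however many ID columns df1 has
melted = df1.reset_index().melt('index')

# Map each ID to its summed payment; unmatched IDs become NaN
melted['value'] = melted['value'].map(df2.groupby('id')['amount'].sum())

# Collapse back to one payment per person (NaN counts as 0)
payment = melted.groupby('index')['value'].sum()
df3 = df1.assign(payment=payment)
```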
And df4:
df4 = (df2.merge(df1.melt(),
                 left_on='id',
                 right_on='value',
                 indicator=True,
                 how='left')
          .loc[lambda x: x._merge == 'left_only', df2.columns])
print(df4)
id amount
2 2053 5
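If only df4 is needed, an isin-based filter avoids the merge entirely (a sketch, equivalent under the assumption that df1's ID columns hold every known ID):

```python
import pandas as pd

df1 = pd.DataFrame({'id1': [2040, 2041, 2042, 2043, 2044, 2045],
                    'id2': [0, 2050, 0, 0, 2051, 2052]})
df2 = pd.DataFrame({'id': [2040, 2040, 2053, 2043, 2052, 2045],
                    'amount': [10, 10, 5, 5, 10, 5]})

# Collect every ID that appears anywhere in df1, then keep the
# df2 rows whose ID is not among them
known = df1.melt()['value']
df4 = df2[~df2['id'].isin(known)]
```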