Check for value of an dataframe exists in another and set values in a specific way accounting for duplicates

Tags: , ,



I have two dataframes

In df1, i got an order of id’s assigned to people, each person can have at most 2 id’s:

df1

id1      id2
2040     0
2041     2050
2042     0
2043     0
2044     2051
2045     2052

In df2, i got a list of payments and id’s for these people but not arranged:

df2

id      amount
2040     10
2040     10
2053     5
2043     5
2052     10
2045     5

What i’m looking for is a way to create a df3 that organizes payments in the specific order of the df1 taking in consideration the possibillity of a person paying more than once: Example:

df3

id1      id2      payment    
2040     0         20          
2041     2050      0            
2042     0         0            
2043     0         5            
2044     2051      0            
2045     2052      15           

and a df4 if there is any id in df2 that doesn’t exist in df1 for future correction

df4

id     amount
2053   5

Thanks in advance.

Answer

UPDATED : Dataframe.replace

df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist()))
                            .sum(axis=1))  

%%timeit
df3 = df1.assign(payment=df1.replace(df2.groupby('id')['amount'].sum())
                            .where(df1.isin(df2['id'].tolist())).sum(axis=1))  
4.37 ms ± 500 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')['amount']
                                                                   .sum()))
                            .groupby('index')['value'].sum())
8.83 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

I like to use DataFrame.melt because we don’t depend on the number of columns in df1

df3 = df1.assign(payment=df1.reset_index().melt('index')
                            .assign(value=lambda x: x.value.map(df2.groupby('id')
                                                                   .amount
                                                                   .sum()))
                            .groupby('index')['value'].sum())
print(df3)

    id1   id2  payment
0  2040     0     20.0
1  2041  2050      0.0
2  2042     0      0.0
3  2043     0      5.0
4  2044  2051      0.0
5  2045  2052     15.0

And df4:

df4 = df2.merge(df1.melt(), 
                left_on='id', 
                right_on='value', 
                indicator=True, 
                how='left')
    .loc[lambda x: x._merge=='left_only', df2.columns]

print(df4)

     id  amount
2  2053       5


Source: stackoverflow