I have a pandas DataFrame that contains information about meetings, and I need to find duplicate entries based on certain conditions. Let's look at a sample dataframe first.
```python
import pandas as pd

# create sample dataframe
df = pd.DataFrame({
    'meeting_id': ['m1', 'm2', 'm3', 'm4'],
    'brokers': [['JohnSmith-john@gmail.com', 'JaneSmith-jane@gmail.com'],
                ['RobSmith-rob@gmail.com', 'JaneSmith-jane@gmail.com'],
                ['RobLee-lee@gmail.com', 'JimSmith-jim@gmail.com'],
                ['RobLee-lee@gmail.com', 'KimSmith-kim@gmail.com']],
    'Start Date': ['2020-01-03', '2020-01-03', '2020-01-03', '2020-01-03']
})
```
The meeting_id is a unique identifier for a meeting, and brokers are the attendees of those meetings.
```
+----+-------------+-------------------------------------------------------+------------+
|    | meeting_id  | brokers                                               | Start Date |
+----+-------------+-------------------------------------------------------+------------+
|  0 | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com]  | 2020-01-03 |
|  1 | m2          | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]    | 2020-01-03 |
|  2 | m3          | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]        | 2020-01-03 |
|  3 | m4          | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]        | 2020-01-03 |
+----+-------------+-------------------------------------------------------+------------+
```
The catch is, brokers are not always honest, and many of them can enter the same meeting info multiple times. So what we need to do is check whether there is an overlap between the attendee lists of meetings happening on the same day. If there is, we keep only one meeting and mark the rest as duplicates. E.g. in our example, row 2 overlaps with row 1 (JaneSmith-jane@gmail.com), and row 4 overlaps with row 3 (RobLee-lee@gmail.com). So we need to mark rows 2 and 4 as duplicates and change their meeting ids to their corresponding parent meetings' ids.
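Concretely, the overlap test between two meetings boils down to a non-empty set intersection of their broker lists. A minimal sketch using the sample data (the `brokers_m1`/`brokers_m2` names are just illustrative):

```python
# minimal sketch: two meetings overlap if their broker sets intersect
brokers_m1 = ['JohnSmith-john@gmail.com', 'JaneSmith-jane@gmail.com']
brokers_m2 = ['RobSmith-rob@gmail.com', 'JaneSmith-jane@gmail.com']

has_overlap = bool(set(brokers_m1) & set(brokers_m2))
print(has_overlap)  # True -- JaneSmith-jane@gmail.com attends both
```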
I've managed to make it work, and I'll walk you through the code below. But I want to know whether any optimization is possible on my approach, because this whole logic will run inside a pandas_udf in PySpark, and my approach uses a cartesian product, which is expensive. Let's go through the code.
```python
df.loc[:, 'key'] = 0  # to be used for the cartesian product
df.loc[:, 'tmp_id'] = range(1, 1 + len(df.index))
```
The intermediate dataframe looks like:
```
+----+-------------+-------------------------------------------------------+------------+------+--------+
|    | meeting_id  | brokers                                               | Start Date | key  | tmp_id |
+----+-------------+-------------------------------------------------------+------------+------+--------+
|  0 | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com]  | 2020-01-03 |  0   |   1    |
|  1 | m2          | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]    | 2020-01-03 |  0   |   2    |
|  2 | m3          | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]        | 2020-01-03 |  0   |   3    |
|  3 | m4          | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]        | 2020-01-03 |  0   |   4    |
+----+-------------+-------------------------------------------------------+------------+------+--------+
```
Let’s calculate the cartesian product of rows, and find out the duplicates.
```python
df2 = pd.merge(df, df[['key', 'tmp_id', 'brokers']], on='key')
df2 = df2[df2['tmp_id_x'] < df2['tmp_id_y']]
df2.loc[:, 'overlap'] = df2[['brokers_x', 'brokers_y']].apply(
    lambda x: len(set(x['brokers_x']).intersection(set(x['brokers_y']))) > 0, axis=1)
```
Let’s see what we have now.
```
+-----+-------------+------------------------------------------------------+------------+------+----------+----------+-----------------------------------------------------+---------+
|     | meeting_id  | brokers_x                                            | Start Date | key  | tmp_id_x | tmp_id_y | brokers_y                                           | overlap |
+-----+-------------+------------------------------------------------------+------------+------+----------+----------+-----------------------------------------------------+---------+
|  1  | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com] | 2020-01-03 |  0   |    1     |    2     | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]  | True    |
|  2  | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com] | 2020-01-03 |  0   |    1     |    3     | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]      | False   |
|  3  | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com] | 2020-01-03 |  0   |    1     |    4     | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]      | False   |
|  6  | m2          | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]   | 2020-01-03 |  0   |    2     |    3     | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]      | False   |
|  7  | m2          | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]   | 2020-01-03 |  0   |    2     |    4     | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]      | False   |
| 11  | m3          | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]       | 2020-01-03 |  0   |    3     |    4     | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]      | True    |
+-----+-------------+------------------------------------------------------+------------+------+----------+----------+-----------------------------------------------------+---------+
```
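As an aside on the efficiency concern: the self-merge materializes all n² row pairs before the `tmp_id_x < tmp_id_y` filter cuts them down to n(n-1)/2. A hedged alternative sketch that only ever generates the upper-triangle pairs, trading vectorization for memory, and assuming (like the merge above) that the frame already holds a single day's meetings:

```python
from itertools import combinations

# sketch: build only the n*(n-1)/2 unordered row pairs instead of the
# full n*n cross join; keep just the pairs whose broker sets intersect
overlaps = [
    (x.tmp_id, x.meeting_id, y.tmp_id)
    for x, y in combinations(df.itertuples(index=False), 2)
    if set(x.brokers) & set(y.brokers)
]
# for the sample data: [(1, 'm1', 2), (3, 'm3', 4)]
```

(On newer pandas, `pd.merge(..., how='cross')` also removes the need for the dummy `key` column, though it doesn't change the quadratic blow-up.)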
Let’s modify our original dataframe to mark the duplicate rows, and modify their meeting ids.
```python
df3 = df2[df2['overlap'] == True]

# let's get a dict from a duplicate meeting's tmp_id to its parent meeting id
# e.g. {2: 'm1', 4: 'm3'}
dup_finder_dict = pd.Series(df3.meeting_id.values, index=df3.tmp_id_y).to_dict()

def get_final_id(tmp_id: int, meeting_id: str, dup_dict: dict) -> str:
    return dup_dict.get(tmp_id, meeting_id)

def get_duplicate_flag(tmp_id: int, dup_dict: dict) -> bool:
    return tmp_id in dup_dict

df.loc[:, 'meeting_id'] = df[['tmp_id', 'meeting_id']].apply(
    lambda x: get_final_id(x['tmp_id'], x['meeting_id'], dup_finder_dict), axis=1)
df.loc[:, 'is_duplicate'] = df['tmp_id'].apply(
    lambda x: get_duplicate_flag(x, dup_finder_dict))
df.drop(['key', 'tmp_id'], inplace=True, axis=1)
```
Now, the final dataframe looks like:
```
+----+-------------+-------------------------------------------------------+------------+--------------+
|    | meeting_id  | brokers                                               | Start Date | is_duplicate |
+----+-------------+-------------------------------------------------------+------------+--------------+
|  0 | m1          | [JohnSmith-john@gmail.com, JaneSmith-jane@gmail.com]  | 2020-01-03 | False        |
|  1 | m1          | [RobSmith-rob@gmail.com, JaneSmith-jane@gmail.com]    | 2020-01-03 | True         |
|  2 | m3          | [RobLee-lee@gmail.com, JimSmith-jim@gmail.com]        | 2020-01-03 | False        |
|  3 | m3          | [RobLee-lee@gmail.com, KimSmith-kim@gmail.com]        | 2020-01-03 | True         |
+----+-------------+-------------------------------------------------------+------------+--------------+
```
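One micro-optimization I'm already aware of: the two `apply` calls can be replaced with vectorized `Series.isin` / `Series.map`, avoiding a Python call per row. A sketch meant to run in place of the two `apply` lines above:

```python
# sketch: vectorized equivalents of the two apply() calls
df['is_duplicate'] = df['tmp_id'].isin(dup_finder_dict.keys())
df['meeting_id'] = df['tmp_id'].map(dup_finder_dict).fillna(df['meeting_id'])
```

That helps, but the real cost is the cartesian product itself.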
As I said, this approach works, but is there a more efficient algorithm?
Answer
- `explode()` the brokers list
- generate lists of the meetings each broker is associated with, skipping the first one (using a slice)
- have a dataframe that is the dups
- `merge()` back to get the categorisation
```python
df = pd.DataFrame({'meeting_id': ['m1', 'm2', 'm3', 'm4'],
                   'brokers': [['JohnSmith-john@gmail.com', 'JaneSmith-jane@gmail.com'],
                               ['RobSmith-rob@gmail.com', 'JaneSmith-jane@gmail.com'],
                               ['RobLee-lee@gmail.com', 'JimSmith-jim@gmail.com'],
                               ['RobLee-lee@gmail.com', 'KimSmith-kim@gmail.com']],
                   'Start Date': ['2020-01-03', '2020-01-03', '2020-01-03', '2020-01-03']})

# duplicates are where a broker appears more than once, but not the first time they appear
# hence construct the list with slice [1:]
dfdup = (df.explode("brokers")
           .groupby(["Start Date", "brokers"], as_index=False)["meeting_id"]
           .agg(lambda s: list(s.iloc[1:]))
           .explode("meeting_id").dropna()
           # no issue with the sample data, but remove the possibility of dups on merge
           .groupby(["Start Date", "meeting_id"]).first()
           .assign(duplicated=True)
        )

# bring it back together
df.merge(dfdup.drop(columns="brokers"), on=["Start Date", "meeting_id"], how="left").fillna(False)
```
|    | meeting_id | brokers                                                  | Start Date | duplicated |
|----|------------|----------------------------------------------------------|------------|------------|
| 0  | m1         | ['JohnSmith-john@gmail.com', 'JaneSmith-jane@gmail.com'] | 2020-01-03 | False      |
| 1  | m2         | ['RobSmith-rob@gmail.com', 'JaneSmith-jane@gmail.com']   | 2020-01-03 | True       |
| 2  | m3         | ['RobLee-lee@gmail.com', 'JimSmith-jim@gmail.com']       | 2020-01-03 | False      |
| 3  | m4         | ['RobLee-lee@gmail.com', 'KimSmith-kim@gmail.com']       | 2020-01-03 | True       |
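The above only flags duplicates; the question also wants each duplicate's meeting_id rewritten to its parent's id. One possible extension, an untested sketch reusing `df` and `dfdup` from above (the `first_meeting`, `dd`, `parent_id`, and `parent_map` names are mine, not from the answer): the parent of a duplicate is taken to be the first meeting, on that day, of the broker that triggered the flag.

```python
# sketch: map each duplicate meeting to its parent, i.e. the first meeting
# the triggering broker appears in on that day
first_meeting = (df.explode("brokers")
                   .groupby(["Start Date", "brokers"])["meeting_id"]
                   .first())                              # first meeting per (day, broker)

dd = dfdup.reset_index()                                  # Start Date, meeting_id, brokers, duplicated
dd["parent_id"] = first_meeting.loc[list(zip(dd["Start Date"], dd["brokers"]))].values

parent_map = dd.set_index("meeting_id")["parent_id"]      # e.g. m2 -> m1, m4 -> m3
df["meeting_id"] = df["meeting_id"].map(parent_map).fillna(df["meeting_id"])
```

On the sample data this maps m2 to m1 and m4 to m3, matching the expected output in the question.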