Skip to content
Advertisement

Are optimizations possible when finding duplicates in pandas dataframe based on array overlap?

I’ve a pandas dataframe that contains information about meetings. I need to find duplicate entries based on certain conditions. Let’s look at a sample dataframe first.

JavaScript

The meeting-id is a unique identifier for a meeting. Brokers are the attendees of these meetings.

JavaScript

The catch is, brokers are not always honest, and many of them can enter same meeting info multiple times. So, what we need to do is check if there’s an overlap in the attendees list on meetings happening on the same day. If so, we need to keep only one, and mark the rest as duplicate. E.g. in our example, row 2 has overlap (JaneSmith-jane@gmail.com) with 1, and row 4 has overlap (RobLee-lee@gmail.com) with row-3. So, we need to mark row 2 and row 4 as duplicates, and change their meeting ids to their corresponding parent meeting’s ids.

I’ve managed to make it work, and I’ll be taking you guys through the code below. But I want to know if there’s any optimization possible on my approach. It’s because this whole logic would run within pandas_udf in pyspark, and my approach uses cartesian product which is expensive. Let’s go through the code.

JavaScript

The intermediate dataframe looks like:

JavaScript

Let’s calculate the cartesian product of rows, and find out the duplicates.

JavaScript

Let’s see what we have now.

JavaScript

Let’s modify our original dataframe to mark the duplicate rows, and modify their meeting ids.

JavaScript

Now, the final dataframe looks like:

JavaScript

As I said, this approach works, but is there a more efficient algorithm?

Advertisement

Answer

  • explode() the brokers list
  • generate lists of meetings a broker is associated with, skipping first one (using a slice)
  • have a data frame that is the dups, merge() back to get the categorisation
JavaScript
meeting_id brokers Start Date duplicated
0 m1 [‘JohnSmith-john@gmail.com’, ‘JaneSmith-jane@gmail.com’] 2020-01-03 False
1 m2 [‘RobSmith-rob@gmail.com’, ‘JaneSmith-jane@gmail.com’] 2020-01-03 True
2 m3 [‘RobLee-lee@gmail.com’, ‘JimSmith-jim@gmail.com’] 2020-01-03 False
3 m4 [‘RobLee-lee@gmail.com’, ‘KimSmith-kim@gmail.com’] 2020-01-03 True
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement