Pandas – Compare each row with one another across dataframe and list the amount of duplicate values

I would like to add a column to an existing dataframe that compares every row against every other row and lists the number of duplicate values. (I don't want to remove any rows, even if a row is an exact duplicate of another.)

The duplicates column should show something like this:

Name Name1 Name2 Name3 Name4 Duplicates
Mark Doug  Jim   Tom   Alex  5
Mark Doug  Jim   Tom   Peter 4
Mark Jim   Doug  Tom   Alex  5
Josh Jesse Jim   Tom   Alex  3
Adam Cam   Max   Matt  James 0

Answer

IIUC, you can convert each row of your dataframe to a set, giving an array of sets, then use numpy broadcasting to intersect every pair of rows (ignoring the diagonal, where a row meets itself) and take the size of the largest intersection for each row:

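import numpy as np

# one set of values per row, stored in an object array
a = df.agg(set, axis=1).to_numpy()
# pairwise intersections via broadcasting: b[i, j] = a[i] & a[j]
b = a & a[:, None]
# empty the diagonal so a row is not compared with itself (len({}) == 0)
np.fill_diagonal(b, {})
# for each row, the size of its largest intersection with any other row
df['Duplicates'] = [max(map(len, x)) for x in b]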

output:

   Name  Name1 Name2 Name3  Name4  Duplicates
0  Mark   Doug   Jim   Tom   Alex           5
1  Mark   Doug   Jim   Tom  Peter           4
2  Mark    Jim  Doug   Tom   Alex           5
3  Josh  Jesse   Jim   Tom   Alex           3
4  Adam    Cam   Max  Matt  James           0
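
For comparison, here is a minimal loop-based sketch of the same idea, which may be easier to follow than the broadcasting version. It rebuilds the sample frame from the question so it runs on its own. The helper name `max_shared_values` is made up for this illustration and is not part of pandas or of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    [
        ["Mark", "Doug", "Jim", "Tom", "Alex"],
        ["Mark", "Doug", "Jim", "Tom", "Peter"],
        ["Mark", "Jim", "Doug", "Tom", "Alex"],
        ["Josh", "Jesse", "Jim", "Tom", "Alex"],
        ["Adam", "Cam", "Max", "Matt", "James"],
    ],
    columns=["Name", "Name1", "Name2", "Name3", "Name4"],
)


def max_shared_values(frame):
    """For each row, return the largest number of values shared with any other row.

    Hypothetical helper, written only to illustrate the approach.
    """
    # One set of values per row; column order is ignored on purpose.
    row_sets = [set(row) for row in frame.to_numpy()]
    counts = []
    for i, current in enumerate(row_sets):
        # Intersection sizes with every row except the row itself.
        overlaps = [
            len(current & other)
            for j, other in enumerate(row_sets)
            if j != i
        ]
        # default=0 covers a frame that has only one row.
        counts.append(max(overlaps, default=0))
    return counts


df["Duplicates"] = max_shared_values(df)
print(df)
```

Both versions compare every pair of rows, so the work grows quadratically with the number of rows. Because rows are turned into sets, a value repeated within a single row is counted once.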
User contributions licensed under: CC BY-SA