I would like to add a column to an existing dataframe that compares every row against every other row and lists the number of duplicate values. (I don’t want to remove any of the rows, even if they are entirely duplicated by another row.)
The duplicates column should show something like this:
Name  Name1  Name2  Name3  Name4  Duplicates
Mark  Doug   Jim    Tom    Alex   5
Mark  Doug   Jim    Tom    Peter  4
Mark  Jim    Doug   Tom    Alex   5
Josh  Jesse  Jim    Tom    Alex   3
Adam  Cam    Max    Matt   James  0
Answer
IIUC, you can convert each row of your dataframe to a set, then use numpy broadcasting to intersect every pair of rows (except the diagonal, where a row meets itself) and take the size of the largest intersection:
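import numpy as np

a = df.agg(set, axis=1).to_numpy()  # one set of names per row
b = a & a[:, None]                  # pairwise intersections: b[i, j] = a[i] & a[j]
np.fill_diagonal(b, set())          # ignore each row's match with itself
df['Duplicates'] = [max(map(len, x)) for x in b]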
output:
   Name  Name1  Name2  Name3  Name4  Duplicates
0  Mark  Doug   Jim    Tom    Alex   5
1  Mark  Doug   Jim    Tom    Peter  4
2  Mark  Jim    Doug   Tom    Alex   5
3  Josh  Jesse  Jim    Tom    Alex   3
4  Adam  Cam    Max    Matt   James  0
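
If you want to try this end to end, here is a minimal self-contained sketch of the same idea. The sample frame is rebuilt by hand from the table in the question, and `max_shared_values` is just a hypothetical helper name used for illustration. It uses `apply(set, axis=1)` to build the per-row sets, which should be equivalent here to the `agg` call above.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame(
    [
        ["Mark", "Doug", "Jim", "Tom", "Alex"],
        ["Mark", "Doug", "Jim", "Tom", "Peter"],
        ["Mark", "Jim", "Doug", "Tom", "Alex"],
        ["Josh", "Jesse", "Jim", "Tom", "Alex"],
        ["Adam", "Cam", "Max", "Matt", "James"],
    ],
    columns=["Name", "Name1", "Name2", "Name3", "Name4"],
)


def max_shared_values(frame):
    """Hypothetical helper: for each row, the largest number of
    values it shares with any single other row."""
    # 1-D object array holding one set per row
    sets = frame.apply(set, axis=1).to_numpy()
    # Broadcasting (n,) against (n, 1) gives an (n, n) array of intersections
    pairwise = sets & sets[:, None]
    # A row always shares everything with itself, so blank out the diagonal
    np.fill_diagonal(pairwise, set())
    return [max(map(len, row)) for row in pairwise]


df["Duplicates"] = max_shared_values(df)
print(df["Duplicates"].tolist())  # [5, 4, 5, 3, 0]
```

Note that the intersection matrix is n x n, so memory grows quadratically with the number of rows. Because rows are compared as sets, column order and repeated values within a row are ignored, which is why rows 0 and 2 both get 5.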