Skip to content
Advertisement

python pandas get distinct matches in columns

I have a dataframe which looks a bit like what this code gives:

import pandas as pd
data = {'check1':  ['a', 'a', 'b', 'd', 'f', 'f', 'g'],
        'check2': ['b', 'c', 'c', 'e', 'g', 'h', 'h']}
df = pd.DataFrame (data, columns = ['check1','check2'])

What I want to end up with is a list of lists or dataframe or something similar which tells me the distinct matches across both columns in both directions. It’d be something like this:

[['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h']]

I have tried to do it but I can’t get it to go both ways and incorporate all matches:

df.groupby('check1').apply(lambda x: x['check2'].unique()).apply(pd.Series).reset_index()

This is the closest I’ve come but it seems a bit of a hack and doesn’t do it in both directions and remove any duplicates. I didn’t know if there was a more logical / elegant way of doing it. I’m not working again till Tuesday but if anyone has any bright ideas before that would be appreciated.

Advertisement

Answer

Try think the same sub-list is a connection, so that it is more like network problem

import networkx as nx
G=nx.from_pandas_edgelist(df, 'check1', 'check2')
l=list(nx.connected_components(G))
l
Out[133]: [{'a', 'b', 'c'}, {'d', 'e'}, {'f', 'g', 'h'}]
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement