I have a dataframe which looks a bit like what this code gives:
import pandas as pd data = {'check1': ['a', 'a', 'b', 'd', 'f', 'f', 'g'], 'check2': ['b', 'c', 'c', 'e', 'g', 'h', 'h']} df = pd.DataFrame (data, columns = ['check1','check2'])
What I want to end up with is a list of lists or dataframe or something similar which tells me the distinct matches across both columns in both directions. It’d be something like this:
[['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h']]
I have tried to do it but I can’t get it to go both ways and incorporate all matches:
df.groupby('check1').apply(lambda x: x['check2'].unique()).apply(pd.Series).reset_index()
This is the closest I’ve come but it seems a bit of a hack and doesn’t do it in both directions and remove any duplicates. I didn’t know if there was a more logical / elegant way of doing it. I’m not working again till Tuesday but if anyone has any bright ideas before that would be appreciated.
Advertisement
Answer
Try think the same sub-list is a connection, so that it is more like network problem
import networkx as nx G=nx.from_pandas_edgelist(df, 'check1', 'check2') l=list(nx.connected_components(G)) l Out[133]: [{'a', 'b', 'c'}, {'d', 'e'}, {'f', 'g', 'h'}]