I have a dataframe which looks a bit like what this code gives:
import pandas as pd
data = {'check1': ['a', 'a', 'b', 'd', 'f', 'f', 'g'],
        'check2': ['b', 'c', 'c', 'e', 'g', 'h', 'h']}
df = pd.DataFrame(data, columns=['check1', 'check2'])
What I want to end up with is a list of lists or dataframe or something similar which tells me the distinct matches across both columns in both directions. It’d be something like this:
[['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h']]
I have tried to do it but I can’t get it to go both ways and incorporate all matches:
df.groupby('check1').apply(lambda x: x['check2'].unique()).apply(pd.Series).reset_index()
This is the closest I’ve come, but it feels like a hack: it doesn’t work in both directions and it doesn’t remove duplicates. Is there a more logical or elegant way of doing it? I’m not working again till Tuesday, but if anyone has any bright ideas before then, they would be appreciated.
Answer
Try treating each row as a connection (an edge) between two values, so that it becomes a network problem: each sub-list you want is a connected component of the graph.
import networkx as nx
G = nx.from_pandas_edgelist(df, 'check1', 'check2')
l = list(nx.connected_components(G))
l
Out[133]: [{'a', 'b', 'c'}, {'d', 'e'}, {'f', 'g', 'h'}]
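If you want the list-of-lists form from the question, or would rather not depend on networkx, the same connected-components idea can be sketched in plain Python with a small union-find. This is only a minimal sketch and not part of the original answer: the function name `connected_groups` is made up for illustration, and the pairs are the rows of `df` written out by hand so the snippet runs on its own.

```python
def connected_groups(pairs):
    """Group values that are linked, directly or indirectly, by any pair."""
    parent = {}

    def find(x):
        # Follow parent links up to the root, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        # Merge the groups that contain a and b.
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect every value under the root of its group.
    groups = {}
    for value in parent:
        groups.setdefault(find(value), []).append(value)
    return sorted(sorted(members) for members in groups.values())


# Same rows as df in the question, i.e. zip(df['check1'], df['check2'])
pairs = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('d', 'e'),
         ('f', 'g'), ('f', 'h'), ('g', 'h')]
result = connected_groups(pairs)
print(result)  # [['a', 'b', 'c'], ['d', 'e'], ['f', 'g', 'h']]
```

With the real DataFrame you could call `connected_groups(zip(df['check1'], df['check2']))`. If you stay with networkx, `[sorted(c) for c in l]` turns the sets into sorted lists; the sort is needed because sets have no guaranteed order.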