I have a dataset
Name System A AZ A NaN B AZ B NaN B NaN C AY C AY D AZ E AY E AY E NaN F AZ F AZ F NaN
Using this dataset, I clustered the dataset based on the number of times “System” is repeated for a particular “Name”.
In the above example, Names A, B and D have one “AZ” “Subset” while C, E have two “AY” subsets and F has two AZ so it is a different cluster. The clustered form is as follows;
Cluster Names AZ A,B AY,AY C,E AZ,AZ F
Code Used to generate Clusters:
df1 = (df.dropna(subset=['System']) .groupby('Name')['System'] .agg(','.join) .reset_index() .groupby('System')['Name'] .agg(','.join) .rename_axis('Cluster') .reset_index())
Based on this clustered output, I need different dataframes for different clusters: How can I do this?
Output Example-
Dataframe1-
Name System A AZ A NaN B AZ B NaN B NaN
Dataframe2
Name System C AY C AY E AY E AY E NaN
Dataframe 3
Name System F AZ F AZ F NaN
Advertisement
Answer
You can adapt my previous answer:
getting the clusters
clusters = (df.groupby(['Name', 'System']) ['System'].agg(Cluster=lambda x: (x.iloc[0], len(x))) .droplevel('System').reset_index() .groupby('Cluster')['Name'].agg(frozenset) .reset_index() ) # Cluster Name # 0 (AY, 2) (C, E) # 1 (AZ, 1) (A, B, D) # 2 (AZ, 2) (F)
splitting by group
groups = df['Name'].map(clusters.explode('Name').set_index('Name')['Cluster']) for _,d in df.groupby(groups): print(d) # Name System # 5 C AY # 6 C AY # 8 E AY # 9 E AY # 10 E NaN # # Name System # 0 A AZ # 1 A NaN # 2 B AZ # 3 B NaN # 4 B NaN # 7 D AZ # # Name System # 11 F AZ # 12 F AZ # 13 F NaN