I have a dataset
Name System A AZ A NaN B AZ B NaN B NaN C AY C AY D AZ E AY E AY E NaN F AZ F AZ F NaN
Using this dataset, I clustered the dataset based on the number of times “System” is repeated for a particular “Name”.
In the above example, Names A, B and D have one “AZ” “Subset” while C, E have two “AY” subsets and F has two AZ so it is a different cluster. The clustered form is as follows;
Cluster Names AZ A,B AY,AY C,E AZ,AZ F
Code Used to generate Clusters:
df1 = (df.dropna(subset=['System'])
.groupby('Name')['System']
.agg(','.join)
.reset_index()
.groupby('System')['Name']
.agg(','.join)
.rename_axis('Cluster')
.reset_index())
Based on this clustered output, I need different dataframes for different clusters: How can I do this?
Output Example-
Dataframe1-
Name System A AZ A NaN B AZ B NaN B NaN
Dataframe2
Name System C AY C AY E AY E AY E NaN
Dataframe 3
Name System F AZ F AZ F NaN
Advertisement
Answer
You can adapt my previous answer:
getting the clusters
clusters = (df.groupby(['Name', 'System'])
['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
.droplevel('System').reset_index()
.groupby('Cluster')['Name'].agg(frozenset)
.reset_index()
)
# Cluster Name
# 0 (AY, 2) (C, E)
# 1 (AZ, 1) (A, B, D)
# 2 (AZ, 2) (F)
splitting by group
groups = df['Name'].map(clusters.explode('Name').set_index('Name')['Cluster'])
for _,d in df.groupby(groups):
print(d)
# Name System
# 5 C AY
# 6 C AY
# 8 E AY
# 9 E AY
# 10 E NaN
#
# Name System
# 0 A AZ
# 1 A NaN
# 2 B AZ
# 3 B NaN
# 4 B NaN
# 7 D AZ
#
# Name System
# 11 F AZ
# 12 F AZ
# 13 F NaN