I merged three CSV NetFlow datasets (D1, D2, D3) into one large dataset (df) and applied KMeans clustering to it. I did not use pd.concat for the merge because it raised a memory error; I merged the files from the Linux terminal instead.
import pandas as pd

# D.csv was already created on a Linux machine from the terminal
df = pd.read_csv('D.csv')

# ..
# KMeans clustering
# ..

# As a result of clustering, I separated the clusters into
# one dataframe each, then created a CSV file per cluster.
cluster_0 = df[df['clusters'] == 0]
cluster_1 = df[df['clusters'] == 1]
cluster_2 = df[df['clusters'] == 2]

cluster_0.to_csv('cluster_0.csv')
cluster_1.to_csv('cluster_1.csv')
cluster_2.to_csv('cluster_2.csv')

# My goal is to find how many rows each cluster
# has in common with D1, D2, and D3.
D1 = pd.read_csv('D1.csv')
D2 = pd.read_csv('D2.csv')
D3 = pd.read_csv('D3.csv')
All of these datasets have the same 12 column names, and every column holds numerical values.
Example of the expected result:
cluster_0 has xxxx rows in common with D1, xxxxx rows in common with D2, and xxxxx rows in common with D3.
Answer
# With no `on` argument, pd.merge joins on every column the two frames share,
# so an inner merge keeps only the rows whose shared columns all match.
cluster0_D1 = pd.merge(D1, cluster_0, how='inner')
number_of_rows_D1 = len(cluster0_D1)

cluster0_D2 = pd.merge(D2, cluster_0, how='inner')
number_of_rows_D2 = len(cluster0_D2)

cluster0_D3 = pd.merge(D3, cluster_0, how='inner')
number_of_rows_D3 = len(cluster0_D3)

print("How many samples belong to D1, D2, D3 for cluster_0?")
print("D1: ", number_of_rows_D1)
print("D2: ", number_of_rows_D2)
print("D3: ", number_of_rows_D3)
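The same inner-merge idea can be repeated for every cluster and every source dataset in one pass. The sketch below is only an illustration under stated assumptions: the tiny two-column frames and the cluster labels are made up (the real data has 12 columns and labels from KMeans), and `count_shared_rows` is a hypothetical helper, not a pandas function. It also drops duplicate rows on both sides before merging, because an inner merge pairs every matching row on the left with every matching row on the right, so repeated rows would otherwise be counted more than once.

```python
import pandas as pd

# Made-up stand-ins for D1, D2, D3 (the real files have 12 numeric columns).
D1 = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
D2 = pd.DataFrame({"a": [4, 5], "b": [40, 50]})
D3 = pd.DataFrame({"a": [6], "b": [60]})

# Stand-in for the clustered dataframe; these labels are invented,
# not the output of an actual KMeans run.
df = pd.concat([D1, D2, D3], ignore_index=True)
df["clusters"] = [0, 0, 1, 1, 2, 0]


def count_shared_rows(cluster_df, source_df):
    """Hypothetical helper: count distinct rows of source_df found in cluster_df."""
    feature_cols = list(source_df.columns)
    left = source_df.drop_duplicates()
    right = cluster_df[feature_cols].drop_duplicates()
    return len(pd.merge(left, right, how="inner", on=feature_cols))


sources = {"D1": D1, "D2": D2, "D3": D3}

# One row per cluster label, one column per source dataset.
counts = pd.DataFrame(
    {
        name: {
            label: count_shared_rows(group, src)
            for label, group in df.groupby("clusters")
        }
        for name, src in sources.items()
    }
)
counts.index.name = "cluster"
print(counts)
```

Whether to deduplicate depends on what "same rows" should mean. If repeated flows should each count, remove the two `drop_duplicates()` calls, keeping in mind that the merge then multiplies the counts of rows that repeat on both sides.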