I am trying to group the rows of the following pandas DataFrame into clusters and give each cluster a name. For example, "Personal Info" is a cluster name; it consists of (PERSON, LOCATION, PHONE_NUMBER, EMAIL_ADDRESS, PASSPORT, SSN, DRIVER_LICENSE), and its total is the sum of their Counts, which is 460.
Clusters:
For reference, I am providing the cluster structure.
Input data:
Names            Counts
CREDIT_CARD          10
CRYPTO               20
DATE_TIME            28
DOMAIN_NAME          40
EMAIL_ADDRESS        45
IBAN_CODE            20
IP_ADDRESS          100
NRP                  38
LOCATION             36
PERSON               90
PHONE_NUMBER        105
BANK_NUMBER          29
DRIVER_LICENSE       45
ITIN                 38
PASSPORT             49
SSN                  90
NHS                   0
Output:
Cluster names    Total count
Personal Info    (90+36+105+45+49+90+45) = 460
Finance          (10+29+38+20) = 97
Network          (100+40) = 140
Others           (20+28) = 48
Info             (0) = 0
Answer
You can create an inverse dictionary and map:
d = {'personal_info': ['PERSON', 'LOCATION', 'PHONE_NUMBER', 'EMAIL_ADDRESS',
                       'PASSPORT', 'SSN', 'DRIVER_LICENSE'],
     'finance': ['CREDIT_CARD', 'BANK_NUMBER', 'ITIN', 'IBAN_CODE'],
     'info': ['NHS'],
     'network': ['IP_ADDRESS', 'DOMAIN_NAME'],
     'others': ['CRYPTO', 'DATE_TIME', 'NRP']
     }

# invert the mapping: individual name -> cluster name
d_inv = {x: k for k, v in d.items() for x in v}

# group the counts by each row's cluster (the names are in the 'Names' column)
(df['Counts'].groupby(df['Names'].map(d_inv)).sum()
   .rename_axis('Cluster names')     # rename to match output
   .reset_index(name='Total count')
)
Output:
   Cluster names  Total count
0        finance           97
1           info            0
2        network          140
3         others           86
4  personal_info          460
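For anyone who wants to try this end to end, here is a minimal self-contained sketch of the same inverse-dictionary approach. It assumes the input columns are called `Names` and `Counts`, as in the question; the variable names `clusters`, `name_to_cluster`, `unmapped` and `result` are only illustrative.

```python
import pandas as pd

# Input data from the question (column names assumed to be 'Names' and 'Counts')
df = pd.DataFrame({
    "Names": ["CREDIT_CARD", "CRYPTO", "DATE_TIME", "DOMAIN_NAME",
              "EMAIL_ADDRESS", "IBAN_CODE", "IP_ADDRESS", "NRP", "LOCATION",
              "PERSON", "PHONE_NUMBER", "BANK_NUMBER", "DRIVER_LICENSE",
              "ITIN", "PASSPORT", "SSN", "NHS"],
    "Counts": [10, 20, 28, 40, 45, 20, 100, 38, 36,
               90, 105, 29, 45, 38, 49, 90, 0],
})

# Cluster name -> member names (same mapping as in the answer)
clusters = {
    "personal_info": ["PERSON", "LOCATION", "PHONE_NUMBER", "EMAIL_ADDRESS",
                      "PASSPORT", "SSN", "DRIVER_LICENSE"],
    "finance": ["CREDIT_CARD", "BANK_NUMBER", "ITIN", "IBAN_CODE"],
    "info": ["NHS"],
    "network": ["IP_ADDRESS", "DOMAIN_NAME"],
    "others": ["CRYPTO", "DATE_TIME", "NRP"],
}

# Invert it: member name -> cluster name
name_to_cluster = {name: cluster
                   for cluster, names in clusters.items()
                   for name in names}

# Names missing from the mapping become NaN and are dropped by groupby,
# so it is worth checking for them before aggregating
unmapped = df.loc[df["Names"].map(name_to_cluster).isna(), "Names"]

result = (
    df["Counts"]
    .groupby(df["Names"].map(name_to_cluster))
    .sum()
    .rename_axis("Cluster names")
    .reset_index(name="Total count")
)
print(result)
```

Note that `others` comes to 86 here because the mapping puts NRP (38) in that cluster; the expected output in the question shows 48, which leaves NRP out. If NRP belongs elsewhere, moving it in `clusters` is enough.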