Skip to content
Advertisement

How to group data from a list of namedtuples

In python, I have the following data in a list of namedtuple in memory:

from collections import namedtuple
perfvalues=[]
perfitem = namedtuple('perfitem', 'cluster host database diskgroup disk read_bytes_per_sec write_bytes_per_sec avg_ms_per_read avg_ms_per_write')
item1=perfitem('cluster1', 'host1', 'database1', 'dg_data1', 'disk1', 650000, 500000, 1.2, 0.9)
item2=perfitem('cluster1', 'host1', 'database1', 'dg_data1', 'disk2', 630000, 480000, 1.1, 1)
item3=perfitem('cluster1', 'host1', 'database1', 'dg_data2', 'disk1', 730000, 250000, 0.4, 0.7)
item4=perfitem('cluster1', 'host1', 'database2', 'dg_data1', 'disk1', 320000, 400000, 1, 0.4)
item5=perfitem('cluster1', 'host1', 'database2', 'dg_data2', 'disk1', 550000, 300000, 0.8, 0.8)
item6=perfitem('cluster1', 'host2', 'database3', 'dg_data1', 'disk1', 420000, 310000, 1.2, 0.7)
item7=perfitem('cluster1', 'host2', 'database3', 'dg_data2', 'disk1', 880000, 280000, 0.4, 0.6)
item8=perfitem('cluster1', 'host2', 'database4', 'dg_data5', 'disk1', 440000, 600000, 1, 0.5)
item9=perfitem('cluster1', 'host2', 'database4', 'dg_data2', 'disk1', 490000, 450000, 1, 0.4)
item10=perfitem('cluster1', 'host2', 'database4', 'dg_data2', 'disk2', 410000, 450000, 1, 0.4)
perfvalues.append(item1)
perfvalues.append(item2)
perfvalues.append(item3)
perfvalues.append(item4)
perfvalues.append(item5)
perfvalues.append(item6)
perfvalues.append(item7)
perfvalues.append(item8)
perfvalues.append(item9)
print(perfvalues)

I want to group the data by :

  • cluster
  • cluster and host
  • cluster and host and database
  • cluster and host and database and diskgroup

I won’t need the disk details.

In each group I want to :

  • sum the values of read_bytes_per_sec and write_bytes_per_sec
  • compute the average of the values of avg_ms_per_read and avg_ms_per_write

As a result I want a list containing :

group='per_diskgroup', cluster='cluster1', host='host1', db='database1', dg='dg_data1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host1', db='database1', dg='dg_data2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host1', db='database2', dg='dg_data1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host1', db='database2', dg='dg_data2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host2', db='database3', dg='dg_data1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host2', db='database3', dg='dg_data2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host2', db='database4', dg='dg_data5', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_diskgroup', cluster='cluster1', host='host2', db='database4', dg='dg_data2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_database', cluster='cluster1', host='host1', db='database1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_database', cluster='cluster1', host='host1', db='database2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_database', cluster='cluster1', host='host2', db='database3', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_database', cluster='cluster1', host='host2', db='database4', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg 
group='per_host', cluster='cluster1', host='host1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_host', cluster='cluster1', host='host2', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg
group='per_cluster', cluster='cluster1', read_bytes_per_sec=sum, write_bytes_per_sec=sum, avg_ms_per_read=avg, avg_ms_per_write=avg

result may be a list of dict or list of namedtuple I guess. the order in the final list does not matters. I’m using python 3.8.

Any Ideas? Thanks

Advertisement

Answer

These are pandas data frames, if you need python dictionary version you’ll get it by adding .to_dict() to per_diskgroup, per_database, per_host, per_cluster. eg: per_database.to_dict()

In [2]: l = ["cluster1,host1,database1,dg_data1,disk1,650000,500000,1.2,0.9",
   ...: "cluster1,host1,database1,dg_data1,disk2,630000,480000,1.1,1",
   ...: "cluster1,host1,database1,dg_data2,disk1,730000,250000,0.4,0.7",
   ...: "cluster1,host1,database2,dg_data1,disk1,320000,400000,1,0.4",
   ...: "cluster1,host1,database2,dg_data2,disk1,550000,300000,0.8,0.8",
   ...: "cluster1,host2,database3,dg_data1,disk1,420000,310000,1.2,0.7",
   ...: "cluster1,host2,database3,dg_data2,disk1,880000,280000,0.4,0.6",
   ...: "cluster1,host2,database4,dg_data5,disk1,440000,600000,1,0.5",
   ...: "cluster1,host2,database4,dg_data2,disk1,490000,450000,1,0.4",
   ...: "cluster1,host2,database4,dg_data2,disk2,410000,450000,1,0.4"]
   ...: l = [i.split(",") for i in l]
   ...: df = pd.DataFrame(l, columns="cluster host database diskgroup disk read_bytes_per_sec write_bytes_per_sec avg_ms_per_read avg_ms_per_write
   ...: ".split())
   ...: cols = ['read_bytes_per_sec', 'write_bytes_per_sec', 'avg_ms_per_read','avg_ms_per_write']
   ...: df[cols] = df[cols].astype(float)
   ...: df
Out[2]: 
    cluster   host   database diskgroup   disk  read_bytes_per_sec  write_bytes_per_sec  avg_ms_per_read  avg_ms_per_write
0  cluster1  host1  database1  dg_data1  disk1            650000.0             500000.0              1.2               0.9
1  cluster1  host1  database1  dg_data1  disk2            630000.0             480000.0              1.1               1.0
2  cluster1  host1  database1  dg_data2  disk1            730000.0             250000.0              0.4               0.7
3  cluster1  host1  database2  dg_data1  disk1            320000.0             400000.0              1.0               0.4
4  cluster1  host1  database2  dg_data2  disk1            550000.0             300000.0              0.8               0.8
5  cluster1  host2  database3  dg_data1  disk1            420000.0             310000.0              1.2               0.7
6  cluster1  host2  database3  dg_data2  disk1            880000.0             280000.0              0.4               0.6
7  cluster1  host2  database4  dg_data5  disk1            440000.0             600000.0              1.0               0.5
8  cluster1  host2  database4  dg_data2  disk1            490000.0             450000.0              1.0               0.4
9  cluster1  host2  database4  dg_data2  disk2            410000.0             450000.0              1.0               0.4

In [3]: aggregation = {"read_bytes_per_sec" : sum, "avg_ms_per_read" : np.mean, "avg_ms_per_write" : np.mean}
   ...: per_diskgroup = df.groupby(['disk', 'cluster', 'host', 'database', 'diskgroup']).agg(aggregation)
   ...: per_database = df.groupby (['cluster', 'host', 'database']).agg(aggregation)
   ...: per_host = df.groupby     (['host', 'cluster']).agg(aggregation)
   ...: per_cluster = df.groupby  (['cluster']).agg(aggregation)

In [4]: per_diskgroup
Out[4]: 
                                          read_bytes_per_sec  avg_ms_per_read  avg_ms_per_write
disk  cluster  host  database  diskgroup                                                       
disk1 cluster1 host1 database1 dg_data1             650000.0              1.2               0.9
                               dg_data2             730000.0              0.4               0.7
                     database2 dg_data1             320000.0              1.0               0.4
                               dg_data2             550000.0              0.8               0.8
               host2 database3 dg_data1             420000.0              1.2               0.7
                               dg_data2             880000.0              0.4               0.6
                     database4 dg_data2             490000.0              1.0               0.4
                               dg_data5             440000.0              1.0               0.5
disk2 cluster1 host1 database1 dg_data1             630000.0              1.1               1.0
               host2 database4 dg_data2             410000.0              1.0               0.4

In [5]: per_database
Out[5]: 
                          read_bytes_per_sec  avg_ms_per_read  avg_ms_per_write
cluster  host  database                                                        
cluster1 host1 database1           2010000.0              0.9          0.866667
               database2            870000.0              0.9          0.600000
         host2 database3           1300000.0              0.8          0.650000
               database4           1340000.0              1.0          0.433333

In [6]: per_host
Out[6]: 
                read_bytes_per_sec  avg_ms_per_read  avg_ms_per_write
host  cluster                                                        
host1 cluster1           2880000.0             0.90              0.76
host2 cluster1           2640000.0             0.92              0.52

In [7]: per_cluster
Out[7]: 
          read_bytes_per_sec  avg_ms_per_read  avg_ms_per_write
cluster                                                        
cluster1           5520000.0             0.91              0.64
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement