I have ~200 million entries in a dictionary index_data:

```python
index_data = [
    {3396623046050748: [0, 1],
     3749192045350356: [2],
     4605074846433127: [3],
     112884719857303: [4],
     507466746864539: [5],
     ..
    }
]
```
Each key is a CustID value, and each value is a list of row indices of that CustID in df_data.
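If index_data needs to be built from df_data in the first place, pandas can produce the same mapping directly; a sketch on a tiny two-customer frame (`groupby(...).indices` returns the positional row indices per key):

```python
import pandas as pd

df_data = pd.DataFrame({
    'CustID': [3396623046050748, 3396623046050748, 3749192045350356],
    'Score': [2, 6, 1],
})

# groupby(...).indices maps each CustID to the positional row indices
index_data = {cust: idx.tolist()
              for cust, idx in df_data.groupby('CustID').indices.items()}
```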
I have a DataFrame df_data:

```python
CustID            Score  Number1  Number2  Phone
3396623046050748  2      2        3        0000
3396623046050748  6      2        3        0000
3749192045350356  1      56       23       2222
4605074846433127  67     532      321      3333
112884719857303   3      11       66       4444
507466746864539   7      22       96       5555
```
NOTE: If a CustID is duplicated, only the Score column differs between its rows.

I want to create a new list of dicts (Total_Score is the average Score of each CustID, and Number is Number2 divided by Number1):
```python
result = [
    {'CustID': 3396623046050748,
     'Total_Score': 4,
     'Number': 1.5,
     'Phone': 0000
    },
    {'CustID': 3749192045350356,
     'Total_Score': 1,
     'Number': 0.41,
     'Phone': 2222
    },
    {'CustID': 4605074846433127,
     'Total_Score': 67,
     'Number': 0.6,
     'Phone': 3333
    },
]
```
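As a sanity check on the expected output: for CustID 3396623046050748 the average of the two Score values 2 and 6 is 4, and Number is 3/2 = 1.5:

```python
scores = [2, 6]
total_score = sum(scores) / len(scores)  # average Score per CustID
number = round(3 / 2, 2)                 # Number2 divided by Number1
print(total_score, number)               # 4.0 1.5
```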
My solution is to loop over the dictionary and use multiprocessing (`from multiprocessing import Process, Manager`):
```python
from multiprocessing import Process, Manager

def calculateTime(ns, value):
    # get a copy of the shared data in each process
    df_data2 = ns.df_data
    result2 = ns.result

    # Create a new DF from the indices and the old DF
    df_sampleresult = df_data2.loc[value].reset_index(drop=True)

    # build a dict with the data to append to the final result
    dict_sample = {}
    dict_sample['CustID'] = df_sampleresult['CustID'][0]
    dict_sample['Total_Score'] = df_sampleresult['Score'].mean()

    result2.append(dict_sample)
    ns.result = result2


if __name__ == '__main__':
    result = list()
    manager = Manager()
    ns = manager.Namespace()
    ns.df_data = df_data
    ns.result = result

    job = [Process(target=calculateTime, args=(ns, value))
           for key, value in index_data.items()]
    _ = [p.start() for p in job]
    _ = [p.join() for p in job]
```
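For reference, what each worker computes for a single key can be reproduced sequentially; a sketch on a tiny two-row frame, with names following the question:

```python
import pandas as pd

df_data = pd.DataFrame({
    'CustID': [3396623046050748, 3396623046050748],
    'Score': [2, 6],
})

value = [0, 1]  # row indices for this CustID, as stored in index_data
df_sampleresult = df_data.loc[value].reset_index(drop=True)

dict_sample = {
    'CustID': df_sampleresult['CustID'][0],
    'Total_Score': df_sampleresult['Score'].mean(),  # average Score
}
```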
But it's not working: performance is slow and memory usage is high. Is my multiprocessing setup right? Is there another way to do this?
Answer
```python
In [353]: df
Out[353]:
             CustID  Score  Number1  Number2 Phone
0  3396623046050748      2        2        3  0000
1  3396623046050748      6        2        3  0000
2  3749192045350356      1       56       23  2222
3  4605074846433127     67      532      321  3333
4   112884719857303      3       11       66  4444
5   507466746864539      7       22       96  5555

In [351]: d = df.groupby(['CustID', 'Phone', round(df.Number2.div(df.Number1), 2)])['Score'].mean().reset_index(name='Total_Score').rename(columns={'level_2': 'Number'}).to_dict('records')

In [352]: d
Out[352]:
[{'CustID': 112884719857303, 'Phone': 4444, 'Number': 6.0, 'Total_Score': 3},
 {'CustID': 507466746864539, 'Phone': 5555, 'Number': 4.36, 'Total_Score': 7},
 {'CustID': 3396623046050748, 'Phone': 0000, 'Number': 1.5, 'Total_Score': 4},
 {'CustID': 3749192045350356, 'Phone': 2222, 'Number': 0.41, 'Total_Score': 1},
 {'CustID': 4605074846433127, 'Phone': 3333, 'Number': 0.6, 'Total_Score': 67}]
```
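The one-liner above can be unpacked into a self-contained script; a sketch, with Phone stored as strings here (an assumption, so leading zeros survive). Since Number2/Number1 is constant within a CustID, the rounded ratio can simply join the group keys; the unnamed Series becomes `level_2` after `reset_index`, which is then renamed to `Number`:

```python
import pandas as pd

df = pd.DataFrame({
    'CustID':  [3396623046050748, 3396623046050748, 3749192045350356,
                4605074846433127, 112884719857303, 507466746864539],
    'Score':   [2, 6, 1, 67, 3, 7],
    'Number1': [2, 2, 56, 532, 11, 22],
    'Number2': [3, 3, 23, 321, 66, 96],
    'Phone':   ['0000', '0000', '2222', '3333', '4444', '5555'],
})

# Number2 / Number1 is constant per CustID, so it can serve as a group key
number = round(df.Number2.div(df.Number1), 2)

result = (df.groupby(['CustID', 'Phone', number])['Score']
            .mean()
            .reset_index(name='Total_Score')
            .rename(columns={'level_2': 'Number'})
            .to_dict('records'))
```

This replaces the per-CustID process loop with a single vectorized groupby, which avoids both the process-spawning overhead and the repeated pickling of the DataFrame through the Manager namespace.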