I am new to groupby methods in pandas and can't seem to wrap my head around them. I have data with ~2M records, and my current code would take 4 days to execute because of its inefficient use of `append`.
I am analyzing manufacturing data with 2 flags that indicate problems with the test specimens. The first few flags of each Test_ID should be set to False. (Reason: there is not enough data to accurately analyze the first few rows of each group.)
My inefficient attempt (right result, but not fast enough for 2M rows):
```python
import pandas as pd

df = pd.DataFrame({'Test_ID' : ['foo', 'foo', 'foo', 'foo',
                                'bar', 'bar', 'bar'],
                   'TEST_Date' : ['2020-01-09 09:49:31',
                                  '2020-01-09 12:16:15',
                                  '2020-01-09 12:47:44',
                                  '2020-01-09 14:39:05',
                                  '2020-01-09 17:39:47',
                                  '2020-01-09 20:44:58',
                                  '2020-01-10 18:40:47'],
                   'Flag1' : [True, False, True, False, True, False, False],
                   'Flag2' : [True, False, False, False, True, False, False],
                  })

# generate a list of Test_IDs
Test_IDs = list(df['Test_ID'].unique())

# generate a list of columns in the dataframe
cols = list(df)

# generate a new dataframe with the same columns as the original
df_output = pd.DataFrame(columns=cols)

for i in Test_IDs:
    # split the data into groups, iterate over each group
    df_2 = df[df['Test_ID'] == i].copy()

    # set the first two rows of Flag1 to False for each group
    df_2.iloc[:2, df_2.columns.get_loc('Flag1')] = False

    # set the first three rows of Flag2 to False for each group
    df_2.iloc[:3, df_2.columns.get_loc('Flag2')] = False

    # add the latest group onto the output df
    # (DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0)
    df_output = df_output.append(df_2)

print(df_output)
```
Input:
```
   Flag1  Flag2            TEST_Date Test_ID
0   True   True  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4   True   True  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar
```
Output:
```
   Flag1  Flag2            TEST_Date Test_ID
0  False  False  2020-01-09 09:49:31     foo
1  False  False  2020-01-09 12:16:15     foo
2   True  False  2020-01-09 12:47:44     foo
3  False  False  2020-01-09 14:39:05     foo
4  False  False  2020-01-09 17:39:47     bar
5  False  False  2020-01-09 20:44:58     bar
6  False  False  2020-01-10 18:40:47     bar
```
Answer
Let's do `groupby().cumcount()`:
```python
# enumeration of rows within each `Test_ID`
enum = df.groupby('Test_ID').cumcount()

# overwrite the Flags
df.loc[enum < 2, 'Flag1'] = False
df.loc[enum < 3, 'Flag2'] = False
```
Output:
```
  Test_ID            TEST_Date  Flag1  Flag2
0     foo  2020-01-09 09:49:31  False  False
1     foo  2020-01-09 12:16:15  False  False
2     foo  2020-01-09 12:47:44   True  False
3     foo  2020-01-09 14:39:05  False  False
4     bar  2020-01-09 17:39:47  False  False
5     bar  2020-01-09 20:44:58  False  False
6     bar  2020-01-10 18:40:47  False  False
```
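`cumcount()` numbers the rows of each group in the order they currently appear, starting at 0, so "the first few" means the first few in the frame's existing row order. If the data might not already be in chronological order within each Test_ID, it may be safer to sort by date first. Below is a minimal sketch of that idea, wrapped in a helper that leaves the original frame untouched. The function name `clear_leading_flags` and its parameters `limits`, `group_col`, and `order_col` are made up for this illustration, and the sketch assumes the frame has a unique index.

```python
import pandas as pd

df = pd.DataFrame({
    "Test_ID": ["foo", "foo", "foo", "foo", "bar", "bar", "bar"],
    "TEST_Date": pd.to_datetime([
        "2020-01-09 09:49:31", "2020-01-09 12:16:15", "2020-01-09 12:47:44",
        "2020-01-09 14:39:05", "2020-01-09 17:39:47", "2020-01-09 20:44:58",
        "2020-01-10 18:40:47",
    ]),
    "Flag1": [True, False, True, False, True, False, False],
    "Flag2": [True, False, False, False, True, False, False],
})


def clear_leading_flags(frame, limits, group_col="Test_ID", order_col=None):
    """Hypothetical helper: return a copy of `frame` in which, for each flag
    column in `limits`, the first n rows of every group are set to False.

    `limits` maps a flag column name to the number of leading rows to clear.
    Assumes `frame` has a unique index.
    """
    out = frame.copy()
    if order_col is not None:
        # Stable sort, so rows with equal dates keep their original order.
        out = out.sort_values(order_col, kind="stable")

    # 0-based position of each row within its group, in the current row order.
    position = out.groupby(group_col).cumcount()

    for flag_col, n in limits.items():
        # Positional boolean mask: True for the first n rows of each group.
        out.loc[(position < n).to_numpy(), flag_col] = False

    # Put the rows back in index order.
    return out.sort_index()


# Row positions within each Test_ID for the sample data.
print(df.groupby("Test_ID").cumcount().tolist())  # [0, 1, 2, 3, 0, 1, 2]

result = clear_leading_flags(df, {"Flag1": 2, "Flag2": 3}, order_col="TEST_Date")
print(result)
```

Both versions do one grouped pass and a couple of vectorized assignments, so there is no per-group Python loop and no repeated `append`.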