I have a huge pandas dataframe (actually has 5M rows):
JavaScript
x
13
13
1
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
2
'opn1': [20180301, 20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201],
3
'opn2': [20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101],
4
'opn3': [20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201],
5
'opn4': [20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301],
6
'opn5': [20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301, 20190401],
7
'cls1': [0, 20180520, 0, 0, 0, 0, 0, 0, 0, 0],
8
'cls2': [0, 0, 0, 0, 20181031, 0, 0, 0, 0, 0],
9
'cls3': [0, 0, 20180725, 0, 20180701, 0, 0, 0, 0, 0],
10
'cls4': [0, 0, 0, 0, 0, 0, 20190101, 0, 0, 0],
11
'cls5': [0, 20180731, 0, 0, 0, 0, 0, 0, 0, 20190510],
12
})
13
My requirement is to create a flag with value as 1 if any of the close dates cls1..5
is < the corresponding open date opn1..5
Example output: for id==2, cls5 < opn5
I want to avoid a loop and run it as fast as possible. There are ~5M rows in my data.
Perhaps np.where
with a combination of any
?
Advertisement
Answer
You can try using np.any
with axis=0
. First create the condition and then apply it on the relevant columns:
JavaScript
1
3
1
cond = lambda i: ((df['cls' + str(i)] != 0) & (df['cls' + str(i)] < df['opn' + str(i)])).values
2
df['flag'] = np.any([cond(i) for i in range(1,6)], axis=0).astype(int)
3
This will flag the rows with id=2
and id=5
.