I have a huge pandas dataframe (actually has 5M rows):
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'opn1': [20180301, 20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201], 'opn2': [20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101], 'opn3': [20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201], 'opn4': [20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301], 'opn5': [20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301, 20190401], 'cls1': [0, 20180520, 0, 0, 0, 0, 0, 0, 0, 0], 'cls2': [0, 0, 0, 0, 20181031, 0, 0, 0, 0, 0], 'cls3': [0, 0, 20180725, 0, 20180701, 0, 0, 0, 0, 0], 'cls4': [0, 0, 0, 0, 0, 0, 20190101, 0, 0, 0], 'cls5': [0, 20180731, 0, 0, 0, 0, 0, 0, 0, 20190510], })
My requirement is to create a flag with value as 1 if any of the close dates cls1..5
is < the corresponding open date opn1..5
Example output: for id==2, cls5 < opn5
I want to avoid a loop and run it as fast as possible. There are ~5M rows in my data.
Perhaps np.where
with a combination of any
?
Advertisement
Answer
You can try using np.any
with axis=0
. First create the condition and then apply it on the relevant columns:
cond = lambda i: ((df['cls' + str(i)] != 0) & (df['cls' + str(i)] < df['opn' + str(i)])).values df['flag'] = np.any([cond(i) for i in range(1,6)], axis=0).astype(int)
This will flag the rows with id=2
and id=5
.