Pandas create flag if any condition fails

I have a huge pandas dataframe (actually has 5M rows):

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'opn1': [20180301, 20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201],
                   'opn2': [20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101],
                   'opn3': [20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201],
                   'opn4': [20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301],
                   'opn5': [20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301, 20190401],
                   'cls1': [0, 20180520, 0, 0, 0, 0, 0, 0, 0, 0],
                   'cls2': [0, 0, 0, 0, 20181031, 0, 0, 0, 0, 0],
                   'cls3': [0, 0, 20180725, 0, 20180701, 0, 0, 0, 0, 0],
                   'cls4': [0, 0, 0, 0, 0, 0, 20190101, 0, 0, 0],
                   'cls5': [0, 20180731, 0, 0, 0, 0, 0, 0, 0, 20190510],
                   })

JavaScript
​x
 
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'opn1': [20180301, 20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201],
                   'opn2': [20180401, 20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101],
                   'opn3': [20180501, 20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201],
                   'opn4': [20180601, 20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301],
                   'opn5': [20180701, 20180801, 20180901, 20181001, 20181101, 20181201, 20190101, 20190201, 20190301, 20190401],
                   'cls1': [0, 20180520, 0, 0, 0, 0, 0, 0, 0, 0],
                   'cls2': [0, 0, 0, 0, 20181031, 0, 0, 0, 0, 0],
                   'cls3': [0, 0, 20180725, 0, 20180701, 0, 0, 0, 0, 0],
                   'cls4': [0, 0, 0, 0, 0, 0, 20190101, 0, 0, 0],
                   'cls5': [0, 20180731, 0, 0, 0, 0, 0, 0, 0, 20190510],
                   })
​

My requirement is to create a flag with value as 1 if any of the close dates cls1..5 is < the corresponding open date opn1..5

Example output: for id==2, cls5 < opn5

I want to avoid a loop and run it as fast as possible. There are ~5M rows in my data.

Perhaps np.where with a combination of any?

Answer

You can try using np.any with axis=0. First create the condition and then apply it on the relevant columns:

cond = lambda i: ((df['cls' + str(i)] != 0) & (df['cls' + str(i)] < df['opn' + str(i)])).values
df['flag'] = np.any([cond(i) for i in range(1,6)], axis=0).astype(int)

JavaScript
 
cond = lambda i: ((df['cls' + str(i)] != 0) & (df['cls' + str(i)] < df['opn' + str(i)])).values
df['flag'] = np.any([cond(i) for i in range(1,6)], axis=0).astype(int)
​

This will flag the rows with id=2 and id=5.

Advertisement

Answer