Take the following table as an example:
| index | column_1 | column_2 |
|---|---|---|
| 0 | bli bli | d e |
| 1 | bla bla | a b c d e |
| 2 | ble ble | a b c |
Given a token_list = ['c', 'e'], I want to sort the table by how many of these tokens each row contains in column_2. After sorting, I should get the following:
| index | column_1 | column_2 | score_tmp |
|---|---|---|---|
| 1 | bla bla | a b c d e | 2 |
| 0 | bli bli | d e | 1 |
| 2 | ble ble | a b c | 1 |
Currently I do it the following way, but it takes a lot of time. How could I make it faster? Thank you in advance.
```python
df['score_tmp'] = df[['column_2']].apply(
    lambda x: len([True for token in token_list if token in str(x['column_2'])]),
    axis=1)
results = (df.sort_values('score_tmp', ascending=False)
             .loc[df['score_tmp'] == len(token_list)]
             .reset_index(inplace=False)
             .to_dict('records'))
```
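For reference, a minimal sketch to reproduce the sample data above (the index column is assumed here to be a regular column, as it appears in the tables; adjust if it is actually the DataFrame index):

```python
import pandas as pd

# Sample data taken from the tables in the question
df = pd.DataFrame({
    'index': [0, 1, 2],
    'column_1': ['bli bli', 'bla bla', 'ble ble'],
    'column_2': ['d e', 'a b c d e', 'a b c'],
})
token_list = ['c', 'e']
```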
Answer
You can split column_2 on whitespace, convert each row into a set, take the intersection with token_list inside apply, and then sort with sort_values:
```
In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

In [204]: df.sort_values('matches', ascending=False).drop('matches', axis=1)
Out[204]:
   index column_1   column_2
1      1  bla bla  a b c d e
0      0  bli bli        d e
2      2  ble ble      a b c
```
Timings:
```
In [208]: def f1():
     ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
     ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
     ...:

In [209]: def f2():
     ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
     ...:     df.sort_values('matches', ascending=False).drop('matches', axis=1)
     ...:

In [210]: %timeit f1()  # solution provided in question
2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [211]: %timeit f2()  # my solution
1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
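As a possible further micro-optimization (my own sketch, not part of the timings above): building the token set once outside the lambda avoids reconstructing it for every row, which can matter on larger frames:

```python
# Hypothetical variation: build the token set a single time
token_set = set(token_list)
df['matches'] = df.column_2.str.split().apply(lambda x: len(token_set & set(x)))
results = df.sort_values('matches', ascending=False)
```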