Take the following table as an example:
index | column_1 | column_2 |
---|---|---|
0 | bli bli | d e |
1 | bla bla | a b c d e |
2 | ble ble | a b c |
Given token_list = ['c', 'e'],
I want to order the rows by how many of those tokens each row contains in column_2.
After ordering, I should get the following:
index | column_1 | column_2 | score_tmp |
---|---|---|---|
1 | bla bla | a b c d e | 2 |
0 | bli bli | d e | 1 |
2 | ble ble | a b c | 1 |
This is how I currently do it, but it takes a lot of time. How can I make it faster? Thank you in advance.
    df['score_tmp'] = df[['column_2']].apply(
        lambda x: len([True for token in token_list if
                       token in str(x['column_2'])]), axis=1)
    results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
Answer
You can split column_2 on whitespace, convert each row into a set, use apply with a set intersection against token_list to count the matches, and then order the rows with sort_values:
    In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

    In [204]: df.sort_values('matches', ascending=False).drop(columns='matches')
    Out[204]:
       index column_1   column_2
    1      1  bla bla  a b c d e
    0      0  bli bli        d e
    2      2  ble ble      a b c
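For anyone who wants to run this outside an interactive session, here is a minimal self-contained sketch of the same idea. It makes a few assumptions that are not in the original post: the DataFrame is rebuilt from the question's table with the default integer index (no separate index column), the names token_set and count_matches are introduced only for illustration, and kind="mergesort" is passed so that rows with equal scores keep their original relative order.

```python
import pandas as pd

# Example data rebuilt from the question's table (default integer index).
df = pd.DataFrame({
    "column_1": ["bli bli", "bla bla", "ble ble"],
    "column_2": ["d e", "a b c d e", "a b c"],
})
token_list = ["c", "e"]

# Build the token set once instead of once per row.
token_set = set(token_list)


def count_matches(text):
    """Count how many distinct tokens from token_set occur as whole words in text."""
    return len(token_set.intersection(str(text).split()))


df["score_tmp"] = df["column_2"].map(count_matches)

# mergesort is a stable sort, so ties (rows 0 and 2) stay in their original order.
ranked = df.sort_values("score_tmp", ascending=False, kind="mergesort")
print(ranked)
```

Building the set once outside the lambda avoids recreating set(token_list) for every row. Whether that makes a measurable difference depends on the size of the data, so it is worth timing on your own DataFrame.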
Timings:
    In [208]: def f1():
         ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
         ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
         ...:

    In [209]: def f2():
         ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
         ...:     df.sort_values('matches', ascending=False).drop(columns='matches')
         ...:

    In [210]: %timeit f1() # solution provided in question
    2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

    In [211]: %timeit f2() # my solution
    1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
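One caveat when comparing the two functions: they do not count matches in exactly the same way. The code in the question tests `token in str(...)`, which is a substring check, while the set intersection only counts whole whitespace-separated tokens. The question's code also keeps only the rows whose score equals len(token_list). The small sketch below illustrates the first difference; the string "abc de" is a made-up value chosen for illustration and is not part of the original data.

```python
token_list = ["c", "e"]
text = "abc de"  # hypothetical cell value, not from the question's table

# Substring check, as in the question's code: "c" is found inside "abc"
# and "e" is found inside "de".
substring_score = sum(token in text for token in token_list)

# Whole-token check, as in the set-intersection approach: neither "abc"
# nor "de" equals "c" or "e".
whole_token_score = len(set(text.split()) & set(token_list))

print(substring_score, whole_token_score)
```

On the question's example table both approaches give the same scores because every token there is a single whitespace-separated character, but on real text they can diverge, so pick whichever matches the behaviour you actually need.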