Take the following table as an example:
| index | column_1 | column_2 |
|---|---|---|
| 0 | bli bli | d e |
| 1 | bla bla | a b c d e |
| 2 | ble ble | a b c |
Given a token_list = ['c', 'e'], I want to sort the table by how many of these tokens each row contains in column_2. After sorting, I should get the following:
| index | column_1 | column_2 | score_tmp |
|---|---|---|---|
| 1 | bla bla | a b c d e | 2 |
| 0 | bli bli | d e | 1 |
| 2 | ble ble | a b c | 1 |
Currently I do it the following way, but it takes a lot of time. How could I make it faster? Thank you in advance.
```python
df['score_tmp'] = df[['column_2']].apply(
    lambda x: len([True for token in token_list if token in str(x['column_2'])]),
    axis=1)
results = (df.sort_values('score_tmp', ascending=False)
             .loc[df['score_tmp'] == len(token_list)]
             .reset_index(inplace=False)
             .to_dict('records'))
```
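For reference, a minimal sketch to reproduce the sample data above (the index column is assumed here to be a regular column, as it appears in the tables; adjust if it is actually the DataFrame index):

```python
import pandas as pd

# Sample data taken from the tables in the question
df = pd.DataFrame({
    'index': [0, 1, 2],
    'column_1': ['bli bli', 'bla bla', 'ble ble'],
    'column_2': ['d e', 'a b c d e', 'a b c'],
})
token_list = ['c', 'e']
```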
Answer
You can split column_2 on whitespace, convert each row into a set, take the intersection with token_list inside apply, and then sort with sort_values:
```
In [200]: df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()

In [204]: df.sort_values('matches', ascending=False).drop('matches', axis=1)
Out[204]:
   index column_1   column_2
1      1  bla bla  a b c d e
0      0  bli bli        d e
2      2  ble ble      a b c
```
Timings:
```
In [208]: def f1():
     ...:     df['score_tmp'] = df[['column_2']].apply(lambda x: len([True for token in token_list if token in str(x['column_2'])]), axis=1)
     ...:     results = df.sort_values('score_tmp', ascending=False).loc[df['score_tmp'] == len(token_list)].reset_index(inplace=False).to_dict('records')
     ...:

In [209]: def f2():
     ...:     df['matches'] = df.column_2.str.split().apply(lambda x: set(x) & set(token_list)).str.len()
     ...:     df.sort_values('matches', ascending=False).drop('matches', axis=1)
     ...:

In [210]: %timeit f1()  # solution provided in question
2.36 ms ± 55.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [211]: %timeit f2()  # my solution
1.22 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
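As a possible further micro-optimization (my own sketch, not part of the timings above): building the token set once outside the lambda avoids reconstructing it for every row, which can matter on larger frames:

```python
# Hypothetical variation: build the token set a single time
token_set = set(token_list)
df['matches'] = df.column_2.str.split().apply(lambda x: len(token_set & set(x)))
results = df.sort_values('matches', ascending=False)
```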