
Extracting 2-gram strings from a column that are present in a list

I have a dataframe called df

JavaScript

and a list called algorithms

JavaScript

For each row of the Comments column, I’m trying to extract the words that appear in the algorithms list. This is what I’m trying to achieve:

JavaScript

However, this is what I’m getting

JavaScript

Terms like machine learning and fraud detection don’t appear. Basically, none of the 2-gram terms are extracted.

This is the code I used

JavaScript


Answer

You can use pandas.Series.str.findall in combination with join.

JavaScript

We use join to concatenate the strings in your ml list with a | between each one, which builds a regex alternation (value 1 OR value 2, and so on). Then we use findall to find all occurrences of that pattern in each row.
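As a minimal sketch of the approach described above (the sample rows and the contents of `ml` here are made up for illustration, since the original data is not shown), it could look roughly like this. The `re.escape` call is an extra precaution that is not mentioned in the answer; it only matters if a term contains regex metacharacters.

```python
import re

import pandas as pd

# Hypothetical sample data standing in for the asker's df and list.
df = pd.DataFrame({"Comments": [
    "we used machine learning for fraud detection",
    "a simple regression model",
    "nothing relevant here",
]})
ml = ["machine learning", "fraud detection", "regression"]

# Escape each term, then join with | to build "(term1|term2|...)".
escaped = [re.escape(term) for term in ml]
pattern = f"({'|'.join(escaped)})"

# findall returns, for each row, the list of every match of the pattern.
df["matches"] = df["Comments"].str.findall(pattern)
print(df["matches"].tolist())
```

Because the pattern is matched against the whole string rather than against individual tokens, multi-word terms such as "machine learning" are found as well.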

Please note that it uses an f-string, so you’ll need Python 3.6+. Let me know if you have a lower version of Python.
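For older Python versions, one possible way to build the same kind of pattern without an f-string is `str.format`. This is only a sketch with a made-up `ml` list, shown with the standard `re` module so it runs without pandas; the resulting `pattern` string could be passed to `str.findall` in the same way.

```python
import re

# Hypothetical list of terms; replace with your own.
ml = ["machine learning", "fraud detection", "regression"]

# Same "(term1|term2|...)" pattern, built with str.format instead of an f-string.
pattern = "({})".format("|".join(re.escape(term) for term in ml))

sample = "machine learning helps with fraud detection"
print(re.findall(pattern, sample))
```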


For anyone interested in benchmarks (since we have 3 answers): running each solution on 9.6M rows, 10 times in a row, gives the following results:

  • AlexK:
    • mean: 14.94 sec
    • min: 12.43 sec
    • max: 17.08 sec
  • Teddy:
    • mean: 22.67 sec
    • min: 18.25 sec
    • max: 27.64 sec
  • AbsoluteSpace:
    • mean: 24.12 sec
    • min: 21.25 sec
    • max: 27.53 sec
User contributions licensed under: CC BY-SA