
Extracting 2gram strings from a column present in a list

I have a dataframe called df

Gender  Country      Comments
male    USA        machine learning and fraud detection are a must learn
male    Canada     monte carlo method is great and so is hmm,pca, svm and neural net
female  USA        clustering and cloud computing
female  Germany    logistical regression and data management and fraud detection
female  Nigeria    nltk and supervised machine learning
male    Ghana      financial engineering and cross validation and time series

and a list called algorithms

algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']

So technically, for each row of the Comments column, I’m trying to extract words that appear in the algorithms list. This is what I’m trying to achieve

Gender  Country      algorithms
male    USA        machine learning, fraud detection 
male    Canada     monte carlo method, hmm,pca, svm, neural net
female  USA        clustering, cloud computing
female  Germany    logistical regression, data management, fraud detection
female  Nigeria    nltk, supervised machine learning
male    Ghana      financial engineering, cross validation, time series

However, this is what I’m getting

Gender  Country      algorithms
male    USA         
male    Canada     hmm pca svm  
female  USA        clustering
female  Germany    
female  Nigeria    nltk
male    Ghana      

Phrases like machine learning and fraud detection don't appear. Basically, every multi-word phrase (all the 2-grams and longer) is missing.

This is the code I used

df['algorithms'] = df['Comments'].apply(lambda x: " ".join(x for x in x.split() if x in algorithms)) 

Answer

You can use pandas.Series.str.findall in combination with join.

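import pandas as pd
import re

# ml is the list of phrases to look for (the question's algorithms list);
# df.algo is the text column (the question's Comments column).
# Joining the phrases with | builds one regex alternation for findall.
df['algo_new'] = df.algo.str.findall(f"({ '|'.join(ml) })")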

>> out

    col1    gender  algo                                                algo_new
0   usa     male    machine learning and fraud detection are a mus...   [machine learning, fraud detection, clustering]
1   fr      female  monte carlo method is great and so is hmm,pca,...   [monte carlo method]
2   arg     male    logistical regression and data management and ...   [logistical regression, data management, fraud..

We use join to combine the strings in your ml list, putting a | between them so the pattern matches value 1 OR value 2, etc. Then we use findall to find all occurrences in each row.
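Here is a minimal, self-contained sketch of the same join-plus-findall idea applied to the question's own df and algorithms list. A few things in it are my additions, not part of the answer above: re.escape (in case a phrase contains regex metacharacters), the \b word boundaries (so a phrase is not matched inside a longer word), sorting the phrases longest first (only matters if one phrase is a prefix of another), and the final .str.join(", ") to get a string instead of a list. Note that hmm, pca, svm and neural net are not in the algorithms list, so they are not extracted.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "Gender": ["male", "male", "female", "female", "female", "male"],
    "Country": ["USA", "Canada", "USA", "Germany", "Nigeria", "Ghana"],
    "Comments": [
        "machine learning and fraud detection are a must learn",
        "monte carlo method is great and so is hmm,pca, svm and neural net",
        "clustering and cloud computing",
        "logistical regression and data management and fraud detection",
        "nltk and supervised machine learning",
        "financial engineering and cross validation and time series",
    ],
})

algorithms = [
    "machine learning", "fraud detection", "monte carlo method",
    "time series", "cross validation", "supervised machine learning",
    "logistical regression", "nltk", "clustering", "data management",
    "cloud computing", "financial engineering",
]

# Assumption: try longer phrases first, so that if one phrase were a
# prefix of another, the longer one would be preferred.
ordered = sorted(algorithms, key=len, reverse=True)

# One alternation of escaped phrases, wrapped in word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"

# findall returns a list of matches per row; join turns it into a string.
df["algorithms"] = df["Comments"].str.findall(pattern).str.join(", ")

print(df[["Gender", "Country", "algorithms"]])
```

The original attempt in the question fails because x.split() yields single words, and a single word can never equal a multi-word entry such as machine learning; matching against the whole comment string avoids that.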

Please note that it uses an f-string, so you'll need Python 3.6+. Let me know if you have a lower version of Python.
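If f-strings are not available, the same pattern can be built with str.format. This is only a sketch; the small ml list and the one-row df here are made-up sample data for illustration.

```python
import pandas as pd

# Hypothetical sample data, just to make the snippet runnable.
ml = ["machine learning", "fraud detection"]
df = pd.DataFrame(
    {"algo": ["machine learning and fraud detection are a must learn"]}
)

# Same pattern as the f-string version: "(phrase 1|phrase 2|...)"
pattern = "({})".format("|".join(ml))

df["algo_new"] = df["algo"].str.findall(pattern)
```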


For anyone interested in benchmarks (since we have 3 answers): running each solution on 9.6M rows, 10 times in a row, gives the following results:

  • AlexK:
    • mean: 14.94 sec
    • min: 12.43 sec
    • max: 17.08 sec
  • Teddy:
    • mean: 22.67 sec
    • min: 18.25 sec
    • max: 27.64 sec
  • AbsoluteSpace:
    • mean: 24.12 sec
    • min: 21.25 sec
    • max: 27.53 sec
User contributions licensed under: CC BY-SA