I have a dataframe called df
Gender  Country  Comments
male    USA      machine learning and fraud detection are a must learn
male    Canada   monte carlo method is great and so is hmm,pca, svm and neural net
female  USA      clustering and cloud computing
female  Germany  logistical regression and data management and fraud detection
female  Nigeria  nltk and supervised machine learning
male    Ghana    financial engineering and cross validation and time series
and a list called algorithms
algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']
So technically, for each row of the Comments column, I’m trying to extract words that appear in the algorithms list. This is what I’m trying to achieve
Gender  Country  algorithms
male    USA      machine learning, fraud detection
male    Canada   monte carlo method, hmm,pca, svm, neural net
female  USA      clustering, cloud computing
female  Germany  logistical regression, data management, fraud detection
female  Nigeria  nltk, supervised machine learning
male    Ghana    financial engineering, cross validation, time series
However, this is what I’m getting
Gender  Country  algorithms
male    USA
male    Canada   hmm pca svm
female  USA      clustering
female  Germany
female  Nigeria  nltk
male    Ghana
Words like machine learning and fraud detection don’t appear. Basically, all the 2-gram phrases are missing.
This is the code I used
df['algorithms'] = df['Comments'].apply(lambda x: " ".join(x for x in x.split() if x in algorithms))
Answer
You can use pandas.Series.str.findall in combination with join.
import pandas as pd
import re

df['algo_new'] = df.algo.str.findall(f"({ '|'.join(ml) })")

>> out

   col1  gender  algo                                                algo_new
0  usa   male    machine learning and fraud detection are a mus...  [machine learning, fraud detection, clustering]
1  fr    female  monte carlo method is great and so is hmm,pca,...  [monte carlo method]
2  arg   male    logistical regression and data management and ...  [logistical regression, data management, fraud..
We use join to join the strings in your ml list and add a | between each string to capture value 1 OR value 2, etc. Then we use findall to find all occurrences.
Please note that it uses an f-string, so you’ll need Python 3.6+. Let me know if you have a lower version of Python.
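As a rough sketch of how the same idea could look on the DataFrame from the question (df, Comments and algorithms come from the question; re.escape, the longest-first sort and the final ", " join are my own additions, not part of the snippet above). It builds the pattern with a plain join, so no f-string is needed. Note that it only finds phrases that are in the algorithms list, so hmm, pca, svm and neural net in the Canada row will not be picked up unless you add them to the list.

```python
import re

import pandas as pd

algorithms = ['machine learning', 'fraud detection', 'monte carlo method',
              'time series', 'cross validation', 'supervised machine learning',
              'logistical regression', 'nltk', 'clustering', 'data management',
              'cloud computing', 'financial engineering']

df = pd.DataFrame({
    'Gender': ['male', 'male', 'female', 'female', 'female', 'male'],
    'Country': ['USA', 'Canada', 'USA', 'Germany', 'Nigeria', 'Ghana'],
    'Comments': [
        'machine learning and fraud detection are a must learn',
        'monte carlo method is great and so is hmm,pca, svm and neural net',
        'clustering and cloud computing',
        'logistical regression and data management and fraud detection',
        'nltk and supervised machine learning',
        'financial engineering and cross validation and time series',
    ],
})

# Assumption: try longer phrases first, so that a longer phrase is preferred
# over any shorter phrase that happens to be its prefix.
ordered = sorted(algorithms, key=len, reverse=True)

# Escape each phrase so regex metacharacters are treated literally,
# then join with | to mean "phrase 1 OR phrase 2 OR ...".
pattern = '|'.join(re.escape(phrase) for phrase in ordered)

# findall returns a list of matches per row; str.join turns each list
# into a single comma-separated string.
df['algorithms'] = df['Comments'].str.findall(pattern).str.join(', ')

print(df[['Gender', 'Country', 'algorithms']])
```

The reason the original split() approach misses the multi-word phrases is that it compares single whitespace-separated tokens against the list, so 'machine' and 'learning' are checked separately and neither equals 'machine learning'. Matching the phrases against the whole comment string avoids that.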
For anyone interested in benchmarks (since we have 3 answers), running each solution on 9.6M rows, 10 times in a row, gives us the following results:
- AlexK:
  - mean: 14.94 sec
  - min: 12.43 sec
  - max: 17.08 sec
- Teddy:
  - mean: 22.67 sec
  - min: 18.25 sec
  - max: 27.64 sec
- AbsoluteSpace:
  - mean: 24.12 sec
  - min: 21.25 sec
  - max: 27.53 sec