I have a dataframe called df
Gender Country Comments
male USA machine learning and fraud detection are a must learn
male Canada monte carlo method is great and so is hmm,pca, svm and neural net
female USA clustering and cloud computing
female Germany logistical regression and data management and fraud detection
female Nigeria nltk and supervised machine learning
male Ghana financial engineering and cross validation and time series
and a list called algorithms
algorithms = ['machine learning','fraud detection', 'monte carlo method', 'time series', 'cross validation', 'supervised machine learning', 'logistical regression', 'nltk','clustering', 'data management','cloud computing','financial engineering']
So technically, for each row of the Comments column, I’m trying to extract words that appear in the algorithms list. This is what I’m trying to achieve
Gender Country algorithms
male USA machine learning, fraud detection
male Canada monte carlo method, hmm,pca, svm, neural net
female USA clustering, cloud computing
female Germany logistical regression, data management, fraud detection
female Nigeria nltk, supervised machine learning
male Ghana financial engineering, cross validation, time series
However, this is what I’m getting
Gender Country algorithms
male USA
male Canada hmm pca svm
female USA clustering
female Germany
female Nigeria nltk
male Ghana
Terms like machine learning and fraud detection don’t appear. Basically, all the multi-word terms (2-grams) are missing.
This is the code I used
df['algorithms'] = df['Comments'].apply(lambda x: " ".join(x for x in x.split() if x in algorithms))
Answer
You can use pandas.Series.str.findall in combination with join.
import pandas as pd
import re
# `algo` is the text column and `ml` is the list of terms to look for
df['algo_new'] = df.algo.str.findall(f"({ '|'.join(ml) })")
>> out
col1 gender algo algo_new
0 usa male machine learning and fraud detection are a mus [machine learning, fraud detection, clustering]
1 fr female monte carlo method is great and so is hmm,pca, [monte carlo method]
2 arg male logistical regression and data management and [logistical regression, data management, fraud..
We use join to join the strings in your ml list, adding a | between each string so that the pattern captures value 1 OR value 2, and so on. Then we use findall to find all occurrences.
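As a rough sketch of how this could be applied to the names used in the question (df, Comments, algorithms), the snippet below builds a small sample frame and joins the matches back into one string per row. The sample rows, the re.escape call, and the longest-first sort are my own additions, not part of the answer above, so treat them as assumptions.

```python
import re
import pandas as pd

# Small sample taken from the question's data (only three rows, for illustration)
df = pd.DataFrame({
    "Gender": ["male", "female", "female"],
    "Country": ["USA", "USA", "Nigeria"],
    "Comments": [
        "machine learning and fraud detection are a must learn",
        "clustering and cloud computing",
        "nltk and supervised machine learning",
    ],
})
algorithms = ["machine learning", "fraud detection", "supervised machine learning",
              "nltk", "clustering", "cloud computing"]

# Assumption: sort longest first so that, when two terms start at the same
# position, the longer one wins; escape each term in case it contains
# regex metacharacters.
terms = sorted(algorithms, key=len, reverse=True)
pattern = "|".join(re.escape(t) for t in terms)

# findall returns a list of matches per row; str.join turns each list into text
df["algorithms"] = df["Comments"].str.findall(pattern).str.join(", ")
print(df[["Gender", "Country", "algorithms"]])
```

Because the pattern matches whole phrases against the full comment, multi-word terms are found, which the split()-based approach in the question cannot do since it only ever compares single words against the list.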
Please note that it uses an f-string, so you’ll need python 3.6+. Let me know if you have a lower version of python.
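For older Python versions, one possible way to build the same pattern without an f-string is str.format. This is only a sketch: the ml list and the sample text here are illustrative stand-ins, and plain re.findall is used in place of the pandas call to keep the snippet short.

```python
import re

# Illustrative stand-in for the ml list used above
ml = ["machine learning", "fraud detection", "clustering"]

# Same pattern as the f-string version, built with str.format instead
pattern = "({})".format("|".join(ml))

text = "machine learning and fraud detection are a must learn"
matches = re.findall(pattern, text)
print(pattern)
print(matches)
```

The resulting pattern string can be passed to df.algo.str.findall in exactly the same way as the f-string version.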
For anyone interested in benchmarks (since we have 3 answers), running each solution 10 times in a row on 9.6M rows gives the following results:
- AlexK:
- mean: 14.94 sec
- min: 12.43 sec
- max: 17.08 sec
- Teddy:
- mean: 22.67 sec
- min: 18.25 sec
- max: 27.64 sec
- AbsoluteSpace:
- mean: 24.12 sec
- min: 21.25 sec
- max: 27.53 sec