I’m trying to return the highest-frequency trigram for each group of keywords in a new column of a pandas DataFrame (essentially a groupby with transform that returns the most frequent trigram in a new column).
An example dataframe with dummy data
  cluster_name                         keyword
0       summer          summer dresses size 10
1       summer          summer dresses size 12
2       summer            large summer dresses
3       summer          summer dresses size 14
4      strappy   ladies strappy summer dresses
5      strappy  strappy summer dresses uk 2022
6      strappy            strappy summer dress
7      strappy          strappy summer dresses
8      strappy       thin strap summer dresses
Desired Output
  cluster_name                 trigram
0       summer     summer dresses size
4      strappy  strappy summer dresses
Minimum Reproducible Example
import pandas as pd

data = [
    ["summer", "summer dresses size 10"],
    ["summer", "summer dresses size 12"],
    ["summer", "large summer dresses"],
    ["summer", "summer dresses size 14"],
    ["strappy", "ladies strappy summer dresses"],
    ["strappy", "strappy summer dresses uk 2022"],
    ["strappy", "strappy summer dress"],
    ["strappy", "strappy summer dresses"],
    ["strappy", "thin strap summer dresses"],
]

df = pd.DataFrame(data, columns=['cluster_name', 'keyword'])
print(df)
What I’ve tried.
I have working code to find bigrams, but it’s a bit hacky. It is fast, though (much faster than iterrows, which I’d like to avoid). It was taken from this solution: How to get group-by and get most frequent words and bigrams for each group pandas
The ideal outcome would be a universal solution that I could tweak to return unigrams, bigrams, trigrams, etc. just by changing a single value.
from collections import Counter

def bigram(row):
    # Split the keyword into words and pair each word with its successor.
    lst = row['keyword'].split(' ')
    return [(lst[x].strip(), lst[x+1].strip()) for x in range(len(lst)-1)]

df['parent_cluster'] = df.apply(lambda row: bigram(row), axis=1)
# Summing the lists concatenates them, giving every bigram per cluster.
df2 = df.groupby('cluster_name').agg({'parent_cluster': 'sum'})
df3 = df2.parent_cluster.apply(lambda row: Counter(row)).to_frame().astype(str)
df3["parent_cluster"] = (df3["parent_cluster"].str.split(',').str[0])
# clean up the unigram column to remove the string of the Counter library.
# regex=False treats the patterns as literal text rather than regular expressions.
df3["parent_cluster"] = df3["parent_cluster"].str.replace("Counter({('", '', regex=False)
df3["parent_cluster"] = df3["parent_cluster"].str.replace("'", '', regex=False)
Answer
You can use nltk.ngrams combined with explode/groupby/mode:
from nltk import ngrams  # or use a custom function

out = (df
   .assign(keyword=[list(ngrams(s.split(), n=3)) for s in df['keyword']])
   .explode('keyword')
   .groupby('cluster_name')['keyword'].apply(lambda g: g.mode()[0])
)
output:
cluster_name
strappy    (strappy, summer, dresses)
summer        (summer, dresses, size)
Name: keyword, dtype: object
As strings:
out = (df
   .assign(keyword=[[' '.join(x) for x in ngrams(s.split(), n=3)] for s in df['keyword']])
   .explode('keyword')
   .groupby('cluster_name')['keyword'].apply(lambda g: g.mode()[0])
   .reset_index(name='trigram')
)
output:
  cluster_name                 trigram
0      strappy  strappy summer dresses
1       summer     summer dresses size
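If you would rather not depend on nltk, or want to switch between unigrams, bigrams and trigrams by changing one value, the same explode/groupby/mode idea can be wrapped in a small function. This is only a sketch: `word_ngrams`, `top_ngram` and the `group_col`/`text_col` parameters are names made up for illustration, and it assumes keywords are split on whitespace. Ties are resolved by taking the first value returned by `mode()`, which is sorted.

```python
import pandas as pd


def word_ngrams(text, n):
    """Hypothetical helper: n-grams of a whitespace-split string, space-joined."""
    words = text.split()
    # zip over n staggered slices; yields nothing when there are fewer than n words
    return [" ".join(gram) for gram in zip(*(words[i:] for i in range(n)))]


def top_ngram(df, n, group_col="cluster_name", text_col="keyword"):
    """Hypothetical helper: most frequent n-gram per group, as a Series."""
    exploded = (
        df.assign(ngram=[word_ngrams(s, n) for s in df[text_col]])
        .explode("ngram")
        .dropna(subset=["ngram"])  # keywords shorter than n words explode to NaN
    )
    return exploded.groupby(group_col)["ngram"].agg(lambda g: g.mode()[0])


data = [
    ["summer", "summer dresses size 10"],
    ["summer", "summer dresses size 12"],
    ["summer", "large summer dresses"],
    ["summer", "summer dresses size 14"],
    ["strappy", "ladies strappy summer dresses"],
    ["strappy", "strappy summer dresses uk 2022"],
    ["strappy", "strappy summer dress"],
    ["strappy", "strappy summer dresses"],
    ["strappy", "thin strap summer dresses"],
]
df = pd.DataFrame(data, columns=["cluster_name", "keyword"])

top = top_ngram(df, n=3)  # change n to 1 or 2 for unigrams or bigrams
# Map the per-cluster result back onto every row, like a groupby + transform.
df["trigram"] = df["cluster_name"].map(top)
print(df)
```

Mapping the per-cluster Series back with `map` puts the result in a new column on the original rows, which is what the question describes; use `top.reset_index(name="trigram")` instead if one row per cluster is enough.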