Find trigrams for all groupby clusters in a Pandas Dataframe and return in a new column

I’m trying to find the most frequent trigram within each group of keywords in a pandas dataframe and return it in a new column (essentially a groupby with transform, so that every row in a group receives that group’s top trigram).

An example dataframe with dummy data


Desired Output


Minimal Reproducible Example


What I’ve tried.

I have working code to find bigrams, but it’s a bit hacky. It is fast, though (much faster than iterrows, which I’d like to avoid). It was taken from this solution: How to get group-by and get most frequent words and bigrams for each group pandas

The ideal outcome would be a general solution that I could tweak to return unigrams, bigrams, trigrams, etc. just by changing a single value.
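To make the "single value" idea concrete, here is a minimal sketch of such a helper, using only collections.Counter and pandas. The original dataframe is not shown here, so the column names `cluster` and `keyword`, the sample rows, and the helper names `ngrams` and `top_ngram` are all assumptions made for illustration.

```python
from collections import Counter

import pandas as pd


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))


def top_ngram(texts, n):
    """Most frequent n-gram across an iterable of strings, joined with spaces."""
    counts = Counter(g for text in texts for g in ngrams(text.split(), n))
    # Ties resolve to the n-gram seen first; None if no text has n tokens.
    return " ".join(counts.most_common(1)[0][0]) if counts else None


N = 3  # the single value to change: 1 = unigrams, 2 = bigrams, 3 = trigrams

# Hypothetical data: "cluster" and "keyword" are assumed column names.
df = pd.DataFrame({
    "cluster": ["shoes", "shoes", "shoes", "laptop", "laptop", "laptop"],
    "keyword": [
        "red running shoes sale",
        "cheap red running shoes",
        "red running shoes review",
        "best gaming laptop deals",
        "best gaming laptop 2022",
        "buy best gaming laptop",
    ],
})

# One top n-gram per cluster, then broadcast back to every row of the cluster.
per_cluster = df.groupby("cluster")["keyword"].agg(lambda s: top_ngram(s, N))
df["top_ngram"] = df["cluster"].map(per_cluster)
```

Aggregating once per cluster and mapping the result back avoids any row-wise iteration. Tie-breaking here is simply "first seen", which may or may not be what you want.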


Answer

You can use nltk.ngrams combined with explode/groupby/mode:
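A rough sketch of how that combination might look is below. It is not the answer's original code: the column names `cluster` and `keyword` and the sample rows are assumptions, and a small fallback stands in for `nltk.ngrams` in case NLTK is not installed.

```python
import pandas as pd

try:
    from nltk import ngrams  # ngrams(sequence, n) yields tuples of length n
except ImportError:
    # Assumed stand-in with the same behaviour for plain token lists.
    def ngrams(tokens, n):
        tokens = list(tokens)
        return zip(*(tokens[i:] for i in range(n)))

N = 3  # change to 1 or 2 for unigrams or bigrams

# Hypothetical data: "cluster" and "keyword" are assumed column names.
df = pd.DataFrame({
    "cluster": ["shoes", "shoes", "shoes", "laptop", "laptop", "laptop"],
    "keyword": [
        "red running shoes sale",
        "cheap red running shoes",
        "red running shoes review",
        "best gaming laptop deals",
        "best gaming laptop 2022",
        "buy best gaming laptop",
    ],
})

# 1. Split each keyword into tokens and build its list of n-gram tuples.
tmp = df.assign(
    ngram=df["keyword"].str.split().apply(lambda toks: list(ngrams(toks, N)))
)

# 2. explode: one row per n-gram (rows with no n-grams become NaN and are dropped).
tmp = tmp.explode("ngram").dropna(subset=["ngram"])

# 3. groupby + mode: the most frequent n-gram in each cluster.
top = tmp.groupby("cluster")["ngram"].agg(lambda s: s.mode().iloc[0])

# 4. Map the per-cluster result back onto the original rows.
df["top_ngram"] = df["cluster"].map(top)
```

When several n-grams tie within a cluster, `mode()` returns all of them and `.iloc[0]` keeps only the first.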


output:


As strings:
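If the n-grams are stored as tuples, joining each tuple with a space is one simple way to get plain strings. The Series below is hypothetical and only stands in for a tuple-valued result column.

```python
import pandas as pd

# Hypothetical per-cluster result holding n-grams as tuples.
top = pd.Series(
    [("red", "running", "shoes"), ("best", "gaming", "laptop")],
    index=["shoes", "laptop"],
    name="top_ngram",
)

# Join each tuple into a space-separated string; missing values are left untouched.
top_as_text = top.map(" ".join, na_action="ignore")
```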


output:

User contributions licensed under: CC BY-SA