I’m trying to create a bigram from a dictionary with a specific condition. Below is the example of the dictionary:
dict_example = {'keywords1': ['africa', 'basic service', 'class', 'develop country', 'disadvantage', 'economic resource', 'social protection system']}
The specific condition is that I want to create bigrams only for elements that contain more than one word. Below is the code that I have been working on so far:
import nltk
from nltk.tokenize import word_tokenize

keywords_bigram_temp = {}
keywords_bigram = {}
for k, v in dict_example.items():
    keywords_bigram_temp.update({k: [word_tokenize(w) for w in v]})
for k2, v2 in keywords_bigram_temp.items():
    keywords_bigram.update({k2: [list(nltk.bigrams(v3)) for v3 in v2 if len(v3) > 1]})
This code works, but instead of returning a flat list of tuples (which I believe is how bigrams are normally represented), it returns tuples inside nested lists. Below is an example of the result:
{'keywords1': [[('basic', 'service')], [('develop', 'country')], [('economic', 'resource')], [('social', 'protection'), ('social', 'system'), ('protection', 'system'), ('social', 'protection')]]}
What I need is a normal bigram structure, a tuple within a list like so:
{'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('protection', 'system')]}
Answer
Here’s a way to do what your question asks using itertools.combinations():
from itertools import combinations

keywords_bigram = {'keywords1': [x for elem in dict_example['keywords1'] if ' ' in elem for x in combinations(elem.split(), 2)]}
Output:
{'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('social', 'system'), ('protection', 'system')]}
Explanation:
- in the list comprehension, use for elem in dict_example['keywords1'] if ' ' in elem to iterate over all items in the list associated with keywords1 that contain a ' ' character, meaning the element has more than one word
- use the nested loop for x in combinations(elem.split(), 2) to include every unique combination of 2 words within the multi-word item
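As a small illustration of the second point, the sketch below runs combinations() on one three-word item from the question. It only demonstrates the pairing behaviour; the variable names words and pairs are chosen here for illustration.

```python
from itertools import combinations

# One multi-word item taken from the question's example list.
words = "social protection system".split()

# combinations() yields every pair of words in input order, including
# pairs of words that are not adjacent, such as ('social', 'system').
pairs = list(combinations(words, 2))
print(pairs)
```

Note that this produces three pairs for a three-word item, one of which joins non-adjacent words.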
UPDATE:
Based on OP’s clarification that the original question contained an extra entry, and that what is required is “in a 'a b c d' context, it will become ('a','b'),('b','c'),('c','d')”, here are three alternative solutions.
Solution #1 using the walrus operator := (Python 3.8+) and a list comprehension:
keywords_bigram = {'keywords1': [x for elem in dict_example['keywords1'] if len(words := elem.split()) > 1 for x in zip(words, words[1:])]}
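To see why zip(words, words[1:]) gives consecutive pairs, here is a minimal sketch using the 'a b c d' case from OP's clarification. The names bigrams and single are illustrative only.

```python
words = "a b c d".split()

# words[1:] is the same list shifted by one position, so zip() pairs
# each word with its immediate successor and stops at the shorter input.
bigrams = list(zip(words, words[1:]))
print(bigrams)  # [('a', 'b'), ('b', 'c'), ('c', 'd')]

# For a single-word item, words[1:] is empty, so zip() yields nothing.
single = "africa".split()
print(list(zip(single, single[1:])))  # []
```

Because a single-word item yields no pairs anyway, the len(words) > 1 check mainly makes the intent explicit; with zip() the result should be the same without it.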
Solution #2 using a long-hand for loop:
keywords_bigram = {'keywords1': []}
for elem in dict_example['keywords1']:
    words = elem.split()
    if len(words) > 1:
        keywords_bigram['keywords1'].extend(zip(words, words[1:]))
Solution #3 without zip():
keywords_bigram = {'keywords1': []}
for elem in dict_example['keywords1']:
    words = elem.split()
    if len(words) > 1:
        for i in range(len(words) - 1):
            keywords_bigram['keywords1'].append(tuple(words[i:i+2]))
All three solutions give identical output:
{'keywords1': [('basic', 'service'), ('develop', 'country'), ('economic', 'resource'), ('social', 'protection'), ('protection', 'system')]}
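The solutions above hard-code the 'keywords1' key, whereas the code in the question loops over dict_example.items(). If the dictionary holds several keys, one possible generalisation is sketched below. The helper name consecutive_bigrams and the 'keywords2' entry are hypothetical, added only to show the multi-key case.

```python
def consecutive_bigrams(phrases):
    """Return adjacent word pairs from every multi-word phrase."""
    bigrams = []
    for phrase in phrases:
        words = phrase.split()
        # Single-word phrases contribute nothing, since words[1:] is empty.
        bigrams.extend(zip(words, words[1:]))
    return bigrams


dict_example = {
    'keywords1': ['africa', 'basic service', 'class', 'develop country',
                  'disadvantage', 'economic resource', 'social protection system'],
    # Hypothetical second key, not part of the original question.
    'keywords2': ['clean water', 'energy'],
}

# Apply the helper to the list stored under each key.
keywords_bigram = {key: consecutive_bigrams(phrases)
                   for key, phrases in dict_example.items()}
print(keywords_bigram)
```

This uses str.split() rather than word_tokenize, as the answer's solutions do, so it assumes the words in each item are separated by whitespace.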