I am quite new to Python, so I am not sure if there’s a simple solution to my problem. I have a large corpus of text split into ~40,000 documents, each in one row (already tokenized so each word in a row is a token). I calculated the co-occurrences for each two-word combination, using the following code:
import itertools
from itertools import combinations
from collections import Counter

cooccurrences = []
for tokens in data['tokenized_text']:
    tokens_pairs = itertools.combinations(tokens, 2)
    for pair in tokens_pairs:
        cooccurrences.append(tuple(sorted(pair)))

word_cooccurrence_counter = Counter(cooccurrences)
I can calculate the frequency of co-occurrence of any two words like this:
word_cooccurrence_counter['foo','faa']
[output]: 124
Now I would like to be able to get these results for a specific set of words, over all pairs where they appear. So for instance, for the word ‘foo’, I’d like to get all the words for which the co-occurrence frequency is more than 0.
I have tried doing this using a loop over all the words in the corpus:
outputs = []
# lst is a flat list of all the tokenized words in the corpus
for word in lst:
    get_results = word_cooccurrence_counter['foo', word]
    outputs.append([word, get_results])
This works, but because my corpus is so large, it crashes half the time. And at any rate I have a couple hundred words I’d like to do this for (beyond ‘foo’).
Is there a more efficient way of doing this? For instance, I thought of setting a threshold for the minimum co-occurrence frequency – but then it would still loop through my whole list of words in the corpus (there are thousands).
Any help is really appreciated!
Thanks
Answer
You do not have to iterate over the entire corpus, only over the Counter object, which holds one entry per distinct pair that actually co-occurs:
for words_tuple, count in word_cooccurrence_counter.items():
    if 'foo' in words_tuple:
        print(words_tuple, count)
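Also, your original implementation would find the counts of ('foo', word) but ignore the counts of (word, 'foo') (assuming the order matters). Because the keys were stored as sorted tuples, any word that sorts before 'foo' lives under (word, 'foo'), so looking up ('foo', word) returns 0 for it. The loop above finds and counts both.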
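Since you mention a couple hundred target words and a minimum-frequency threshold, the same idea can be stretched to handle all of them in one pass over the Counter. This is only a rough sketch: the function name neighbours_for, its min_count parameter, and the toy counts are made up for illustration, and it assumes the keys are two-word tuples as in your code.

```python
from collections import Counter, defaultdict


def neighbours_for(pair_counts, targets, min_count=1):
    """Map each target word to a Counter of the words it co-occurs with.

    pair_counts: Counter keyed by (word_a, word_b) tuples.
    targets:     the words of interest (hypothetical parameter).
    min_count:   pairs seen fewer times than this are skipped.
    """
    targets = set(targets)  # set membership tests are O(1)
    result = defaultdict(Counter)
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue
        # A target can sit in either position of the tuple.
        if first in targets:
            result[first][second] += count
        if second in targets and second != first:
            result[second][first] += count
    return result


# Toy counts standing in for word_cooccurrence_counter (invented numbers).
pair_counts = Counter({
    ("faa", "foo"): 124,
    ("bar", "foo"): 3,
    ("bar", "faa"): 7,
})

neighbours = neighbours_for(pair_counts, ["foo", "bar"], min_count=2)
print(neighbours["foo"].most_common())
```

The Counter is walked once no matter how many target words you pass in, so adding more words costs almost nothing, and most_common() gives you each word's partners sorted by frequency.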