
Efficient way of extracting co-occurrence values of specific word pairs from Python Counter() results

I am quite new to Python, so I am not sure if there’s a simple solution to my problem. I have a large corpus of text split into ~40,000 documents, each in one row (already tokenized so each word in a row is a token). I calculated the co-occurrences for each two-word combination, using the following code:

import itertools
from collections import Counter

cooccurrences = []

for tokens in data['tokenized_text']:
    tokens_pairs = itertools.combinations(tokens, 2)
    for pair in tokens_pairs:
        cooccurrences.append(tuple(sorted(pair)))

word_cooccurrence_counter = Counter(cooccurrences)

I can calculate the frequency of co-occurrence of any two words like this:

# keys are sorted tuples, so 'faa' comes before 'foo'
word_cooccurrence_counter['faa', 'foo']

[output]: 
124

Now I would like to be able to get these results for a specific set of words, over all pairs where they appear. So for instance, for the word ‘foo’, I’d like to get all the words for which the co-occurrence frequency is more than 0.

I have tried doing this using a loop over all the words in the corpus:

outputs = []
# lst is a flat list of all the tokenized words in the corpus

for word in lst:
    get_results = word_cooccurrence_counter['foo', word]
    outputs.append([word, get_results])

This works, but because my corpus is so large, it crashes half the time. And at any rate I have a couple hundred words I’d like to do this for (beyond ‘foo’).

Is there a more efficient way of doing this? For instance, I thought of setting a threshold for the minimum co-occurrence frequency – but then it would still loop through my whole list of words in the corpus (there are thousands).

Any help is really appreciated!

Thanks

Answer

You do not have to iterate over the entire corpus. Iterate over the Counter object instead: it holds one entry per distinct pair that actually co-occurs, so every match already has a count above 0:

for words_tuple, count in word_cooccurrence_counter.items():
    if 'foo' in words_tuple:
        print(words_tuple, count)

Also, because each pair is stored as a sorted tuple, your original loop only looks up ('foo', word) and so misses every pair stored as (word, 'foo'), which is the case whenever the other word sorts before 'foo'. The membership test above finds both.
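
Since you mention a couple hundred target words and a minimum-frequency threshold, the same idea can be extended so that the Counter is scanned only once for all of them. The following is only a rough sketch: the function name partners_for, the min_count parameter and the toy pair_counts data are made up for illustration (only the 124 count comes from your example), and pair_counts stands in for your word_cooccurrence_counter.

```python
from collections import Counter, defaultdict

def partners_for(pair_counts, targets, min_count=1):
    """Map each target word to a Counter of the words it co-occurs with.

    pair_counts is assumed to be keyed by 2-tuples of words, like
    word_cooccurrence_counter in the question.
    """
    targets = set(targets)  # set membership tests are O(1)
    partners = defaultdict(Counter)
    for (first, second), count in pair_counts.items():
        if count < min_count:
            continue  # drop pairs below the frequency threshold
        if first in targets:
            partners[first][second] += count
        # second != first avoids counting a (word, word) pair twice
        if second in targets and second != first:
            partners[second][first] += count
    return partners

# Hypothetical toy data standing in for word_cooccurrence_counter.
pair_counts = Counter({("faa", "foo"): 124, ("foo", "zap"): 3, ("bar", "faa"): 7})

result = partners_for(pair_counts, ["foo", "bar"], min_count=5)
print(result["foo"].most_common())  # [('faa', 124)]
```

This makes one pass over the Counter regardless of how many target words you pass in, and result[word].most_common() then gives the partners of each word sorted by frequency.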

User contributions licensed under: CC BY-SA