Create dictionary of context words without stopwords

I am trying to create a dictionary of the words in a text and their context. The context should be the list of words that occur within a five-word window (two words on either side) of the term's position in the string. I also want to exclude the stopwords from my output vectors.

My code is below. I can get the stopwords out of my dictionary’s keys but not the values.

from itertools import count

words = ["This", "is", "an", "example", "sentence"]
stopwords = ["it", "the", "was", "of"]
context_size = 2


stripes = {word:words[max(i - context_size,0):j] for word,i,j in zip(words,count(0),count(context_size+1)) if word.lower() not in stopwords}
print(stripes)

The output is:

{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}

Answer

words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = set(["it", "the", "was", "of", "is", "a"])
context_size = 2

stripes = []
for index, word in enumerate(words):
    if word.lower() in stopwords:
        continue
    i = max(index - context_size, 0)
    j = min(index + context_size, len(words) - 1) + 1
    context = words[i:index] + words[index + 1:j]
    stripes.append((word, context))

print(stripes)

I would recommend using a list of tuples, so that if a word occurs more than once in words, a later occurrence does not overwrite the earlier one as it would in a dict. I would also put the stopwords in a set, especially if it is a larger list such as NLTK's stopwords, since membership tests on a set are faster than on a list.
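
To illustrate the overwriting issue, here is a minimal sketch. The input list and the helper name neighbours are made up for the example and are not part of the original code. It also shows one possible alternative, a defaultdict(list) that concatenates the contexts of repeated words, if a dict is still preferred.

```python
from collections import defaultdict

# Hypothetical input in which "sentence" occurs twice.
words = ["short", "sentence", "long", "sentence"]
context_size = 1


def neighbours(words, index, size):
    """Words within `size` positions of words[index], excluding the word itself."""
    return words[max(index - size, 0):index] + words[index + 1:index + size + 1]


# Plain dict: the context of the second "sentence" replaces the first.
as_dict = {w: neighbours(words, i, context_size) for i, w in enumerate(words)}

# defaultdict(list): contexts of repeated words are concatenated instead.
merged = defaultdict(list)
for i, w in enumerate(words):
    merged[w].extend(neighbours(words, i, context_size))

print(as_dict["sentence"])  # ['long']
print(merged["sentence"])   # ['short', 'long', 'long']
```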

I also excluded the word itself from the context, but depending on how you want to use it, you might want to include it.

This results in:

[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
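
Note that the contexts in this output still contain stopwords such as 'is' and 'a'. If they should be dropped from the values as well, as the question asks, one possible variant is sketched below. It assumes the window is still measured in positions of the original list and stopwords are filtered out afterwards; the function name build_stripes is only illustrative. If the window should instead span two non-stopwords on either side, filter words first and then apply the same loop.

```python
words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = {"it", "the", "was", "of", "is", "a"}
context_size = 2


def build_stripes(words, stopwords, context_size):
    """Return (word, context) pairs with stopwords removed from both."""
    stripes = []
    for index, word in enumerate(words):
        if word.lower() in stopwords:
            continue
        start = max(index - context_size, 0)
        end = index + context_size + 1  # slicing past the end of a list is safe
        window = words[start:index] + words[index + 1:end]
        # Assumption: filter after windowing, so positions refer to the original text.
        context = [w for w in window if w.lower() not in stopwords]
        stripes.append((word, context))
    return stripes


print(build_stripes(words, stopwords, context_size))
# [('This', []), ('longer', ['example', 'sentence']), ('example', ['longer', 'sentence']), ('sentence', ['longer', 'example'])]
```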
User contributions licensed under: CC BY-SA