I am trying to create a dictionary of the words in a text and their context. The context should be the list of words that occur within a 5-word window (two words on either side) of the term’s position in the string. I also want to leave stopwords out of the output vectors.
My code is below. I can keep the stopwords out of my dictionary’s keys, but not out of the values.
    from itertools import count

    words = ["This", "is", "an", "example", "sentence"]
    stopwords = ["it", "the", "was", "of", "this", "is", "an"]
    context_size = 2

    stripes = {
        word: words[max(i - context_size, 0):j]
        for word, i, j in zip(words, count(0), count(context_size + 1))
        if word.lower() not in stopwords
    }
    print(stripes)
The output is:
{'example': ['is', 'an', 'example', 'sentence'], 'sentence': ['an', 'example', 'sentence']}
Answer
    words = ["This", "is", "a", "longer", "example", "sentence"]
    stopwords = set(["it", "the", "was", "of", "is", "a"])
    context_size = 2

    stripes = []
    for index, word in enumerate(words):
        if word.lower() in stopwords:
            continue
        i = max(index - context_size, 0)
        j = min(index + context_size, len(words) - 1) + 1
        context = words[i:index] + words[index + 1:j]
        stripes.append((word, context))

    print(stripes)
I would recommend using a list of tuples, so that if a word occurs more than once in words, every occurrence is kept. With a dict, each later occurrence overwrites the previous one and only the last survives. I would also put the stopwords in a set, especially if it's a larger list like NLTK's stopwords, since membership tests on a set are faster than on a list.
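To illustrate the overwriting problem, here is a small sketch. The input sentence and the names `as_dict` and `grouped` are made up for the example and are not from the question. It compares a plain dict with a `collections.defaultdict(list)`, which is one possible alternative to a list of tuples if you still want to look contexts up by word.

```python
from collections import defaultdict

# Illustrative input (an assumption, not from the question):
# "the" and "cat" each occur twice.
words = ["the", "cat", "saw", "the", "other", "cat"]
context_size = 2

as_dict = {}                 # plain dict: one context per word
grouped = defaultdict(list)  # one list of contexts per word

for index, word in enumerate(words):
    start = max(index - context_size, 0)
    # Slicing past the end of a list is safe in Python, so no upper clamp is needed.
    context = words[start:index] + words[index + 1:index + context_size + 1]
    as_dict[word] = context        # a later occurrence replaces the earlier one
    grouped[word].append(context)  # every occurrence is kept

print(as_dict["cat"])   # only the context of the last "cat"
print(grouped["cat"])   # the contexts of both occurrences
```

Whether you want a flat list of tuples or contexts grouped by word depends on what you do with the stripes afterwards.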
I also excluded the word itself from the context, but depending on how you want to use it, you might want to include it.
This results in:
[('This', ['is', 'a']), ('longer', ['is', 'a', 'example', 'sentence']), ('example', ['a', 'longer', 'sentence']), ('sentence', ['longer', 'example'])]
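Note that the contexts in this result still contain stopwords such as 'is' and 'a'. If they should be dropped from the values as well, one possible variation is to filter each window after slicing it. This is only a sketch: the helper `is_stopword` and the name `filtered_stripes` are introduced here for illustration.

```python
words = ["This", "is", "a", "longer", "example", "sentence"]
stopwords = {"it", "the", "was", "of", "is", "a"}
context_size = 2


def is_stopword(token):
    # Hypothetical helper: case-insensitive membership test against the set.
    return token.lower() in stopwords


filtered_stripes = []
for index, word in enumerate(words):
    if is_stopword(word):
        continue  # stopwords never become keys
    start = max(index - context_size, 0)
    # Window of up to context_size words on each side, excluding the word itself.
    window = words[start:index] + words[index + 1:index + context_size + 1]
    # Drop stopwords from the context as well.
    context = [w for w in window if not is_stopword(w)]
    filtered_stripes.append((word, context))

print(filtered_stripes)
```

Here the window is measured on the original word positions, so a context can shrink or become empty once its stopwords are removed. If the window should instead span two non-stopwords on either side, remove the stopwords from words first and then build the windows over the filtered list.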