I'm trying to build a simple positional index, but I'm having trouble getting the correct output.
Given a list of strings (sentences), I want to use each string's position in the list as its document id, then iterate over the words in the sentence and use each word's index in the sentence as its position. Then I update a dictionary of words with a tuple of the doc id and the word's position in the doc.
Code:
main func –
    def doc_pos_index(alist):
        inv_index = {}
        words = [word for line in alist for word in line.split(" ")]
        for word in words:
            if word not in inv_index:
                inv_index[word] = []
        for item, index in enumerate(alist):  # find item and its index in list
            for item2, index2 in enumerate(alist[item]):  # for words in string find word and its index
                if item2 in inv_index:
                    inv_index[i].append(tuple(index, index2))  # if word in index update its list with tuple of doc index and position
        return inv_index
example list:
    doc_list = [
        'hello Delivered dejection necessary objection do mr prevailed',
        'hello Delivered dejection necessary objection do mr prevailed',
        'hello Delivered dejection necessary objection do mr prevailed',
        'hello Delivered dejection necessary objection do mr prevailed',
        'hello Delivered dejection necessary objection do mr prevailed'
    ]
desired output:
{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)], 'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)], 'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)], etc.}
Current output:
{'Delivered': [], 'necessary': [], 'dejection': [], 'do': [], 'objection': [], 'prevailed': [], 'mr': [], 'hello': []}
FYI, I do know about the collections library and NLTK, but I'm mainly doing this for learning and practice.
Answer
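Try this. Note that enumerate yields (index, item) pairs, so unpack the document id first, and split each document into words before enumerating it: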
    >>> result = {}
    >>> for doc_id, doc in enumerate(doc_list):
    ...     for word_pos, word in enumerate(doc.split()):
    ...         result.setdefault(word, []).append((doc_id, word_pos))
    ...
    >>> result
    {'Delivered': [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
     'necessary': [(0, 3), (1, 3), (2, 3), (3, 3), (4, 3)],
     'dejection': [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)],
     'do': [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)],
     'objection': [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)],
     'prevailed': [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)],
     'mr': [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)],
     'hello': [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]}
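If you want this as a function like the one in the question, a minimal sketch could look like the following. It reuses the name doc_pos_index from the question; the parameter name docs and the two-sentence sample_docs list are made up here for illustration. It also sidesteps what appear to be the main problems in the original: the enumerate pairs were unpacked in the wrong order, the inner loop walked over characters instead of words, inv_index[i] referred to an undefined name, and tuple(index, index2) would raise a TypeError because tuple accepts at most one argument.

```python
def doc_pos_index(docs):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = {}
    # enumerate yields (index, item), so the document id comes first.
    for doc_id, doc in enumerate(docs):
        # Split the document into words before enumerating, so that
        # word_pos is a word offset rather than a character offset.
        for word_pos, word in enumerate(doc.split()):
            # setdefault creates the empty list on first sight of a word.
            inv_index.setdefault(word, []).append((doc_id, word_pos))
    return inv_index


# Hypothetical sample data, shorter than the list in the question.
sample_docs = ["hello Delivered dejection", "hello necessary objection"]
index = doc_pos_index(sample_docs)
print(index["hello"])      # [(0, 0), (1, 0)]
print(index["objection"])  # [(1, 2)]
```

Note that doc.split() with no argument splits on any run of whitespace, whereas split(" ") in the question would produce empty strings for consecutive spaces; which one you want depends on your input.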