
Simple in-memory positional inverted index in Python

I’m trying to make a simple positional index, but I’m having some problems getting the correct output.

Given a list of strings (sentences), I want to use each string’s position in the list as its document id, then iterate over the words in the sentence and use each word’s index in the sentence as its position. Then I update a dictionary of words with a tuple of the doc id and the word’s position in the doc.

Code:

main func –

def doc_pos_index(alist):
    inv_index= {}
    words = [word for line in alist for word in line.split(" ")]

    for word in words:
        if word not in inv_index:
            inv_index[word]=[]

    for item, index in enumerate(alist): # find item and it's index in list
        for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2)) # if word in index update it's list with tuple of doc index and position

    return inv_index 

example list:

doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]

desired output:

{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)], 
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
 etc...}

Current output:

{'Delivered': [],
'necessary': [], 
'dejection': [], 
'do': [],
'objection': [], 
'prevailed': [], 
'mr': [], 
'hello': []}

FYI: I do know about the collections library and NLTK, but I’m mainly doing this for learning/practice reasons.


Answer

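enumerate yields (index, item) pairs, so the names in your two loops are swapped, and the inner loop walks over the characters of the string rather than its words. The if test therefore compares an integer against the dictionary’s keys, never matches, and every list stays empty. That also hides two further problems inside the if body: the undefined name i, and tuple(index, index2), which is not a valid call. Check this: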

>>> result = {}
>>> for doc_id, doc in enumerate(doc_list):
...     for word_pos, word in enumerate(doc.split()):
...         result.setdefault(word, []).append((doc_id, word_pos))
...
>>> result
{'Delivered': [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 'necessary': [(0, 3), (1, 3), (2, 3), (3, 3), (4, 3)], 'dejection': [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)], 'do': [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)], 'objection': [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)], 'prevailed': [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)], 'mr': [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]}
>>> 
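For reference, here is a minimal sketch of the same loop wrapped back into the question's doc_pos_index function. The parameter name docs and the shortened way of building doc_list are my own choices for illustration, not part of the original post.

```python
def doc_pos_index(docs):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = {}
    # enumerate yields (index, item), so the index comes first
    for doc_id, doc in enumerate(docs):
        # split the sentence into words before enumerating,
        # otherwise the loop would walk over single characters
        for word_pos, word in enumerate(doc.split()):
            # setdefault creates the empty list the first time a word is seen
            inv_index.setdefault(word, []).append((doc_id, word_pos))
    return inv_index


# Same five identical sentences as in the question
doc_list = ['hello Delivered dejection necessary objection do mr prevailed'] * 5
index = doc_pos_index(doc_list)
print(index['Delivered'])  # [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)]
```

Note that doc.split() with no argument splits on any run of whitespace, whereas the question's line.split(" ") would produce empty strings for consecutive spaces. Which one you want depends on your input.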
User contributions licensed under: CC BY-SA