I have the following variable:
data = (
    "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
    {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]},
)
Each entry, e.g. data[1]['entities'][0] = (48, 54, 'Category 1'), stands for (start_offset, end_offset, entity).
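As a quick check of how the offsets work, the sketch below slices a shortened copy of the text. It assumes the offsets are character positions with an exclusive end, which is what the sample values suggest; the variable names are illustrative only.

```python
# Shortened copy of the sample text; the offsets of the first two entities still apply.
text = "Thousands of demonstrators have marched through London to protest the war in Iraq"

start, end, label = (48, 54, "Category 1")

# end_offset is exclusive, exactly like a normal Python slice.
span = text[start:end]
print(span)  # London
```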
I want to read each word of data[0] and tag it according to the data[1] entities. The final output I expect is:
{ 'Thousands': 'O', 'of': 'O', 'demonstrators': 'O', 'have': 'O', 'marched': 'O', 'through': 'O', 'London': 'S-1', 'to': 'O', 'protest': 'O', 'the': 'O', 'war': 'O', 'in': 'O', 'Iraq': 'S-1', 'and': 'O', 'demand': 'O', 'the': 'O', 'withdrawal': 'O', 'of': 'O', 'British': 'S-2', 'troops': 'O', 'from': 'O', 'that': 'O', 'country': 'O', '.': 'O', 'Many': 'O', 'people': 'S-3', 'have': 'B-3', 'been': 'B-3', 'killed': 'E-3', 'that': 'O', 'day': 'O', '.': 'O' }
Here, ‘O’ stands for ‘OutOfEntity’, ‘S’ for ‘Start’, ‘B’ for ‘Between’, and ‘E’ for ‘End’. The number after the hyphen is the category number, and the tags are unique for every given text.
I tried the following:
import re

# Map each entity's text to its category number, e.g. 'London' -> '1'.
entities = {}
offsets = data[1]['entities']
for entity in offsets:
    entities[data[0][entity[0]:entity[1]]] = re.findall('[0-9]+', entity[2])[0]

# Tag the words of each entity: first word 'S', last word 'E', the rest 'B'.
tags = {}
for key, value in entities.items():
    entity = key.split()
    if len(entity) > 1:
        bEntity = entity[1:-1]
        tags[entity[0]] = 'S-'+value
        tags[entity[-1]] = 'E-'+value
        for item in bEntity:
            tags[item] = 'B-'+value
    else:
        tags[entity[0]] = 'S-'+value
The output is:
{'London': 'S-1', 'Iraq': 'S-1', 'British': 'S-2', 'people': 'S-3', 'killed': 'E-3', 'have': 'B-3', 'been': 'B-3'}
From this point, I am stuck on how to deal with the ‘O’ entities. I would also like the code to be more efficient and readable. I don't think a dictionary is the right data structure here, because the same word can occur more than once in the text and dictionary keys must be unique.
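The concern about repeated words can be checked with a small sketch. The pairs below are made up for illustration; they mimic the two occurrences of "have" in the sample sentence, one outside an entity and one inside.

```python
# Two occurrences of "have": one outside any entity, one inside 'Category 3'.
pairs = [("have", "O"), ("people", "S-3"), ("have", "B-3")]

# Converting to a dict keeps only the last value seen for each key.
as_dict = dict(pairs)

print(len(pairs), len(as_dict))  # 3 2
print(as_dict["have"])           # B-3
```

A list of (word, tag) tuples keeps every occurrence and preserves sentence order, which a dictionary keyed by word cannot do.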
Answer
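import re

import nltk  # word_tokenize needs NLTK's Punkt tokenizer data to be downloaded


def ner(data):
    # Map each entity's text to its category number, e.g. 'London' -> '1'.
    entities = {}
    offsets = data[1]['entities']
    for entity in offsets:
        entities[data[0][int(entity[0]):int(entity[1])]] = re.findall('[0-9]+', entity[2])[0]

    # Use a list of (word, tag) tuples so that repeated words are allowed.
    tags = []
    for key, value in entities.items():
        entity = key.split()
        if len(entity) > 1:
            bEntity = entity[1:-1]
            tags.append((entity[0], 'S-'+value))
            for item in bEntity:
                tags.append((item, 'B-'+value))
            tags.append((entity[-1], 'E-'+value))
        else:
            tags.append((entity[0], 'S-'+value))

    # Every token whose text is not part of any entity is tagged 'O'.
    # Note: 'O' tokens are appended after the entity tokens, and a word that
    # also occurs inside an entity (e.g. 'have') is skipped here.
    tokens = nltk.word_tokenize(data[0])
    tagged_words = {word for word, _ in tags}
    OTokens = [(token, 'O') for token in tokens if token not in tagged_words]
    for token in OTokens:
        tags.append(token)
    return tags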
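If the output has to follow sentence order and keep repeated words, one option is to work from the character offsets directly instead of from the entity text. The following is a minimal sketch under stated assumptions, not the method used above: `tag_tokens` is a hypothetical helper, a simple regex tokenizer stands in for nltk.word_tokenize, entity spans are assumed to line up with token boundaries, and every label is assumed to contain a number.

```python
import re


def tag_tokens(text, entities):
    """Return [(token, tag), ...] in sentence order, using character offsets."""
    tagged = []
    # \w+ matches a word; [^\w\s] matches a single punctuation character.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity if it lies entirely inside its span.
            if start <= match.start() and match.end() <= end:
                number = re.search(r"\d+", label).group()  # assumes a digit in the label
                if match.start() == start:
                    prefix = "S"  # first (or only) token of the entity
                elif match.end() == end:
                    prefix = "E"  # last token of a multi-token entity
                else:
                    prefix = "B"  # token in between
                tag = f"{prefix}-{number}"
                break
        tagged.append((match.group(), tag))
    return tagged


text = "Many people have been killed that day."
entities = [(5, 28, "Category 3")]
print(tag_tokens(text, entities))
# [('Many', 'O'), ('people', 'S-3'), ('have', 'B-3'), ('been', 'B-3'),
#  ('killed', 'E-3'), ('that', 'O'), ('day', 'O'), ('.', 'O')]
```

With the variable from the question, the call would be tag_tokens(data[0], data[1]['entities']). Because each token is matched by position rather than by text, the first "have" in the sample sentence stays 'O' while the second one becomes 'B-3'.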