I have the following variable
data = ("Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country. Many people have been killed that day.",
        {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'), (111, 118, 'Category 2'), (150, 173, 'Category 3')]})
Here, data[1]['entities'][0] = (48, 54, 'Category 1') stands for (start_offset, end_offset, entity).
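As a quick sanity check of that convention, slicing the text with the first two elements of each tuple should return the entity's surface text. This is a minimal sketch, assuming the offsets are character positions with an exclusive end, which is what the data above suggests:

```python
data = (
    "Thousands of demonstrators have marched through London to protest "
    "the war in Iraq and demand the withdrawal of British troops from "
    "that country. Many people have been killed that day.",
    {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'),
                  (111, 118, 'Category 2'), (150, 173, 'Category 3')]},
)

text, annotations = data

# Slice the text with each (start_offset, end_offset) pair.
spans = [(text[start:end], label) for start, end, label in annotations["entities"]]

for span, label in spans:
    print(repr(span), label)
```

The last entity covers several words ("people have been killed"), which is why the tagging needs more than one tag per entity.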
I want to read each word of data[0] and tag it according to the entities in data[1]. I expect the final output to be:
{
    'Thousands': 'O',
    'of': 'O',
    'demonstrators': 'O',
    'have': 'O',
    'marched': 'O',
    'through': 'O',
    'London': 'S-1',
    'to': 'O',
    'protest': 'O',
    'the': 'O',
    'war': 'O',
    'in': 'O',
    'Iraq': 'S-1',
    'and': 'O',
    'demand': 'O',
    'the': 'O',
    'withdrawal': 'O',
    'of': 'O',
    'British': 'S-2',
    'troops': 'O',
    'from': 'O',
    'that': 'O',
    'country': 'O',
    '.': 'O',
    'Many': 'O',
    'people': 'S-3',
    'have': 'B-3',
    'been': 'B-3',
    'killed': 'E-3',
    'that': 'O',
    'day': 'O',
    '.': 'O'
}
Here, ‘O’ stands for ‘OutOfEntity’, ‘S’ stands for ‘Start’, ‘B’ stands for ‘Between’, and ‘E’ stands for ‘End’; the number after the hyphen is the entity's category number. The tags are unique for every given text.
I tried the following:
import re

entities = {}
offsets = data[1]['entities']
for entity in offsets:
    entities[data[0][entity[0]:entity[1]]] = re.findall('[0-9]+', entity[2])[0]

tags = {}
for key, value in entities.items():
    entity = key.split()
    if len(entity) > 1:
        bEntity = entity[1:-1]
        tags[entity[0]] = 'S-'+value
        tags[entity[-1]] = 'E-'+value
        for item in bEntity:
            tags[item] = 'B-'+value
    else:
        tags[entity[0]] = 'S-'+value
The output is:
{'London': 'S-1',
 'Iraq': 'S-1',
 'British': 'S-2',
 'people': 'S-3',
 'killed': 'E-3',
 'have': 'B-3',
 'been': 'B-3'}
From this point, I am stuck on how to deal with the ‘O’ entities. I would also like the code to be more efficient and readable. I suspect a dictionary is not the right data structure here, because the same word can occur more than once in the text, and each word would have to be a unique key.
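The concern about repeated words can be seen in a small illustration. The pairs below are taken from the expected output above; the variable names are only for this example:

```python
# Two words from the text that occur twice, with the tags expected for each occurrence.
pairs = [("have", "O"), ("have", "B-3"), ("that", "O"), ("that", "O")]

as_dict = dict(pairs)   # a later pair overwrites an earlier one with the same key
as_list = list(pairs)   # every occurrence is kept, in order

print(as_dict)
print(as_list)
```

In the dictionary, the first "have" loses its ‘O’ tag because only one value per key survives, while the list of (word, tag) tuples keeps one entry per occurrence.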
Answer
import re
import nltk  # word_tokenize requires NLTK's tokenizer data to be downloaded

def ner(data):
    # Map each entity's text to its category number, e.g. 'London' -> '1'.
    entities = {}
    offsets = data[1]['entities']
    for entity in offsets:
        entities[data[0][int(entity[0]):int(entity[1])]] = re.findall('[0-9]+', entity[2])[0]

    # Use a list of (word, tag) tuples so repeated words are not overwritten.
    tags = []
    for key, value in entities.items():
        entity = key.split()
        if len(entity) > 1:
            bEntity = entity[1:-1]
            tags.append((entity[0], 'S-'+value))
            for item in bEntity:
                tags.append((item, 'B-'+value))
            tags.append((entity[-1], 'E-'+value))
        else:
            tags.append((entity[0], 'S-'+value))

    # Every token whose text was not tagged above is tagged 'O'.
    # Note: the 'O' tokens are appended after the entity tokens,
    # so the result is not in text order.
    tokens = nltk.word_tokenize(data[0])
    tagged_words = [word for word, _ in tags]
    OTokens = [(token, 'O') for token in tokens if token not in tagged_words]
    for token in OTokens:
        tags.append(token)

    return tags
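As an alternative to matching by word text, each token can be tagged by its character offsets, so that a repeated word such as "have" gets the tag of the position where it occurs and the result stays in text order. This is a sketch and not the approach of the answer above. It assumes a simple regex tokenizer (`\w+|[^\w\s]`) in place of `nltk.word_tokenize`, and the function name `tag_by_offsets` is hypothetical:

```python
import re

def tag_by_offsets(data):
    """Return [(token, tag), ...] in text order, using character offsets."""
    text, annotations = data
    tagged = []
    # Tokens are runs of word characters or single punctuation marks.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in annotations["entities"]:
            # A token belongs to an entity if it lies fully inside its span.
            if start <= match.start() and match.end() <= end:
                number = re.findall(r"[0-9]+", label)[0]
                if match.start() == start:
                    prefix = "S"   # first (or only) token of the entity
                elif match.end() == end:
                    prefix = "E"   # last token of a multi-token entity
                else:
                    prefix = "B"   # token strictly between first and last
                tag = f"{prefix}-{number}"
                break
        tagged.append((match.group(), tag))
    return tagged

data = (
    "Thousands of demonstrators have marched through London to protest "
    "the war in Iraq and demand the withdrawal of British troops from "
    "that country. Many people have been killed that day.",
    {"entities": [(48, 54, 'Category 1'), (77, 81, 'Category 1'),
                  (111, 118, 'Category 2'), (150, 173, 'Category 3')]},
)

result = tag_by_offsets(data)
for token, tag in result:
    print(token, tag)
```

The nested loop over entities is fine for a handful of annotations; for long texts with many entities, sorting the spans and walking them alongside the tokens would avoid the repeated scan.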