I created a function to extract sentences from a specific key in a nested structure (a list of dictionaries). Now I would like the function to also attach a label to each sentence.
Each time the value HEADER appears, it marks the beginning of a new story. I would like sentences that belong to the same story to share a label, and sentences from different stories to get different labels.
The data looks like the following:
sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
             {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
             {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
             {'c':'h', 'b': 'p'},
             {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
             {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
             {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
             {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
             {'c':'h', 'b': 'p'},
             {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]
The function
import pandas as pd

def prhases_and_labels(data):
    # Keep only the records that contain the nested 'a1' key
    a1 = [d for d in data if 'a1' in d]
    text = []
    for i in a1:
        text.append(i['a1']['a'])

    df = pd.DataFrame({'text': text})
    return df
The result I would like to obtain is the same DataFrame with the labels in a new column.
Answer
You can iterate over the records and increment the label every time the c value is HEADER.
import pandas as pd

sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
             {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
             {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
             {'c':'h', 'b': 'p'},
             {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
             {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
             {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
             {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
             {'c':'h', 'b': 'p'},
             {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]


def prhases_and_labels(data):
    label = 0  # current story number; 0 until the first HEADER is seen
    res = {'text':[], 'label': []}
    for record in data:
        # Records without 'a1' carry no sentence and are skipped
        if 'a1' in record:
            line = record['a1']['a']
            # A HEADER record starts a new story, so move to the next label
            if record.get('c') == 'HEADER':
                label += 1

            res['text'].append(line)
            res['label'].append(label)

    return pd.DataFrame(res)
Output:
>>> prhases_and_labels(sentences)

   text                                                                label
0  Opus dei, la vie en rose.                                           1
1  Ipsum lorem, Suspendisse posuere.                                   1
2  Nulla elementum, augue fringilla tincidunt ullamcorper.             1
3  Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      1
4  NEW Opus dei, la vie en rose.                                       2
5  NEW Ipsum lorem, Suspendisse posuere.                               2
6  NEW Nulla elementum, augue fringilla tincidunt ullamcorper.         2
7  NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.  2
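If you would rather let pandas do the counting, the same labels can probably be derived with a cumulative sum over a "is this a HEADER record" flag. This is only a sketch of an alternative, not part of the answer above: the function name phrases_and_labels_cumsum and the helper column is_header are made up for illustration. As in the loop version, records without 'a1' are dropped before labelling, and any sentence that appears before the first HEADER gets label 0.

```python
import pandas as pd


def phrases_and_labels_cumsum(data):
    # Hypothetical alternative: flag HEADER records, then take a running total.
    rows = [
        {"text": rec["a1"]["a"], "is_header": rec.get("c") == "HEADER"}
        for rec in data
        if "a1" in rec  # skip records that carry no sentence
    ]
    df = pd.DataFrame(rows, columns=["text", "is_header"])
    # Each HEADER adds 1 to the running total, which becomes the story label.
    df["label"] = df["is_header"].astype(int).cumsum()
    return df[["text", "label"]]


if __name__ == "__main__":
    sample = [
        {"c": "HEADER", "a1": {"a": "first story, line 1"}},
        {"a1": {"a": "first story, line 2"}},
        {"c": "h", "b": "p"},
        {"c": "HEADER", "a1": {"a": "second story, line 1"}},
    ]
    print(phrases_and_labels_cumsum(sample))
```

For a small list the explicit loop is arguably easier to read; the cumulative sum mainly helps if the records already live in a DataFrame.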