I wrote a function that extracts sentences from a specific key in a nested structure (a list of dictionaries). I would now like the function to also attach a label to each sentence it extracts.
Each time the value HEADER appears, it marks the beginning of a new story. I would like sentences that belong to the same story to share a label, and sentences from different stories to get different labels.
The data looks like the following:
sentences = [
    {'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
    {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
    {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
    {'c': 'h', 'b': 'p'},
    {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
    {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
    {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
    {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
    {'c': 'h', 'b': 'p'},
    {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
]
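The function:
import pandas as pd

def prhases_and_labels(data):
    # Keep only the records that contain the 'a1' key
    a1 = [d for d in data if 'a1' in d]
    text = []
    for i in a1:
        # The sentence is stored under the nested key 'a'
        text.append(i['a1']['a'])
    df = pd.DataFrame({'text': text})
    return df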
The result that I would like to obtain (with the labels in a new column)
Answer
You can iterate over the records and increment the label every time the value of c is HEADER.
import pandas as pd

sentences = [
    {'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
    {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
    {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
    {'c': 'h', 'b': 'p'},
    {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
    {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
    {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
    {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
    {'c': 'h', 'b': 'p'},
    {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
]

def prhases_and_labels(data):
    label = 0  # current story number; incremented at each HEADER
    res = {'text': [], 'label': []}
    for record in data:
        # Only records with an 'a1' key carry a sentence
        if 'a1' in record:
            line = record['a1']['a']
            # A HEADER record starts a new story
            if record.get('c') == 'HEADER':
                label += 1
            res['text'].append(line)
            res['label'].append(label)
    return pd.DataFrame(res)
Output:
>>> prhases_and_labels(sentences)
   text                                                                label
0  Opus dei, la vie en rose.                                           1
1  Ipsum lorem, Suspendisse posuere.                                   1
2  Nulla elementum, augue fringilla tincidunt ullamcorper.             1
3  Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      1
4  NEW Opus dei, la vie en rose.                                       2
5  NEW Ipsum lorem, Suspendisse posuere.                               2
6  NEW Nulla elementum, augue fringilla tincidunt ullamcorper.         2
7  NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.  2
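As a variation on the loop in the answer, the same labels could probably be derived with a cumulative sum over a "is this record a HEADER?" flag. This is only a sketch, not the answerer's method: the function name phrases_and_labels_cumsum, the helper column is_header, and the shortened sample data are made up for illustration. One difference to be aware of: here a HEADER record that has no a1 key still starts a new story, whereas the loop in the answer only increments the label when the HEADER record also carries a sentence.

```python
import pandas as pd

def phrases_and_labels_cumsum(data):
    # Hypothetical name; a sketch of an alternative to the explicit loop.
    df = pd.DataFrame({
        # Sentence if the record has one, otherwise None
        'text': [d['a1']['a'] if 'a1' in d else None for d in data],
        # True for records that mark the start of a new story
        'is_header': [d.get('c') == 'HEADER' for d in data],
    })
    # Running count of HEADER records = story number of each record
    df['label'] = df['is_header'].cumsum()
    # Drop records without a sentence and renumber the rows
    return df.dropna(subset=['text'])[['text', 'label']].reset_index(drop=True)

# Shortened, made-up sample in the same shape as the question's data
sample = [
    {'c': 'HEADER', 'a1': {'a': 'First story, sentence one.'}},
    {'c': 'h', 'b': 'p'},
    {'a1': {'a': 'First story, sentence two.'}},
    {'c': 'HEADER', 'a1': {'a': 'Second story, sentence one.'}},
]
result = phrases_and_labels_cumsum(sample)
print(result)
```

If the labels should start at 0 instead of 1, subtracting 1 from the label column after the cumulative sum would be one way to do it.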