Skip to content
Advertisement

Labeling sentences from different nested dictionaries

I created a function to extract sentences from a specific key in a nested file. Now I would like to include in this function a label each time it comes to a new dictionary.

Each time the the value HEADER appears marks the begining of a NEW story. So I would like to label the sentences that belong to the same story. And differentiate those that are different.

The data looks like the following:

sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
      {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]

The function

def prhases_and_labels(data):
    a1 = [d for d in data if 'a1' in d]
    text = []
    for i in a1:
        text.append(i['a1']['a'])
    
    df = pd.DataFrame({'text': text})
    return df

enter image description here

The result that I would like to obtain (with the labels in a new column) enter image description here

Advertisement

Answer

You can iterate over the records and increment the label every time the c value is HEADER.

sentences = [{'c': 'HEADER', 'a1': {'a': 'Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}},
      {'c': 'HEADER', 'a1': {'a': 'NEW Opus dei, la vie en rose.', 'x': 'l'}},
      {'d': 'm', 'a1': {'a': 'NEW Ipsum lorem, Suspendisse posuere.', 'x': '4'}},
      {'c': 'j', 'a1': {'a': 'NEW Nulla elementum, augue fringilla tincidunt ullamcorper.'}},
      {'c':'h', 'b': 'p'},
      {'a1': {'a': 'NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.'}}]


def prhases_and_labels(data):
    label = 0
    res = {'text':[], 'label': []}
    for record in data:
        if 'a1' in record:
            line = record['a1']['a']
            if record.get('c') == 'HEADER':
                label += 1
                
            res['text'].append(line)
            res['label'].append(label)
            
    return pd.DataFrame(res)        

Output:

>>> prhases_and_labels(sentences)

                                                                 text  label
0                                           Opus dei, la vie en rose.      1
1                                   Ipsum lorem, Suspendisse posuere.      1
2             Nulla elementum, augue fringilla tincidunt ullamcorper.      1
3      Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      1
4                                       NEW Opus dei, la vie en rose.      2
5                               NEW Ipsum lorem, Suspendisse posuere.      2
6         NEW Nulla elementum, augue fringilla tincidunt ullamcorper.      2
7  NEW Ut sollicitudin mauris sem, ut ultricies ante accumsan dictum.      2
User contributions licensed under: CC BY-SA
1 People found this is helpful
Advertisement