Problems Removing Duplicated Words from Pandas Row

I am working on an NLP assignment and having some problems removing duplicated strings from a pandas column.

The data I am using is tagged, so some rows are repeated because the same comment can have multiple tags. I grouped the data by ID and comment and aggregated the tags, like so:

docs = docs.groupby(['ID2', 'comment']).agg({'tags':', '.join})

After grouping the data, the tags column contained the same tag two or more times in some rows. I tried to remove the repeats so that each row keeps only its unique tags, but have not been successful. First, I tried

docs['new_tags'] = (docs['tags'].str.split()
                       .apply(lambda x: OrderedDict.fromkeys(x).keys())
                        .str.join(' '))

but it did not remove the duplicated tags. So I tried a simple function to get the unique tags, but that was also not successful. The function is below:

def remove_multiples(txt):
    tags = list()
    for t in txt.split():
        if not t in tags:
            tags.append(t)
    return ' '.join(tags)
docs['new_tags'] = docs['tags'].map(remove_multiples)

Sample data is below:

{'ID2': {0: '440', 1: '440', 2: '440', 3: '440', 4: '422', 5: '422', 6: '422', 
7: '422', 8: '422', 9: '422', 10: '422', 11: '422', 12: '422', 13: '422', 14: '422', 
15: '422', 16: '422', 17: '422', 18: '422', 19: '422', 20: '422', 21: '422', 22: '422'}, 
'comment': {0: 'prompt', 1: 'prompt', 2: 'prompt', 3: 'prompt', 4: 'prompt', 
5: 'prompt', 6: 'prompt', 7: 'great service', 8: 'great service', 9: 'great service', 
10: 'friendly', 11: 'friendly', 12: 'friendly', 13: 'friendly', 14: 'fairly organized', 
15: 'fairly organized', 16: 'fairly organized', 17: 'fairly organized',
18: 'fairly organized', 19: 'fairly organized', 20: 'fairly organized',
21: 'fairly organized', 22: 'fairly organized'}, 
'tags': {0: 'sp', 1: 'sp', 2: 'in', 3: 'ps', 4: 'wr', 5: 'sa', 6: 'sa', 7: 'sp', 
8: 'gs', 9: 'po', 10: 'av', 11: 'hf', 12: 'cs', 13: 'fr', 14: 'gs', 15: 'ly', 
16: 'drt', 17: 'co', 18: 'sp', 19: 'na', 20: 'ps', 21: 'ti', 22: 'ti'}}

Answer

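Is this what you want? Deduplicate the tags within each group with unique() before joining them, instead of cleaning up the joined string afterwards: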

docs = (
    docs.groupby(['ID2', 'comment'], as_index=False)
        .agg({'tags':lambda tags: ', '.join(tags.unique())})
)

>>> docs

   ID2           comment                             tags
0  422  fairly organized  gs, ly, drt, co, sp, na, ps, ti
1  422          friendly                   av, hf, cs, fr
2  422     great service                       sp, gs, po
3  422            prompt                           wr, sa
4  440            prompt                       sp, in, ps
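
If the tags have already been joined, a likely reason both attempts in the question fail is that str.split() with no argument splits on whitespace only, so the commas stay attached and 'ti,' and 'ti' compare as different strings. A minimal sketch of a repair, assuming the separator is exactly ', ' as in the groupby above (unique_tags is a hypothetical helper name, and the small frame is built by hand from two of the groups in the sample data):

```python
import pandas as pd

def unique_tags(joined, sep=", "):
    # Split on the full separator so no comma stays attached to a tag.
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return sep.join(dict.fromkeys(joined.split(sep)))

# Hand-built frame shaped like the grouped result in the question.
docs = pd.DataFrame({
    "ID2": ["440", "422"],
    "comment": ["prompt", "fairly organized"],
    "tags": ["sp, sp, in, ps", "gs, ly, drt, co, sp, na, ps, ti, ti"],
})

docs["new_tags"] = docs["tags"].map(unique_tags)
print(docs["new_tags"].tolist())
# ['sp, in, ps', 'gs, ly, drt, co, sp, na, ps, ti']
```

Deduplicating inside the groupby, as above, avoids the round trip through a joined string, so this is mainly useful when only the already-aggregated column is available.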
User contributions licensed under: CC BY-SA