I am working on an NLP assignment and having some problems removing duplicated strings from a pandas column.
The data I am using is tagged, so some rows are repeated because the same comment can have multiple tags. To deal with that, I grouped the data by ID and comment and aggregated the tags, like so:
    docs = docs.groupby(['ID2', 'comment']).agg({'tags': ', '.join})
After grouping the data, the tags column contained the same tag two or more times in some rows. I have tried to remove the duplicated tags so that only unique tags remain, but have not been successful. First, I tried
    docs['new_tags'] = (docs['tags'].str.split()
                        .apply(lambda x: OrderedDict.fromkeys(x).keys())
                        .str.join(' '))
but it did not remove the duplicated tags. So I tried a simple function to get the unique tags, but that was also not successful. The function is below:
    def remove_multiples(txt):
        tags = list()
        for t in txt.split():
            if not t in tags:
                tags.append(t)
        return ' '.join(tags)

    docs['new_tags'] = docs['tags'].map(remove_multiples)
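One possible reason both attempts fall short is that the tags were joined with `', '` but are split on whitespace only, so the commas stay attached to the tokens. The quick check below is only a sketch; the string `joined` is a made-up example, not a row from the real data.

```python
# Hypothetical aggregated value, joined with ', ' as in the groupby step.
joined = 'ti, sp, ti'

# split() with no argument splits on whitespace only, so commas stay attached.
tokens = joined.split()
print(tokens)

# 'ti,' and 'ti' are different strings, so an order-preserving
# de-duplication treats them as two distinct tags.
deduped = list(dict.fromkeys(tokens))
print(deduped)

# Splitting on the same separator that was used for joining gives clean tags.
clean = joined.split(', ')
print(clean)
```

If this is what is happening, splitting on `', '` instead of on whitespace should let either approach see the repeated tags as equal.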
Sample data is below:
    {'ID2': {0: '440', 1: '440', 2: '440', 3: '440', 4: '422', 5: '2422', 6: '422',
      7: '422', 8: '422', 9: '422', 10: '422', 11: '422', 12: '422', 13: '422', 14: '422',
      15: '422', 16: '422', 17: '422', 18: '422', 19: '422', 20: '422', 21: '422', 22: '422'},
     'comment': {0: 'prompt', 1: 'prompt', 2: 'prompt', 3: 'prompt', 4: 'prompt',
      5: 'prompt', 6: 'prompt', 7: 'great service', 8: 'great service', 9: 'great service',
      10: 'friendly', 11: 'friendly', 12: 'friendly', 13: 'friendly', 14: 'fairly organized',
      15: 'fairly organized', 16: 'fairly organized', 17: 'fairly organized',
      18: 'fairly organized', 19: 'fairly organized', 20: 'fairly organized',
      21: 'fairly organized', 22: 'fairly organized'},
     'tags': {0: 'sp', 1: 'sp', 2: 'in', 3: 'ps', 4: 'wr', 5: 'sa', 6: 'sa', 7: 'sp',
      8: 'gs', 9: 'po', 10: 'av', 11: 'hf', 12: 'cs', 13: 'fr', 14: 'gs', 15: 'ly',
      16: 'drt', 17: 'co', 18: 'sp', 19: 'na', 20: 'ps', 21: 'ti', 22: 'ti'}}
Answer
Is this what you want?
    docs = (
        docs.groupby(['ID2', 'comment'], as_index=False)
            .agg({'tags': lambda tags: ', '.join(tags.unique())})
    )

    >>> docs

       ID2           comment                             tags
    0  422  fairly organized  gs, ly, drt, co, sp, na, ps, ti
    1  422          friendly                   av, hf, cs, fr
    2  422     great service                       sp, gs, po
    3  422            prompt                           wr, sa
    4  440            prompt                       sp, in, ps
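If the column has already been aggregated and re-running the `groupby` is not convenient, the joined strings can also be cleaned up afterwards. The following is a minimal sketch under that assumption: the small `docs` frame and the helper name `unique_tags` are hypothetical, and the separator is assumed to be the same `', '` used when joining.

```python
import pandas as pd

# Hypothetical frame standing in for the already-aggregated data.
docs = pd.DataFrame({
    'ID2': ['440', '422'],
    'comment': ['prompt', 'prompt'],
    'tags': ['sp, sp, in, ps', 'wr, sa, sa'],
})

def unique_tags(txt, sep=', '):
    """Drop repeated tags while keeping first-seen order."""
    # Split on the same separator used for joining, then rely on
    # dict.fromkeys, which preserves insertion order (Python 3.7+).
    return sep.join(dict.fromkeys(txt.split(sep)))

docs['new_tags'] = docs['tags'].map(unique_tags)
print(docs[['tags', 'new_tags']])
```

Like `tags.unique()` in the aggregation above, this keeps the first occurrence of each tag in its original order.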