I am working on an NLP assignment and having some problems removing duplicated strings from a pandas column.
The data I am using is tagged, so some rows are repeated because the same comment can have multiple tags. I grouped the data by ID and comment and aggregated the tags, like so:
docs = docs.groupby(['ID2', 'comment']).agg({'tags':', '.join})
After grouping, the tags column contained the same tag two or more times in some rows. I tried to remove the repeated tags so that each row keeps only unique tags, but without success. First, I tried
docs['new_tags'] = (
    docs['tags'].str.split()
        .apply(lambda x: OrderedDict.fromkeys(x).keys())
        .str.join(' ')
)
but it did not remove the duplicated tags. So I tried a simple function to get the unique tags, but that was also not successful. The function is below:
def remove_multiples(txt):
    tags = list()
    for t in txt.split():
        if not t in tags:
            tags.append(t)
    return ' '.join(tags)

docs['new_tags'] = docs['tags'].map(remove_multiples)
Sample data is below:
{'ID2': {0: '440', 1: '440', 2: '440', 3: '440', 4: '422', 5: '2422', 6: '422', 7: '422', 8: '422', 9: '422', 10: '422', 11: '422', 12: '422', 13: '422', 14: '422', 15: '422', 16: '422', 17: '422', 18: '422', 19: '422', 20: '422', 21: '422', 22: '422'}, 'comment': {0: 'prompt', 1: 'prompt', 2: 'prompt', 3: 'prompt', 4: 'prompt', 5: 'prompt', 6: 'prompt', 7: 'great service', 8: 'great service', 9: 'great service', 10: 'friendly', 11: 'friendly', 12: 'friendly', 13: 'friendly', 14: 'fairly organized', 15: 'fairly organized', 16: 'fairly organized', 17: 'fairly organized', 18: 'fairly organized', 19: 'fairly organized', 20: 'fairly organized', 21: 'fairly organized', 22: 'fairly organized'}, 'tags': {0: 'sp', 1: 'sp', 2: 'in', 3: 'ps', 4: 'wr', 5: 'sa', 6: 'sa', 7: 'sp', 8: 'gs', 9: 'po', 10: 'av', 11: 'hf', 12: 'cs', 13: 'fr', 14: 'gs', 15: 'ly', 16: 'drt', 17: 'co', 18: 'sp', 19: 'na', 20: 'ps', 21: 'ti', 22: 'ti'}}
Answer
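Is this what you want? Run it on the original, ungrouped data in place of your first groupby, so that repeated tags are dropped before they are joined: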
docs = (
    docs.groupby(['ID2', 'comment'], as_index=False)
        .agg({'tags': lambda tags: ', '.join(tags.unique())})
)

>>> docs
   ID2           comment                             tags
0  422  fairly organized  gs, ly, drt, co, sp, na, ps, ti
1  422          friendly                   av, hf, cs, fr
2  422     great service                       sp, gs, po
3  422            prompt                           wr, sa
4  440            prompt                       sp, in, ps
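If the tags have already been joined into one string per row, they can also be cleaned up afterwards. One likely reason the attempts in the question missed duplicates is that `split()` with no argument splits on whitespace only, so `'ti, ti'` becomes `'ti,'` and `'ti'`, which do not compare equal. The sketch below splits on the same `', '` separator that was used for joining. The frame `raw` is a small made-up subset of the sample data, and `dedupe_tags` is a hypothetical helper name, not part of pandas.

```python
import pandas as pd

# Hypothetical subset of the sample data, for illustration only.
raw = pd.DataFrame({
    'ID2': ['440', '440', '440', '440', '422', '422'],
    'comment': ['prompt'] * 4 + ['friendly'] * 2,
    'tags': ['sp', 'sp', 'in', 'ps', 'av', 'av'],
})

# Same aggregation as in the question: repeated tags survive the join.
joined = raw.groupby(['ID2', 'comment'], as_index=False).agg({'tags': ', '.join})


def dedupe_tags(text, sep=', '):
    """Drop repeated tags from a joined string, keeping first-seen order."""
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return sep.join(dict.fromkeys(text.split(sep)))


joined['new_tags'] = joined['tags'].map(dedupe_tags)
print(joined)
```

Deduplicating inside the groupby, as above, avoids the extra pass. This variant is mainly useful when only the joined strings are available.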