
How to drop duplicates in pandas but keep more than the first

Let’s say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

I want to keep at most n rows from each run of consecutive duplicates and drop the rest of that run. Let’s say that n=3. Then, my target dataframe is

>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.


Answer

You can create a unique value for each group of consecutive equal rows, then use groupby and head to keep the first rows of each group:

import numpy as np
# A new group starts wherever the value differs from the previous row
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
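
For reuse with a different threshold or column, the same idea can be wrapped in a small function. This is only a sketch: the name cap_consecutive_runs and its parameters are made up for illustration and are not part of pandas. It uses the pandas cumsum method instead of np.cumsum, so NumPy does not need to be imported.

```python
import pandas as pd


def cap_consecutive_runs(df, column, n):
    """Keep at most n rows from each run of consecutive equal values.

    Hypothetical helper for illustration; not a pandas API.
    """
    # True wherever the value differs from the previous row, so the
    # cumulative sum gives every consecutive run its own integer label.
    run_id = (df[column] != df[column].shift()).cumsum()
    # head(n) keeps the first n rows of each run, in the original row order.
    return df.groupby(run_id).head(n)


df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
result = cap_consecutive_runs(df, 'a', 3)
print(result)
```

Only row 4 is dropped here, because it is the fourth row of the first run of 2s. Note that NaN never compares equal to NaN, so with this comparison each missing value starts a run of its own.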
User contributions licensed under: CC BY-SA