Let's say I have a pandas DataFrame: I want to drop duplicates if they exceed a certain threshold n and replace them with that minimum. Let's say that n=3. Then, my target dataframe is EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept. Answer You can create unique value for

How to drop duplicates in pandas but keep more than the first

Let’s say I have a pandas DataFrame:

import pandas as pd

df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2

JavaScript
​x
 
import pandas as pd
​
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
   a
0  1
1  2
2  2
3  2
4  2
5  1
6  1
7  1
8  2
9  2
​

I want to drop duplicates if they exceed a certain threshold n and replace them with that minimum. Let’s say that n=3. Then, my target dataframe is

JavaScript
 
>> df
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  2
9  2
​

EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.

Answer

You can create unique value for each consecutive group, then use groupby and head:

group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)

# result:

   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  3
9  3

JavaScript
 
group_value = np.cumsum(df.a.shift() != df.a)
df.groupby(group_value).head(3)
​
# result:
​
   a
0  1
1  2
2  2
3  2
5  1
6  1
7  1
8  3
9  3
​

Advertisement

Answer