Let’s say I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame({'a': [1,2,2,2,2,1,1,1,2,2]})
>> df
a
0 1
1 2
2 2
3 2
4 2
5 1
6 1
7 1
8 2
9 2
I want to drop consecutive duplicates once a run of them exceeds a certain threshold n, keeping only the first n rows of that run. Let’s say that n=3. Then, my target dataframe is
>> df
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2
EDIT: Each set of consecutive repetitions is considered separately. In this example, rows 8 and 9 should be kept.
Answer
You can create a unique value for each group of consecutive equal values, then use groupby and head:
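import numpy as np
# The label increases by 1 every time the value differs from the previous row
group_value = np.cumsum(df.a.shift() != df.a)
# Keep at most the first 3 rows of each consecutive group
df.groupby(group_value).head(3)
# result:
a
0 1
1 2
2 2
3 2
5 1
6 1
7 1
8 2
9 2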
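The same idea can be wrapped in a small helper so the threshold n is a parameter rather than a hard-coded 3. This is only a sketch: the function name cap_consecutive_runs and its arguments are made up for illustration and are not part of pandas. It uses Series.ne and Series.cumsum instead of np.cumsum, which should behave the same way here.

```python
import pandas as pd

def cap_consecutive_runs(df, column, n):
    """Keep at most the first n rows of every run of consecutive equal values.

    Hypothetical helper, not a pandas API.
    """
    # True wherever a value differs from the previous row (the first row
    # compares against NaN, so it always starts a new run).
    starts_new_run = df[column].ne(df[column].shift())
    # Cumulative sum of the flags gives one integer label per run.
    run_id = starts_new_run.cumsum()
    # head(n) keeps the first n rows of each run and preserves the
    # original row order and index labels.
    return df.groupby(run_id).head(n)

df = pd.DataFrame({'a': [1, 2, 2, 2, 2, 1, 1, 1, 2, 2]})
result = cap_consecutive_runs(df, 'a', 3)
print(result)
```

With n=3, only row 4 is dropped (the fourth 2 in the first run of 2s); rows 8 and 9 are kept because they form a separate run.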