I have a dataset with 4000 rows in which some rows are duplicated (e.g. 2, 3, or 4 times). I want to find the cumsum of the duplicates over time.
I used this code to assign a duplicate count (duplicity) to each row, but it rearranged the position of the IDs:
    df = duplicate_df.value_counts(sort=False, dropna=False).reset_index(name="Duplicity")
Output
       ID  Time  Duplicity
    12345  2020          2
    12345  2020          2
    34567  2021          1
    34696  2020          3
    34696  2020          3
    34696  2020          3
Whereas I want to add the Duplicity column while each ID stays in its original position:
       ID  Time  Duplicity
    34696  2020          3
    12345  2020          2
    12345  2020          2
    34696  2020          3
    34696  2020          3
    34567  2021          1
How can I find the cumsum of duplicity over time? Thank you.
Input data:
    d = {'ID': [34696, 12345, 12345, 34696, 34696, 34567],
         'Time': [2020, 2020, 2020, 2020, 2020, 2021]}
Answer
Use groupby and transform:
    df['Duplicity'] = df.groupby(['ID', 'Time'])['ID'].transform('size')
    print(df)

    # Output
          ID  Time  Duplicity
    0  34696  2020          3
    1  12345  2020          2
    2  12345  2020          2
    3  34696  2020          3
    4  34696  2020          3
    5  34567  2021          1
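For completeness, here is a self-contained sketch that builds the frame from the question's input data and applies the same groupby/transform step. It also adds a running counter with `cumcount`, which is only one possible reading of "cumsum of duplicity"; the question does not define it precisely, and the column name `Occurrence` is made up for illustration.

```python
import pandas as pd

# Input data from the question.
d = {'ID': [34696, 12345, 12345, 34696, 34696, 34567],
     'Time': [2020, 2020, 2020, 2020, 2020, 2021]}
df = pd.DataFrame(d)

# Size of each (ID, Time) group, broadcast back to every row.
# transform returns a result aligned with the original index,
# so the row order is left untouched.
df['Duplicity'] = df.groupby(['ID', 'Time'])['ID'].transform('size')

# Assumption: "cumsum" means a running occurrence number within each
# (ID, Time) group. 'Occurrence' is a hypothetical column name.
df['Occurrence'] = df.groupby(['ID', 'Time']).cumcount() + 1

print(df)
```

Because `transform` returns one value per input row rather than one per group, the IDs stay where they were, unlike with `value_counts`, which collapses the frame to one row per unique combination.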