I have a table in the form:
ID | DATE_ENCOUNTER | LOAD |
---|---|---|
151336 | 2017-08-22 | 40 |
151336 | 2017-08-23 | 40 |
151336 | 2017-08-24 | 40 |
151336 | 2017-08-25 | 40 |
151336 | 2017-09-05 | 50 |
151336 | 2017-09-06 | 50 |
151336 | 2017-10-16 | 51 |
151336 | 2017-10-17 | 51 |
151336 | 2017-10-18 | 51 |
151336 | 2017-10-30 | 50 |
151336 | 2017-10-31 | 50 |
151336 | 2017-11-01 | 50 |
151336 | 2017-12-13 | 62 |
151336 | 2018-01-03 | 65 |
151336 | 2018-02-09 | 60 |
Although the dates are not identical, some records are effectively duplicates (they fall within a 4-day delta). How do I drop the duplicates (the earliest records) in a dataframe when the timestamps/dates are close (within a 4-day delta) but not identical? The result should be a table like below:
ID | DATE_ENCOUNTER | LOAD |
---|---|---|
151336 | 2017-08-25 | 40 |
151336 | 2017-09-06 | 50 |
151336 | 2017-10-18 | 51 |
151336 | 2017-11-01 | 50 |
151336 | 2017-12-13 | 62 |
151336 | 2018-01-03 | 65 |
151336 | 2018-02-09 | 60 |
I have tried:

```python
m = df.groupby('ID').DATE_ENCOUNTER.apply(lambda x: x.diff().dt.days < 4)
m2 = df.ID.duplicated(keep=False) & (m | m.shift(-1))
df_dedup2 = df[~m2]
```
Here is some code to generate the dataframe:

```python
import pandas as pd

details = {
    'ID': [151336, 151336, 151336, 151336, 151336, 151336, 151336, 151336,
           151336, 151336, 151336, 151336, 151336, 151336, 151336],
    'DATE_ENCOUNTER': ['2017-08-22', '2017-08-23', '2017-08-24', '2017-08-25',
                       '2017-09-05', '2017-09-06', '2017-10-16', '2017-10-17',
                       '2017-10-18', '2017-10-30', '2017-10-31', '2017-11-01',
                       '2017-12-13', '2018-01-03', '2018-02-09'],
    'LOAD': [40, 40, 40, 40, 50, 50, 51, 51, 51, 50, 50, 50, 62, 65, 60]
}
df = pd.DataFrame(details)
# DATE_ENCOUNTER must be datetime, not string, for the .dt accessor to work
df['DATE_ENCOUNTER'] = pd.to_datetime(df['DATE_ENCOUNTER'])
```

Note there are more fields and more IDs.
Answer
You can use:
```python
df[(df.groupby('ID')
      ['DATE_ENCOUNTER']
      .diff(-1).dt.days.mul(-1)  # days until the next encounter
      .fillna(float('inf'))      # make sure the last row is kept
      .ge(4)                     # select diff >= 4
   )]
```
output:
```
        ID DATE_ENCOUNTER  LOAD
3   151336     2017-08-25    40
5   151336     2017-09-06    50
8   151336     2017-10-18    51
11  151336     2017-11-01    50
12  151336     2017-12-13    62
13  151336     2018-01-03    65
14  151336     2018-02-09    60
```
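One caveat worth knowing: because the mask compares each row only to the *next* row within its ID, a chain of encounters each fewer than 4 days apart collapses to its single last row, even if the chain spans more than 4 days overall. The snippet below packages the answer's approach as a small helper so it can be run end to end against the sample data; the function name `dedup_close_dates` and its `gap_days` parameter are illustrative, not part of the original answer.

```python
import pandas as pd

def dedup_close_dates(df, gap_days=4):
    """Keep only the last row of each per-ID run of encounters that are
    fewer than `gap_days` apart. (Illustrative helper, not a pandas API.)"""
    days_to_next = (df.groupby('ID')['DATE_ENCOUNTER']
                      .diff(-1).dt.days.mul(-1)   # days until the next encounter
                      .fillna(float('inf')))      # last row per ID is always kept
    return df[days_to_next.ge(gap_days)]

details = {
    'ID': [151336] * 15,
    'DATE_ENCOUNTER': pd.to_datetime(
        ['2017-08-22', '2017-08-23', '2017-08-24', '2017-08-25',
         '2017-09-05', '2017-09-06', '2017-10-16', '2017-10-17',
         '2017-10-18', '2017-10-30', '2017-10-31', '2017-11-01',
         '2017-12-13', '2018-01-03', '2018-02-09']),
    'LOAD': [40, 40, 40, 40, 50, 50, 51, 51, 51, 50, 50, 50, 62, 65, 60],
}
df = pd.DataFrame(details)

result = dedup_close_dates(df)
print(result['DATE_ENCOUNTER'].dt.strftime('%Y-%m-%d').tolist())
# ['2017-08-25', '2017-09-06', '2017-10-18', '2017-11-01',
#  '2017-12-13', '2018-01-03', '2018-02-09']
```

Because the grouping is on `ID`, the same call works unchanged when the frame contains multiple IDs, keeping the last row of each close-dated run per ID.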