I have dataset with the following columns:
- encounterDate
- FormName
- PersonID
In the FormName, I have the following forms:
- Baseline
- Follow-up
Here’s a sample data:
encounterDate, FormName, PersonID 2019-01-12, Baseline, 01 2020-01-01, Baseline, 01 2019-04-12, Follow-up, 01 2019-13-12, Follow-up, 01 2020-15-01, Follow-up, 01
I would like to have the following table:
encounterDate, FormName, PersonID, Previous_date 2019-01-12, Baseline, 01, null 2020-01-01, Baseline, 01, 2019-01-12 2019-04-12, Follow-up, 01, null 2019-13-12, Follow-up, 01, 2019-04-12 2020-15-01, Follow-up, 01, 2019-13-12
How do I write this code in Python?
Additionally, I would like to also rank them:
encounterDate, FormName, PersonID, Previous_date, Rank 2019-01-12, Baseline, 01, null, 1 2020-01-01, Baseline, 01, 2019-01-12, 2 2019-04-12, Follow-up, 01, null, 1 2019-13-12, Follow-up, 01, 2019-04-12, 2 2020-15-01, Follow-up, 01, 2019-13-12, 3
Here’s my working code in SQL
select encounter_date,FormName,PersonID
, date((select max(enc.encounter_datetime)
from encounter enc
where enc.patient_id=e.patient_id
and enc.encounter_type=e.encounter_type
and date(e.encounter_datetime)>date(enc.encounter_datetime))) previous_date
from encounter e
Thank you in advance.
John
Advertisement
Answer
This would be fairly straight forward in pandas
It looks like you need to groupby both the PersonID and FormName to get the proper groupings. Within those groups you need to shift encounterDate and you need a cumulative count of the same.
cumcount starts at zero, so you may want to add 1 to the rank column to get the desired output.
import pandas as pd
df = pd.DataFrame({
'encounterDate': ['2019-01-12','2020-01-01','2019-04-12','2019-13-12','2020-15-01'],
'FormName': ['Baseline','Baseline','Follow-up','Follow-up','Follow-up'],
'PersonID': [1, 1, 1, 1, 1]
})
df[['Previous_date',
'Rank']] = df.groupby(['PersonID',
'FormName']).agg(Previous_date=('encounterDate','shift'),
Rank=('encounterDate','cumcount'))
df['Rank']+=1
Output
encounterDate FormName PersonID Previous_date Rank 0 2019-01-12 Baseline 1 NaN 1 1 2020-01-01 Baseline 1 2019-01-12 2 2 2019-04-12 Follow-up 1 NaN 1 3 2019-13-12 Follow-up 1 2019-04-12 2 4 2020-15-01 Follow-up 1 2019-13-12 3