I have dataset with the following columns:
- encounterDate
- FormName
- PersonID
In the FormName, I have the following forms:
- Baseline
- Follow-up
Here’s a sample data:
encounterDate, FormName, PersonID 2019-01-12, Baseline, 01 2020-01-01, Baseline, 01 2019-04-12, Follow-up, 01 2019-13-12, Follow-up, 01 2020-15-01, Follow-up, 01
I would like to have the following table:
encounterDate, FormName, PersonID, Previous_date 2019-01-12, Baseline, 01, null 2020-01-01, Baseline, 01, 2019-01-12 2019-04-12, Follow-up, 01, null 2019-13-12, Follow-up, 01, 2019-04-12 2020-15-01, Follow-up, 01, 2019-13-12
How do I write this code in Python?
Additionally, I would like to also rank them:
encounterDate, FormName, PersonID, Previous_date, Rank 2019-01-12, Baseline, 01, null, 1 2020-01-01, Baseline, 01, 2019-01-12, 2 2019-04-12, Follow-up, 01, null, 1 2019-13-12, Follow-up, 01, 2019-04-12, 2 2020-15-01, Follow-up, 01, 2019-13-12, 3
Here’s my working code in SQL
select encounter_date,FormName,PersonID , date((select max(enc.encounter_datetime) from encounter enc where enc.patient_id=e.patient_id and enc.encounter_type=e.encounter_type and date(e.encounter_datetime)>date(enc.encounter_datetime))) previous_date from encounter e
Thank you in advance.
John
Advertisement
Answer
This would be fairly straight forward in pandas
It looks like you need to groupby
both the PersonID
and FormName
to get the proper groupings. Within those groups you need to shift encounterDate
and you need a cumulative count of the same.
cumcount
starts at zero, so you may want to add 1 to the rank column to get the desired output.
import pandas as pd df = pd.DataFrame({ 'encounterDate': ['2019-01-12','2020-01-01','2019-04-12','2019-13-12','2020-15-01'], 'FormName': ['Baseline','Baseline','Follow-up','Follow-up','Follow-up'], 'PersonID': [1, 1, 1, 1, 1] }) df[['Previous_date', 'Rank']] = df.groupby(['PersonID', 'FormName']).agg(Previous_date=('encounterDate','shift'), Rank=('encounterDate','cumcount')) df['Rank']+=1
Output
encounterDate FormName PersonID Previous_date Rank 0 2019-01-12 Baseline 1 NaN 1 1 2020-01-01 Baseline 1 2019-01-12 2 2 2019-04-12 Follow-up 1 NaN 1 3 2019-13-12 Follow-up 1 2019-04-12 2 4 2020-15-01 Follow-up 1 2019-13-12 3