I have a large pandas DataFrame (85k rows) that looks as follows:
import pandas as pd

df1 = pd.DataFrame({
    "ID": [1, 1, 2, 3, 3, 3],
    "BEGDT": pd.to_datetime(["1986-01-01", "1989-01-01", "1988-01-01",
                             "1983-01-01", "1986-01-01", "1987-01-01"]),
    "ENDDT": pd.to_datetime(["1988-12-31", "1989-12-31", "1990-12-31",
                             "1985-12-31", "1986-12-31", "1990-12-31"]),
    "Inception": pd.to_datetime(["1984-12-04", "1984-12-04", "1987-06-07",
                                 "1982-05-08", "1982-05-08", "1982-05-08"]),
    "NAME": ["Juan", "Jerome", "Pedro", "Javier", "Pastor", "Daniel"],
})
My goal is the following: for the first observation of each ID for which BEGDT > Inception, copy the row, then set BEGDT to Inception and ENDDT to the BEGDT of the initially copied row minus 1 day.
Accordingly, the final output should look as follows:
df2 = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "BEGDT": pd.to_datetime(["1984-12-04", "1986-01-01", "1989-01-01",
                             "1987-06-07", "1988-01-01", "1982-05-08",
                             "1983-01-01", "1986-01-01", "1987-01-01"]),
    "ENDDT": pd.to_datetime(["1985-12-31", "1988-12-31", "1989-12-31",
                             "1987-12-31", "1990-12-31", "1982-12-31",
                             "1985-12-31", "1986-12-31", "1990-12-31"]),
    "Inception": pd.to_datetime(["1984-12-04", "1984-12-04", "1984-12-04",
                                 "1987-06-07", "1987-06-07", "1982-05-08",
                                 "1982-05-08", "1982-05-08", "1982-05-08"]),
    "NAME": ["Juan", "Juan", "Jerome", "Pedro", "Pedro",
             "Javier", "Javier", "Pastor", "Daniel"],
})
I assume that first, I have to group the data with df1.groupby("ID").first(), next do the calculations and finally, insert these rows into df1. However, I am not sure if this is the best way to do it.
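Roughly, what I had in mind was something like the following untested sketch (my own attempt, not necessarily correct or efficient):

# Untested sketch of my idea: take the first row per ID where BEGDT > Inception,
# build the new rows from it, and append them back to df1.
first_rows = (df1[df1.BEGDT > df1.Inception]
              .groupby("ID", as_index=False)
              .first())
new_rows = first_rows.assign(ENDDT=first_rows.BEGDT - pd.Timedelta("1 day"),
                             BEGDT=first_rows.Inception)
out = (pd.concat([df1, new_rows])
       .sort_values(["ID", "BEGDT"])
       .reset_index(drop=True))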
Any help would be appreciated.
Answer
Editing the values can be done on a copy of the DataFrame (we’ll call it tmp) to expedite things, rather than within the groupby on each individual group. We can then filter by BEGDT > Inception, use groupby.first (like you said) to get the index values, fetch those rows from our copy, and combine the two:
tmp = df1.copy()
# On the copy, shift each row: ENDDT becomes the day before the original BEGDT,
# and BEGDT becomes the Inception date.
tmp['ENDDT'] = tmp.BEGDT - pd.Timedelta('1 day')
tmp['BEGDT'] = tmp.Inception

# For each ID, grab the original index of the first row where BEGDT > Inception.
first_idx = (df1[df1.BEGDT > df1.Inception]
             .reset_index()
             .groupby('ID')
             .first()['index'])

# Pull those (already edited) rows from the copy, append them to df1 and sort.
# Sorting on BEGDT as well puts the new rows first within each ID.
result = (pd.concat([df1, tmp.loc[first_idx]])
          .sort_values(by=['ID', 'BEGDT'])
          .reset_index(drop=True))
result
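As a quick sanity check on the sample data, the result can be compared with the expected frame from the question (df2 refers to the expected output you built above); on this toy input the comparison should come out True:

# Compare against the expected output from the question; after the sort and
# index reset above, rows, index and dtypes should line up exactly.
print(result)
print(result.equals(df2))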