I have a dataframe (named df), sorted by identifier, id_number and contract_year_month in that order, which currently looks like this:
| identifier | id_number | contract_year_month | collection_year_month |
|---|---|---|---|
| K001 | 1 | 2018-01-03 | 2018-01-09 |
| K001 | 1 | 2018-01-08 | 2018-01-10 |
| K001 | 2 | 2018-01-01 | 2018-01-05 |
| K001 | 2 | 2018-01-15 | 2018-01-18 |
| K002 | 4 | 2018-01-04 | 2018-01-07 |
| K002 | 4 | 2018-01-09 | 2018-01-15 |
and would like to add a column named ‘date_difference’ that consists of contract_year_month minus the collection_year_month from the previous row within the same identifier and id_number group (e.g. 2018-01-08 minus 2018-01-09), so that the df would be:
| identifier | id_number | contract_year_month | collection_year_month | date_difference |
|---|---|---|---|---|
| K001 | 1 | 2018-01-03 | 2018-01-09 | |
| K001 | 1 | 2018-01-08 | 2018-01-10 | -1 |
| K001 | 2 | 2018-01-01 | 2018-01-05 | |
| K001 | 2 | 2018-01-15 | 2018-01-18 | 10 |
| K002 | 4 | 2018-01-04 | 2018-01-07 | |
| K002 | 4 | 2018-01-09 | 2018-01-15 | 2 |
I have already converted the contract_year_month and collection_year_month columns to datetime, and tried a simple shift function and iloc, but neither works.
df["date_difference"] = df.groupby(["identifier", "id_number"])["contract_year_month"]
Is there any way to use groupby to get the difference between the current row's value and the previous row's value in another column, grouped by two identifier columns? (I’ve searched for an hour but couldn’t find a hint…) I would sincerely appreciate any advice.
Answer
Here is one potential way to do this.
First create a boolean mask that is True for every row that is not the first of its (identifier, id_number) group, then use numpy.where and Series.shift to create the column date_difference:
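    import numpy as np

    # True for every row that is not the first of its (identifier, id_number) group
    mask = df.duplicated(['identifier', 'id_number'])

    # Days between this row's contract date and the previous row's collection date;
    # rows where mask is False (the first row of each group) get NaN
    df['date_difference'] = np.where(
        mask,
        (df['contract_year_month'] - df['collection_year_month'].shift(1)).dt.days,
        np.nan,
    )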
[output]
      identifier  id_number contract_year_month collection_year_month  date_difference
    0       K001          1          2018-01-03            2018-01-09              NaN
    1       K001          1          2018-01-08            2018-01-10             -1.0
    2       K001          2          2018-01-01            2018-01-05              NaN
    3       K001          2          2018-01-15            2018-01-18             10.0
    4       K002          4          2018-01-04            2018-01-07              NaN
    5       K002          4          2018-01-09            2018-01-15              2.0
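Since the question asks specifically about groupby, here is a minimal sketch of an alternative that shifts collection_year_month within each (identifier, id_number) group. It is not part of the approach above. The sample frame is rebuilt from the question's data, and the helper name prev_collection is made up for illustration. The mask approach shifts over the whole frame, so it relies on each group's rows being adjacent, as they are in the sorted df from the question. The grouped shift should not need that.

```python
import pandas as pd

# Sample data rebuilt from the question, with both date columns as datetime
df = pd.DataFrame({
    "identifier": ["K001", "K001", "K001", "K001", "K002", "K002"],
    "id_number": [1, 1, 2, 2, 4, 4],
    "contract_year_month": pd.to_datetime([
        "2018-01-03", "2018-01-08", "2018-01-01",
        "2018-01-15", "2018-01-04", "2018-01-09",
    ]),
    "collection_year_month": pd.to_datetime([
        "2018-01-09", "2018-01-10", "2018-01-05",
        "2018-01-18", "2018-01-07", "2018-01-15",
    ]),
})

# Previous row's collection date within each (identifier, id_number) group;
# the first row of every group has no predecessor, so it becomes NaT.
# prev_collection is a hypothetical helper name, not from the original answer.
prev_collection = (
    df.groupby(["identifier", "id_number"])["collection_year_month"].shift(1)
)

# Timedelta converted to whole days; NaT rows end up as NaN
df["date_difference"] = (df["contract_year_month"] - prev_collection).dt.days

print(df)
```

Because of the NaN entries, the new column is stored as float, which matches the -1.0, 10.0 and 2.0 values in the output shown above.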