With the following data, I think I want a column (`DESIRED_DURATION_COL`

) to work out the duration (according to `start_datetime`

) of consecutive Truths:

project_id | start_datetime | diag_local_code | DESIRED_DURATION_COL |
---|---|---|---|

1 | 2017-01-18 | False | 0 |

1 | 2019-04-14 | True | 0 |

1 | 2019-04-17 | True | 3 |

1 | 2019-04-19 | False | 0 |

1 | 2019-04-23 | True | 0 |

1 | 2019-04-25 | True | 2 |

1 | 2019-04-30 | True | 7 |

1 | 2019-05-21 | False | 0 |

This is so I can get the average truth duration (mean), per `project_id`

, such that I get a df like:

project_id | avg_duration |
---|---|

1 | 5 |

2 | 8 |

3 | 2 |

Can’t work out how to do this, thanks in advance!

## Answer

Solution for calculating `duration`

:

m = df['diag_local_code'] dt = df[m].groupby(['project_id', (~m).cumsum()])['start_datetime'].transform('first') df['duration'] = df['start_datetime'].sub(dt).dt.days.fillna(0)

How this works?

Use `cumsum`

on inverted `diag_local_code`

to identify groups of consecutive ones per `project_id`

, then filter the rows where `diag_local_code`

is `True`

then group the filtered dataframe and transform `start_datetime`

with `first`

to broadcast first date value across each group, finally subtract the broadcasted date value from `start_datetime`

to calculate the desired duration

Result

project_id start_datetime diag_local_code duration 0 1 2017-01-18 False 0.0 1 1 2019-04-14 True 0.0 2 1 2019-04-17 True 3.0 3 1 2019-04-19 False 0.0 4 1 2019-04-23 True 0.0 5 1 2019-04-25 True 2.0 6 1 2019-04-30 True 7.0 7 1 2019-05-21 False 0.0

Solution for calculating average consecutive duration of `True`

values

m = df['diag_local_code'] ( df[m].groupby(['project_id', (~m).cumsum()])['start_datetime'] .agg(np.ptp).dt.days.groupby(level=0).mean().reset_index(name='avg_duration') )

Result:

project_id avg_duration 0 1 5.0