Identify and count segments between a start and an end marker

Question

The goal is to fill values only between two markers ('start' and 'end') with unique numbers, which will be used in a groupby later on. Values between an 'end' and the next 'start' should remain None in the desired output.

Accepted Answer

Usually problems like these are solved by fiddling with cumsum and shift.

The main idea for this solution is to identify rows where the number of "starts" seen is ahead of the number of "ends" seen by one. The end count is shifted down by one row, so the 'end' row itself still counts as part of its segment.

The only assumption I made is that 'start' and 'end' alternate, beginning with a 'start'.

>>> values = df['flag'].eq('start').cumsum()
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df
     flag flag_periods
0    None         None
1   start            1
2    None            1
3    None            1
4     end            1
5   start            2
6     end            2
7    None         None
8   start            3
9    None            3
10    end            3
11   None         None

Visualization:

>>> df['values'] = df['flag'].eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df
     flag  values  end_cumsum  end_cumsum_s1  values-1  where
0    None       0           0            0.0        -1  False
1   start       1           0            0.0         0   True
2    None       1           0            0.0         0   True
3    None       1           0            0.0         0   True
4     end       1           1            0.0         0   True
5   start       2           1            1.0         1   True
6     end       2           2            1.0         1   True
7    None       2           2            2.0         1  False
8   start       3           2            2.0         2   True
9    None       3           2            2.0         2   True
10    end       3           3            2.0         2   True
11   None       3           3            3.0         2  False

Edit: added .fillna(0) to account for dataframes where the first value in the 'flag' column is 'start'. Without it, the shifted cumsum is NaN in the first row and that row would not be matched.
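For anyone who wants to run the approach end to end, here is a minimal self-contained sketch. The input frame is rebuilt from the output shown in the answer, and the helper name label_segments is made up for illustration. It uses shift(1, fill_value=0) in place of shift(1).fillna(0), which should be equivalent here, and casts to object first so the integer labels can sit next to the untouched missing values. The same assumption applies: 'start' and 'end' alternate, beginning with a 'start'.

```python
import pandas as pd


def label_segments(flags: pd.Series) -> pd.Series:
    """Number each start..end run (inclusive); rows outside a run are left as-is.

    Hypothetical helper. Assumes 'start' and 'end' alternate, starting with 'start'.
    """
    # Running count of 'start' markers seen up to and including each row.
    starts = flags.eq("start").cumsum()
    # Running count of 'end' markers seen strictly before each row, so the
    # 'end' row itself is still treated as inside its segment.
    ends_before = flags.eq("end").cumsum().shift(1, fill_value=0)
    # Inside a segment: exactly one more 'start' than 'end' has been seen.
    inside = starts.sub(1).eq(ends_before)
    # Cast to object so integer labels and missing values can share a column.
    return flags.astype(object).mask(inside, starts)


# Sample input reconstructed from the output shown in the answer.
df = pd.DataFrame(
    {
        "flag": [
            None, "start", None, None, "end",
            "start", "end", None,
            "start", None, "end", None,
        ]
    }
)
df["flag_periods"] = label_segments(df["flag"])
print(df)
```

The resulting flag_periods column can then be passed to groupby; by default groupby drops rows whose key is missing, so the rows outside any segment should fall out of the groups.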

Answer