The goal is to fill values only between two markers ('start' and 'end') with unique numbers (they will be used in a groupby later on). Notice how the values between 'end' and 'start' are still None in the desired output:
Code:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     dict(
...         flag=[None, 'start', None, None, 'end', 'start', 'end', None, 'start', None, 'end', None],
...     )
... )
>>> df
     flag
0    None
1   start
2    None
3    None
4     end
5   start
6     end
7    None
8   start
9    None
10    end
11   None
Answer
Usually problems like these are solved by fiddling with cumsum and shift.
The main idea for this solution is to identify rows where the number of “starts” seen is ahead of the number of “ends” seen by one.
The only assumption I made is that 'start' and 'end' alternate, beginning with a 'start'.
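If you are not sure your data satisfies that assumption, a quick check along these lines may help. This is only a sketch: the helper name flags_alternate is made up for illustration, and it accepts any iterable of flags (a plain list, or df['flag']). Note that it tolerates a final 'start' with no matching 'end'.

```python
def flags_alternate(flags):
    """Return True if the markers read 'start', 'end', 'start', ... in order."""
    # Keep only the marker values; None/NaN and any other value are skipped.
    markers = [f for f in flags if f in ('start', 'end')]
    # Even positions must hold 'start', odd positions must hold 'end'.
    return all(
        m == ('start' if i % 2 == 0 else 'end')
        for i, m in enumerate(markers)
    )


print(flags_alternate([None, 'start', None, 'end', 'start', 'end']))  # True
print(flags_alternate(['end', 'start']))                              # False
print(flags_alternate(['start', 'start', 'end']))                     # False
```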
>>> values = df['flag'].eq('start').cumsum()
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df
     flag  flag_periods
0    None          None
1   start             1
2    None             1
3    None             1
4     end             1
5   start             2
6     end             2
7    None          None
8   start             3
9    None             3
10    end             3
11   None          None
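Since the question mentions a later groupby, here is a rough sketch of how the labels might be used. The function name label_periods is hypothetical, and it is a slight variation on the code above: it uses starts.where(inside) instead of df['flag'].mask(...), so rows outside a period become NaN and the labels are floats. groupby drops NaN keys by default, so only rows inside a period are counted.

```python
import pandas as pd


def label_periods(flag: pd.Series) -> pd.Series:
    """Number each start..end run; rows outside a run become NaN."""
    starts = flag.eq('start').cumsum()
    # Number of 'end' markers seen strictly before the current row.
    ends_before = flag.eq('end').cumsum().shift(1).fillna(0)
    # Inside a run, starts is exactly one ahead of ends_before.
    inside = starts.sub(1).eq(ends_before)
    return starts.where(inside)


df = pd.DataFrame({'flag': [None, 'start', None, None, 'end', 'start',
                            'end', None, 'start', None, 'end', None]})
df['flag_periods'] = label_periods(df['flag'])

# Rows per period; NaN labels are excluded by groupby's default dropna=True.
sizes = df.groupby('flag_periods').size()
print(sizes)
```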
Visualization:
>>> df['values'] = df['flag'].eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df
     flag  values  end_cumsum  end_cumsum_s1  values-1  where
0    None       0           0            0.0        -1  False
1   start       1           0            0.0         0   True
2    None       1           0            0.0         0   True
3    None       1           0            0.0         0   True
4     end       1           1            0.0         0   True
5   start       2           1            1.0         1   True
6     end       2           2            1.0         1   True
7    None       2           2            2.0         1  False
8   start       3           2            2.0         2   True
9    None       3           2            2.0         2   True
10    end       3           3            2.0         2   True
11   None       3           3            3.0         2  False
edit: added .fillna(0) to account for dataframes where the first value in the 'flag' column is 'start'.
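To see why the fill matters, here is a small illustration on a made-up four-row frame that begins with 'start'. shift(1) leaves NaN in the first row, and NaN never compares equal to 0, so without the fill the first row would be left out of its period. The variable names are for illustration only.

```python
import pandas as pd

# Made-up example: the very first flag is 'start'.
df = pd.DataFrame({'flag': ['start', None, 'end', None]})

starts = df['flag'].eq('start').cumsum()
# shift(1) has nothing to shift into row 0, so it leaves NaN there.
ends_before = df['flag'].eq('end').cumsum().shift(1)

without_fill = starts.sub(1).eq(ends_before)
with_fill = starts.sub(1).eq(ends_before.fillna(0))

print(without_fill.tolist())  # [False, True, True, False]
print(with_fill.tolist())     # [True, True, True, False]
```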