The goal is to fill values only between two values (start and end) with unique numbers (will be used in a groupby later on), notice how the values between end and start are still None in the desired output:
Code:
>>> df = pd.DataFrame(
dict(
flag=[None, 'start', None, None, 'end', 'start', 'end', None, 'start', None,'end',None],
)
)
>>> df
flag
0 None
1 start
2 None
3 None
4 end
5 start
6 end
7 None
8 start
9 None
10 end
11 None
Advertisement
Answer
Usually problems like these are solved by fiddling with cumsum and shift.
The main idea for this solution is to identify rows where the number of “starts” seen is ahead of the number of “ends” seen by one.
The only assumption I made is that 'start' and 'end' alternate, beginning with a 'start'.
>>> values = df['flag'].eq('start').cumsum()
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df
flag flag_periods
0 None None
1 start 1
2 None 1
3 None 1
4 end 1
5 start 2
6 end 2
7 None None
8 start 3
9 None 3
10 end 3
11 None None
Visualization:
>>> df['values'] = df.eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df
flag values end_cumsum end_cumsum_s1 values-1 where
0 None 0 0 0.0 -1 False
1 start 1 0 0.0 0 True
2 None 1 0 0.0 0 True
3 None 1 0 0.0 0 True
4 end 1 1 0.0 0 True
5 start 2 1 1.0 1 True
6 end 2 2 1.0 1 True
7 None 2 2 2.0 1 False
8 start 3 2 2.0 2 True
9 None 3 2 2.0 2 True
10 end 3 3 2.0 2 True
11 None 3 3 3.0 2 False
edit: added .fillna(0) to account for dataframes where the first value in the 'flag' column is 'start'.
