The goal is to fill values only between two markers ('start' and 'end') with unique numbers (they will be used in a groupby later on). Notice how the values between an 'end' and the next 'start' are still None in the desired output:
Code:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     dict(
...         flag=[None, 'start', None, None, 'end', 'start', 'end', None, 'start', None, 'end', None],
...     )
... )

>>> df
     flag
0    None
1   start
2    None
3    None
4     end
5   start
6     end
7    None
8   start
9    None
10    end
11   None
Answer
Problems like these are usually solved by fiddling with cumsum and shift.
The main idea for this solution is to identify rows where the number of “starts” seen is ahead of the number of “ends” seen by one. The count of “ends” is shifted down by one row so that each 'end' row still counts as part of its own period.
The only assumption I made is that 'start' and 'end' alternate, beginning with a 'start'.
>>> # running count of 'start' rows; this doubles as the period number
>>> values = df['flag'].eq('start').cumsum()
>>> # True where starts seen so far exceed ends seen before this row by one
>>> where = values.sub(1).eq(df['flag'].eq('end').cumsum().shift(1).fillna(0))
>>> # put the period number on those rows, keep the original value elsewhere
>>> df['flag_periods'] = df['flag'].mask(where, values)
>>> df
     flag  flag_periods
0    None          None
1   start             1
2    None             1
3    None             1
4     end             1
5   start             2
6     end             2
7    None          None
8   start             3
9    None             3
10    end             3
11   None          None
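Since the labels are meant for a later groupby, here is a minimal sketch of that step. The 'amount' column and its numbers are made up for illustration and are not part of the question. The sketch also swaps in Series.where on the counter instead of mask on the 'flag' column, so the label column is numeric with NaN outside the periods. groupby drops NaN keys by default, so rows outside any period are left out of the result.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "flag": [None, "start", None, None, "end", "start", "end",
                 None, "start", None, "end", None],
        # hypothetical payload column, invented for this example
        "amount": [10, 1, 2, 3, 4, 5, 6, 20, 7, 8, 9, 30],
    }
)

starts = df["flag"].eq("start").cumsum()
ends_before = df["flag"].eq("end").cumsum().shift(1).fillna(0)
inside = starts.sub(1).eq(ends_before)

# where() keeps the period number inside a period and yields NaN elsewhere
df["flag_periods"] = starts.where(inside)

# NaN keys are skipped by default (dropna=True), so only periods are summed
totals = df.groupby("flag_periods")["amount"].sum()
print(totals)
```

If the rows outside a period should form their own group, passing dropna=False to groupby keeps the NaN key.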
Visualization:
>>> # starting again from the original single-column df
>>> df['values'] = df['flag'].eq('start').cumsum()
>>> df['end_cumsum'] = df['flag'].eq('end').cumsum()
>>> df['end_cumsum_s1'] = df['end_cumsum'].shift(1).fillna(0)
>>> df['values-1'] = df['values'].sub(1)
>>> df['where'] = df['values-1'].eq(df['end_cumsum_s1'])
>>> df
     flag  values  end_cumsum  end_cumsum_s1  values-1  where
0    None       0           0            0.0        -1  False
1   start       1           0            0.0         0   True
2    None       1           0            0.0         0   True
3    None       1           0            0.0         0   True
4     end       1           1            0.0         0   True
5   start       2           1            1.0         1   True
6     end       2           2            1.0         1   True
7    None       2           2            2.0         1  False
8   start       3           2            2.0         2   True
9    None       3           2            2.0         2   True
10    end       3           3            2.0         2   True
11   None       3           3            3.0         2  False
Edit: added .fillna(0) to account for dataframes where the first value in the 'flag' column is 'start'. Without it, the shifted count is NaN in the first row, so the comparison there is always False.
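To make the alternation assumption explicit, the approach can be wrapped in a small function that checks it first. This is only a sketch: label_periods is a hypothetical helper name, and it returns numeric labels with NaN outside the periods (via Series.where) rather than the mask call used above. The example input puts 'start' in the very first row, the case that .fillna(0) was added for.

```python
import pandas as pd


def label_periods(flags: pd.Series) -> pd.Series:
    """Number each start..end period; rows outside a period get NaN.

    Hypothetical helper. Raises ValueError unless 'start' and 'end'
    alternate, beginning with 'start'. A trailing 'start' without a
    matching 'end' passes the check and labels all remaining rows.
    """
    markers = flags[flags.isin(["start", "end"])]
    expected = ["start", "end"] * (len(markers) // 2 + 1)
    if markers.tolist() != expected[: len(markers)]:
        raise ValueError("'start' and 'end' must alternate, beginning with 'start'")

    starts = flags.eq("start").cumsum()
    # ends seen strictly before each row; fillna(0) covers the first row
    ends_before = flags.eq("end").cumsum().shift(1).fillna(0)
    return starts.where(starts.sub(1).eq(ends_before))


# 'start' in the very first row
first_row_start = pd.Series(["start", None, "end", None])
labels = label_periods(first_row_start)
print(labels)
```

Checking up front costs one extra pass over the column, but it turns silently wrong labels into an error when the input has, for example, two 'start' rows in a row.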