I have a DataFrame like the one below. I want to add a new column computed as Value = A (where Time=1) + A (where Time=3). A (where Time=5) should not be used.
Type  subType  Time  A  Value
X     a        1     3  =3+9=12
X     a        3     9
X     a        5     9
X     b        1     4  =4+5=9
X     b        3     5
X     b        5     0
Y     a        1     1  =1+2=3
Y     a        3     2
Y     a        5     3
Y     b        1     4  =4+5=9
Y     b        3     5
Y     b        5     2
I know how to do this by selecting the cells needed for the formula, but is there a better way to perform the calculation? I suspect I need to add a condition, but I am not sure how. Any suggestions?
Answer
Use Series.eq with DataFrame.groupby and Series.cumsum to create the groups, then sum A within each group.
c1 = df.Time.eq(1)
c3 = df.Time.eq(3)
df['Value'] = (df.loc[c1|c3]              # keep only the Time 1 and Time 3 rows
                 .groupby(c1.cumsum())    # a new group starts at every Time == 1 row
                 .A
                 .transform('sum')
                 .loc[c1])                # keep the sum on the Time == 1 rows only
print(df)
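To make the grouping step less opaque, here is a minimal sketch that rebuilds the sample data from the question and prints the labels produced by c1.cumsum(). The names group_id and sums are only illustrative, and the sketch assumes the rows are ordered so that every (Type, subType) block starts with its Time == 1 row.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Type": ["X"] * 6 + ["Y"] * 6,
    "subType": (["a"] * 3 + ["b"] * 3) * 2,
    "Time": [1, 3, 5] * 4,
    "A": [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

c1 = df["Time"].eq(1)
c3 = df["Time"].eq(3)

# Every Time == 1 row increases the running count by one, so all rows up
# to the next Time == 1 row share the same label (illustrative name).
group_id = c1.cumsum()
print(group_id.tolist())

# Sum A per label, using only the Time 1 and Time 3 rows.
sums = df.loc[c1 | c3].groupby(group_id)["A"].sum()
print(sums)
```

If the rows are not sorted this way, the labels no longer line up with the (Type, subType) blocks, so it may be worth sorting first or grouping on the key columns directly.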
Or, if you would rather identify the rows by Time not being equal to 5:
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum')
                      .where(c.shift(fill_value=True))
              )

# Another option is map
c = df['Time'].eq(5)
c_cumsum = c.cumsum()
df['value'] = (c_cumsum.map(df['A'].mask(c)
                                   .groupby(c_cumsum)
                                   .sum())
                       .where(c.shift(fill_value=True)))
Output
   Type subType  Time  A  Value
0     X       a     1  3   12.0
1     X       a     3  9    NaN
2     X       a     5  9    NaN
3     X       b     1  4    9.0
4     X       b     3  5    NaN
5     X       b     5  0    NaN
6     Y       a     1  1    3.0
7     Y       a     3  2    NaN
8     Y       a     5  3    NaN
9     Y       b     1  4    9.0
10    Y       b     3  5    NaN
11    Y       b     5  2    NaN
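The mask / cumsum / shift chain is easier to follow with the intermediate pieces laid side by side. The sketch below is only an illustration on the question's sample data. The helper frame steps and its column names are made up for the example.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Type": ["X"] * 6 + ["Y"] * 6,
    "subType": (["a"] * 3 + ["b"] * 3) * 2,
    "Time": [1, 3, 5] * 4,
    "A": [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

c = df["Time"].eq(5)

# Hypothetical helper frame showing each intermediate Series.
steps = pd.DataFrame({
    "Time": df["Time"],
    "A_masked": df["A"].mask(c),        # A with the Time == 5 rows set to NaN
    "group": c.cumsum(),                # label that increments on each Time == 5 row
    "keep": c.shift(fill_value=True),   # True on row 0 and on rows right after a Time == 5 row
})
print(steps)
```

Each Time == 5 row takes the label of the block that follows it. That does not change the sums, because its A value is masked to NaN and skipped, and the final where(...) keeps the result only on the rows flagged by keep.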
MISSING VALUES
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum')

              )
#or method 1
#c1 = df.Time.eq(1)
#c3 = df.Time.eq(3)
#df['Value'] = (df.loc[c1|c3]
#                 .groupby(c1.cumsum())
#                 .A
#                 .transform('sum')
#              )
print(df)
Output
   Type subType  Time  A  value
0     X       a     1  3   12.0
1     X       a     3  9   12.0
2     X       a     5  9    9.0
3     X       b     1  4    9.0
4     X       b     3  5    9.0
5     X       b     5  0    3.0
6     Y       a     1  1    3.0
7     Y       a     3  2    3.0
8     Y       a     5  3    9.0
9     Y       b     1  4    9.0
10    Y       b     3  5    9.0
11    Y       b     5  2    0.0
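Or, to fill every row except those where Time is 5 (in the output above those rows pick up the following group's sum, because c.cumsum() increments on the Time = 5 row itself):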
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum').mask(c))

#or method 1
#c1 = df.Time.eq(1)
#c3 = df.Time.eq(3)
#df['Value'] = (df.loc[c1|c3]
#                 .groupby(c1.cumsum())
#                 .A
#                 .transform('sum')
#                 .loc[c1|c3])
print(df)

   Type subType  Time  A  value
0     X       a     1  3   12.0
1     X       a     3  9   12.0
2     X       a     5  9    NaN
3     X       b     1  4    9.0
4     X       b     3  5    9.0
5     X       b     5  0    NaN
6     Y       a     1  1    3.0
7     Y       a     3  2    3.0
8     Y       a     5  3    NaN
9     Y       b     1  4    9.0
10    Y       b     3  5    9.0
11    Y       b     5  2    NaN
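The cumsum-based groups above depend on the row order. As a hedged alternative, and not part of the original answer, the same sums can be obtained by grouping on the Type and subType columns directly. This is a minimal sketch on the question's sample data, with wanted and group_sum as illustrative names.

```python
import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Type": ["X"] * 6 + ["Y"] * 6,
    "subType": (["a"] * 3 + ["b"] * 3) * 2,
    "Time": [1, 3, 5] * 4,
    "A": [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

# Rows that take part in the formula (Time 1 and Time 3).
wanted = df["Time"].isin([1, 3])

group_sum = (
    df["A"].where(wanted)                    # NaN for rows that must not count
    .groupby([df["Type"], df["subType"]])    # group on the real key columns
    .transform("sum")                        # broadcast each group's sum to its rows
)

# Keep the result on the Time == 1 rows only, as in the first output.
df["Value"] = group_sum.where(df["Time"].eq(1))
print(df)
```

Dropping the final where(...) leaves the sum on every row of each group, including the Time == 5 rows.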
Why not use apply here?
Even on a small DataFrame, apply is already slower:
%%timeit

(
    df.groupby(by=['Type','subType'])
      .apply(lambda x: x.loc[x.Time!=5].A.sum()) # sum time each group exclu
      .to_frame('Value').reset_index()
      .pipe(lambda x: pd.merge(df, x, on=['Type', 'subType'], how='left'))
)
13.6 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


%%timeit
c = df['Time'].eq(5)
df['value'] = (df['A'].mask(c)
                      .groupby(c.cumsum())
                      .transform('sum')
                      .where(c.shift(fill_value=True))
              )

3.67 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
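The %%timeit cells above only run inside IPython or Jupyter. A rough way to repeat the comparison in a plain script is the standard timeit module, sketched below. The function names with_apply and with_mask are made up for the example, the apply variant selects the Time and A columns before calling apply (a small departure from the cell above), and with_mask returns the Series instead of assigning it. Timings will differ by machine and pandas version, so none are shown.

```python
import timeit

import pandas as pd

# Sample data reconstructed from the table in the question.
df = pd.DataFrame({
    "Type": ["X"] * 6 + ["Y"] * 6,
    "subType": (["a"] * 3 + ["b"] * 3) * 2,
    "Time": [1, 3, 5] * 4,
    "A": [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})


def with_apply(frame):
    """groupby + apply + merge, similar to the slower cell above."""
    sums = (
        frame.groupby(["Type", "subType"])[["Time", "A"]]
        .apply(lambda g: g.loc[g["Time"] != 5, "A"].sum())
        .to_frame("Value")
        .reset_index()
    )
    return frame.merge(sums, on=["Type", "subType"], how="left")


def with_mask(frame):
    """mask + cumsum groups + transform, as in the faster cell above."""
    c = frame["Time"].eq(5)
    return (
        frame["A"].mask(c)
        .groupby(c.cumsum())
        .transform("sum")
        .where(c.shift(fill_value=True))
    )


# Time 100 calls of each variant and report the total in seconds.
t_apply = timeit.timeit(lambda: with_apply(df), number=100)
t_mask = timeit.timeit(lambda: with_mask(df), number=100)
print(f"apply: {t_apply:.4f} s for 100 runs")
print(f"mask:  {t_mask:.4f} s for 100 runs")
```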