I am quite new to pandas, but I use Python at a good level.
I have a pandas dataframe which is organized as follows:
      idrun  idbasin  time    q
    -192540        1     0    0
    -192540        1     1  0.5
    -192540        2     0    0
    -192540        2     1    1
    -192540        3     0    0
    -192540        3     1    1
    -192541        1     0    0
    -192541        1     1  0.5
    -192541        2     0    0
    -192541        2     1    1
    -192541        3     0    0
    -192541        3     1    1
It is a fairly large dataframe (7 columns and ~600k rows).
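For anyone who wants to reproduce the example, the sample above could be built roughly as follows. This is only a sketch: the four column names come from the table, the dtypes (integers for the ids and time, float for q) are an assumption, and the remaining three columns of the real dataframe are not shown.

```python
import pandas as pd

# Sample rows copied from the table above; the dtypes are assumed.
df = pd.DataFrame({
    "idrun":   [-192540] * 6 + [-192541] * 6,
    "idbasin": [1, 1, 2, 2, 3, 3] * 2,
    "time":    [0, 1] * 6,
    "q":       [0, 0.5, 0, 1, 0, 1] * 2,
})
print(df)
```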
What I would like to do is: given a tuple containing values referring to the idbasin column (e.g. (1,2)), if the idrun value is the same:
- sum the q column of the referred idbasin values, i.e. for the example it would be (1,2)
- remove the rows corresponding to that idrun value and the tuple-specified idbasin values
- insert the summed values with idbasin equal to the first number of the tuple.
Referring to my example df, the results would be
      idrun  idbasin  time    q
    -192540        1     0    0
    -192540        1     1  1.5
    -192540        3     0    0
    -192540        3     1    1
    -192541        1     0    0
    -192541        1     1  1.5
    -192541        3     0    0
    -192541        3     1    1
My solution would be to use groupby to turn the df into a dict and then do the operation with one or two for loops, but I understand that iterating in pandas is not the optimal solution, so I believe there could be a "pandas" solution using the df.
Answer
You can replace the idbasin values that appear in the tuple with the tuple's first value using Series.mask, and then aggregate with sum:
    tup = (1, 2)

    # Relabel every idbasin listed in tup as the first value of tup
    df['idbasin'] = df['idbasin'].mask(df['idbasin'].isin(tup), tup[0])
    # alternative (needs: import numpy as np)
    #df['idbasin'] = np.where(df['idbasin'].isin(tup), tup[0], df['idbasin'])

    # Rows that now share idrun, idbasin and time are summed into one
    df = df.groupby(['idrun', 'idbasin', 'time'], as_index=False)['q'].sum()
    print(df)

Output:

        idrun  idbasin  time    q
    0 -192541        1     0  0.0
    1 -192541        1     1  1.5
    2 -192541        3     0  0.0
    3 -192541        3     1  1.0
    4 -192540        1     0  0.0
    5 -192540        1     1  1.5
    6 -192540        3     0  0.0
    7 -192540        3     1  1.0
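To try this end to end, here is a self-contained sketch that rebuilds the sample frame from the question and wraps the two steps in a function. The name merge_basins and its signature are hypothetical (not part of pandas), and the sample dtypes are assumed. Unlike the snippet above, it works on a copy, so the original df is left unchanged.

```python
import pandas as pd


def merge_basins(df, tup):
    """Sum q over the idbasin values in tup, per idrun and time.

    Hypothetical helper: the merged rows keep tup[0] as their idbasin.
    """
    out = df.copy()
    # Relabel every idbasin listed in tup as the first element of tup.
    out["idbasin"] = out["idbasin"].mask(out["idbasin"].isin(tup), tup[0])
    # Rows that now share (idrun, idbasin, time) are collapsed by summing q.
    return out.groupby(["idrun", "idbasin", "time"], as_index=False)["q"].sum()


# Sample data from the question (dtypes assumed).
df = pd.DataFrame({
    "idrun":   [-192540] * 6 + [-192541] * 6,
    "idbasin": [1, 1, 2, 2, 3, 3] * 2,
    "time":    [0, 1] * 6,
    "q":       [0, 0.5, 0, 1, 0, 1] * 2,
})

result = merge_basins(df, (1, 2))
print(result)
```

Because the aggregation selects only q, the result contains just idrun, idbasin, time and q. The other columns of the real dataframe would each need their own aggregation rule (for example by passing a dict to agg), which depends on what those columns mean.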