I have a large dataset of over 2M rows with the following structure:
JavaScript
x
24
24
1
import pandas as pd
2
import numpy as np
3
4
np.random.seed(0)
5
df = pd.DataFrame({
6
'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
7
'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
8
'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
9
'value': np.random.randint(low=0, high=100, size=11)
10
})
11
12
name month variable value
13
0 Alex May Cash 44
14
1 Joe May Cash 47
15
2 Alex May Debt 64
16
3 Joe May Debt 67
17
4 Alex May Others 67
18
5 Joe May Others 9
19
6 Alex April Cash 83
20
7 Joe April Cash 21
21
8 Alex April Debt 36
22
9 Joe April Debt 87
23
10 Alex February Cash 70
24
If I wanted to calculate the net debt for each person at each month I would do this:
JavaScript
1
14
14
1
df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)
2
3
name month
4
Alex April 6 NaN
5
8 NaN
6
February 10 NaN
7
May 0 NaN
8
2 NaN
9
Joe April 7 NaN
10
9 NaN
11
May 1 NaN
12
3 NaN
13
Name: value, dtype: float64
14
However the result is full of NA values, which I believe is a result of the dataframe not having the same amount of cash and debt variables for each person and month. Is there a way for me to avoid this and simply get the net debt for each month/person when possible and an NA for when it’s not?
Also, I’m kind of new to python and as I mentioned the dataset on which I’m working on is pretty large – so if anyone know a quicker/alternative method for this it would be greatly appreciated!
Advertisement
Answer
IIUC, use loc
:
JavaScript
1
11
11
1
print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum()
2
- d.loc[d["variable"].eq("Cash"),"value"].sum()))
3
4
name month
5
Alex April 40
6
February -17
7
May -25
8
Joe April 5
9
May -38
10
dtype: int64
11