I have a large dataset of over 2M rows with the following structure:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})
name month variable value
0 Alex May Cash 44
1 Joe May Cash 47
2 Alex May Debt 64
3 Joe May Debt 67
4 Alex May Others 67
5 Joe May Others 9
6 Alex April Cash 83
7 Joe April Cash 21
8 Alex April Debt 36
9 Joe April Debt 87
10 Alex February Cash 70
If I wanted to calculate the net debt for each person in each month, I would do this:
df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)
name month
Alex April 6 NaN
8 NaN
February 10 NaN
May 0 NaN
2 NaN
Joe April 7 NaN
9 NaN
May 1 NaN
3 NaN
Name: value, dtype: float64
However, the result is full of NA values, which I believe happens because the dataframe does not have the same number of Cash and Debt rows for each person and month. Is there a way for me to avoid this and simply get the net debt for each person/month when possible, and an NA when it's not?
Also, I'm fairly new to Python, and as I mentioned the dataset I'm working with is pretty large, so if anyone knows a quicker/alternative method it would be greatly appreciated!
Answer
IIUC, the NaNs come from index alignment: inside each group, the Debt rows and the Cash rows have different row labels, so subtracting the two Series aligns on labels that never overlap and every entry becomes NaN. Reduce each side to a scalar with sum() before subtracting, using loc to select the values:
print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum()
- d.loc[d["variable"].eq("Cash"),"value"].sum()))
name  month
Alex  April      -47
      February   -70
      May         20
Joe   April       66
      May         20
dtype: int64
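Since you mentioned the dataset has over 2M rows: a fully vectorized alternative is to reshape with pivot_table so each variable becomes its own column, then subtract columns directly. This avoids calling a Python lambda once per group, which tends to be the bottleneck in groupby().apply at that scale. A sketch using the sample data from the question:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe',
             'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May',
              'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others',
                 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})

# Pivot so each variable becomes a column; any duplicate
# (name, month, variable) rows are summed together.
wide = df.pivot_table(index=['name', 'month'], columns='variable',
                      values='value', aggfunc='sum')

# Vectorized column subtraction: rows missing either component
# (e.g. Alex/February has no Debt row) come out as NaN.
net_debt = wide['Debt'] - wide['Cash']
print(net_debt)
```

Note one behavioral difference from the sum()-based answer above: where a group lacks a Debt or Cash row entirely (Alex in February), this gives NaN instead of treating the missing side as zero, which matches the "NA when it's not possible" behavior asked for in the question.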