I have a large dataset of over 2M rows with the following structure:
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})

    name     month variable  value
0   Alex       May     Cash     44
1    Joe       May     Cash     47
2   Alex       May     Debt     64
3    Joe       May     Debt     67
4   Alex       May   Others     67
5    Joe       May   Others      9
6   Alex     April     Cash     83
7    Joe     April     Cash     21
8   Alex     April     Debt     36
9    Joe     April     Debt     87
10  Alex  February     Cash     70
If I wanted to calculate the net debt for each person at each month I would do this:
df.groupby(['name', 'month']).apply(
    lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value
)

name  month       
Alex  April     6    NaN
                8    NaN
      February  10   NaN
      May       0    NaN
                2    NaN
Joe   April     7    NaN
                9    NaN
      May       1    NaN
                3    NaN
Name: value, dtype: float64
However, the result is full of NaN values, which I believe happens because the dataframe does not have matching Cash and Debt rows for each person and month. Is there a way to avoid this and simply get the net debt for each month/person where possible, and an NA where it's not?
Also, I'm kind of new to Python and, as mentioned, the dataset I'm working with is pretty large, so if anyone knows a quicker/alternative method it would be greatly appreciated!
Answer
IIUC, use loc:
print(df.groupby(['name', 'month']).apply(
    lambda d: d.loc[d["variable"].eq("Debt"), "value"].sum()
              - d.loc[d["variable"].eq("Cash"), "value"].sum()
))

name  month   
Alex  April       40
      February   -17
      May        -25
Joe   April        5
      May        -38
dtype: int64
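Since you mention performance on 2M rows: a reshape-based approach avoids the per-group `apply` entirely and is usually much faster. A sketch of the same idea, vectorised with `pivot_table` (here `aggfunc='sum'` collapses any duplicate name/month/variable rows, and combinations that are absent come out as NaN, as you asked):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})

# Turn the 'variable' labels into columns, indexed by (name, month).
# Missing combinations (e.g. Alex/February has no Debt row) become NaN.
wide = df.pivot_table(index=['name', 'month'], columns='variable',
                      values='value', aggfunc='sum')

# Column-wise subtraction is vectorised; any NaN propagates, giving
# NA exactly where a Cash or Debt entry is missing.
net_debt = wide['Debt'] - wide['Cash']
```

With your seeded data, `net_debt` is 20 for Alex/May (64 - 44) and NaN for Alex/February, which only has a Cash row.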