I have a large dataset of over 2M rows with the following structure: If I wanted to calculate the net debt for each person at each month I would do this: However the result is full of NA values, which I believe is a result of the dataframe not having the same amount of cash and debt variables for each

Pandas: groupby().apply() custom function when groups variables aren’t the same length?

I have a large dataset of over 2M rows with the following structure:

import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})

        name     month variable  value
0   Alex       May     Cash     44
1    Joe       May     Cash     47
2   Alex       May     Debt     64
3    Joe       May     Debt     67
4   Alex       May   Others     67
5    Joe       May   Others      9
6   Alex     April     Cash     83
7    Joe     April     Cash     21
8   Alex     April     Debt     36
9    Joe     April     Debt     87
10  Alex  February     Cash     70

JavaScript
​x
 
import pandas as pd
import numpy as np
​
np.random.seed(0)
df = pd.DataFrame({
    'name': ['Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex', 'Joe', 'Alex'],
    'month': ['May', 'May', 'May', 'May', 'May', 'May', 'April', 'April', 'April', 'April', 'February'],
    'variable': ['Cash', 'Cash', 'Debt', 'Debt', 'Others', 'Others', 'Cash', 'Cash', 'Debt', 'Debt', 'Cash'],
    'value': np.random.randint(low=0, high=100, size=11)
})
​
        name     month variable  value
0   Alex       May     Cash     44
1    Joe       May     Cash     47
2   Alex       May     Debt     64
3    Joe       May     Debt     67
4   Alex       May   Others     67
5    Joe       May   Others      9
6   Alex     April     Cash     83
7    Joe     April     Cash     21
8   Alex     April     Debt     36
9    Joe     April     Debt     87
10  Alex  February     Cash     70
​

If I wanted to calculate the net debt for each person at each month I would do this:

df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)

name  month
Alex  April     6    NaN
                8    NaN
      February  10   NaN
      May       0    NaN
                2    NaN
Joe   April     7    NaN
                9    NaN
      May       1    NaN
                3    NaN
Name: value, dtype: float64

JavaScript
 
df.groupby(['name', 'month']).apply(lambda x: x[x['variable'] == 'Debt'].value - x[x['variable'] == 'Cash'].value)
​
name  month
Alex  April     6    NaN
                8    NaN
      February  10   NaN
      May       0    NaN
                2    NaN
Joe   April     7    NaN
                9    NaN
      May       1    NaN
                3    NaN
Name: value, dtype: float64 
​

However the result is full of NA values, which I believe is a result of the dataframe not having the same amount of cash and debt variables for each person and month. Is there a way for me to avoid this and simply get the net debt for each month/person when possible and an NA for when it’s not?

Also, I’m kind of new to python and as I mentioned the dataset on which I’m working on is pretty large – so if anyone know a quicker/alternative method for this it would be greatly appreciated!

Answer

IIUC, use loc:

print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum() 
                                                     - d.loc[d["variable"].eq("Cash"),"value"].sum()))

name  month   
Alex  April       40
      February   -17
      May        -25
Joe   April        5
      May        -38
dtype: int64

JavaScript
 
print (df.groupby(['name', 'month']).apply(lambda d: d.loc[d["variable"].eq("Debt"),"value"].sum() 
                                                     - d.loc[d["variable"].eq("Cash"),"value"].sum()))
​
name  month   
Alex  April       40
      February   -17
      May        -25
Joe   April        5
      May        -38
dtype: int64
​

Advertisement

Answer