How to efficiently apply an operation to each group in pandas

I have a data frame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1,22], [1,23], [1,44], [2, 33], [2, 55]], columns=['id', 'delay'])

   id  delay
0   1     22
1   1     23
2   1     44
3   2     33
4   2     55

I group by id and apply a rolling operation to the delay column, like this:

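k = [0.1, 0.5, 1]

def f(d):
    # Pad with two zeros so the first rows of each group get a full window of 3.
    # pd.concat replaces Series.append, which was removed in pandas 2.0.
    padded = pd.concat([pd.Series([0, 0]), d['delay']])
    d['new_delay'] = padded.rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d

df.groupby(['id']).apply(f)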

   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5

This works, but I am curious whether .apply on a grouped data frame is vectorized. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? I am also curious how pandas and NumPy achieve vectorized calculation, given that Python is single-threaded and I am running on a CPU.

Answer

You can use strides for vectorized rolling with GroupBy.transform:

k = [0.1, 0.5, 1]

def rolling_window(a, window):
    # Build a 2-D view of overlapping windows over the last axis without copying.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def f(d):
    # Pad with two zeros, then take the weighted sum of each window of 3.
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)

df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
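
Note that transform still calls f once per group in Python, so with very many small groups the per-call overhead can add up. As a rough sketch of another option (not part of the approach above), the same weighted sum can be written as a sum of group-wise shifts, so that each step runs over the whole column at once. The helper name weighted_rolling_by_group and its parameters are made up for this illustration, and it assumes the weights in k are ordered oldest value first, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1, 22], [1, 23], [1, 44], [2, 33], [2, 55]],
    columns=["id", "delay"],
)
k = [0.1, 0.5, 1]  # weights, oldest value first (same order as in the question)


def weighted_rolling_by_group(frame, key, col, weights):
    """Hypothetical helper: weighted rolling sum per group, built from shifts."""
    grouped = frame.groupby(key)[col]
    result = pd.Series(0.0, index=frame.index)
    # reversed(weights) pairs lag 0 with the last weight, lag 1 with the one before it, ...
    for lag, w in enumerate(reversed(weights)):
        # shift() never crosses a group boundary; rows with no earlier value count as 0
        lagged = frame[col] if lag == 0 else grouped.shift(lag).fillna(0)
        result += w * lagged
    return result


df["new_delay"] = weighted_rolling_by_group(df, "id", "delay", k)
print(df)
```

On the sample frame this gives the same new_delay values as above. The Python loop here runs once per weight (three times), not once per group, which is why it may scale better when the number of groups is large; it is worth timing both versions on your real data before choosing.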

Answer