I have a data frame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 22], [1, 23], [1, 44], [2, 33], [2, 55]], columns=['id', 'delay'])

   id  delay
0   1     22
1   1     23
2   1     44
3   2     33
4   2     55

I group by id and apply a rolling operation on the delay column, like below:
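
k = [0.1, 0.5, 1]

def f(d):
    # Pad with two leading zeros so the first rows still get a full 3-element window.
    # Series.append was removed in pandas 2.0; pd.concat is the equivalent.
    padded = pd.concat([pd.Series([0, 0]), d['delay']])
    d['new_delay'] = padded.rolling(window=3).apply(lambda x: np.sum(x * k)).iloc[2:]
    return d

df.groupby(['id']).apply(f)
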
   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5

It works fine, but I am curious whether .apply on a grouped data frame is vectorized. Since my dataset is huge, is there a better, vectorized way to do this kind of operation? Also, given that Python is single-threaded and I am running on a CPU, how do pandas and NumPy achieve vectorized calculation?
Answer
You can use strides for a vectorized rolling window together with GroupBy.transform:

k = [0.1, 0.5, 1]

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


def f(d):
    return np.sum(rolling_window(np.append([0, 0], d.to_numpy()), 3) * k, axis=1)

df['new_delay'] = df.groupby('id')['delay'].transform(f)
print(df)

   id  delay  new_delay
0   1     22       22.0
1   1     23       34.0
2   1     44       57.7
3   2     33       33.0
4   2     55       71.5
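
As a possible variation on the strided approach, here is a minimal sketch that builds the same windows with `numpy.lib.stride_tricks.sliding_window_view` instead of a hand-written `as_strided` call. This assumes NumPy 1.20 or later, where that function is available. The helper name `weighted_roll` is hypothetical and not part of the original answer. The zero padding and the weights `k` follow the code above.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view  # assumes NumPy >= 1.20

k = np.array([0.1, 0.5, 1.0])

def weighted_roll(values, weights=k):
    """Weighted rolling sum for one group (hypothetical helper name)."""
    # Left-pad with zeros so every row has a full window, as in the code above.
    padded = np.concatenate([np.zeros(len(weights) - 1),
                             values.to_numpy(dtype=float)])
    # Read-only view of shape (len(values), len(weights)); no data is copied.
    windows = sliding_window_view(padded, len(weights))
    # One matrix-vector product replaces the per-window Python lambda.
    return windows @ weights

df = pd.DataFrame([[1, 22], [1, 23], [1, 44], [2, 33], [2, 55]],
                  columns=['id', 'delay'])
df['new_delay'] = df.groupby('id')['delay'].transform(weighted_roll)
print(df)
```

On the sample frame this should reproduce the new_delay values shown above. The per-group call still runs in Python, so with a very large number of small groups the groupby loop itself may remain the bottleneck.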