I’m trying to calculate (x - x.mean()) / (x.std() + 0.01) on several columns of a dataframe, per group. My original dataframe is very large. Although I’ve split the original file into several chunks and I’m using multiprocessing to run the script on each chunk, every chunk is still very large and the process never finishes.
I used the following code:
df[val_cols] = df.groupby(['user_id', 'category', 'date'])[val_cols].apply(lambda x: (x - x.mean()) / (x.std() + 0.01))
In my experience, groupby, apply and join are not efficient for large dataframes, so I would like to find a way to replace the groupby and apply calls.
Does anyone know a better way to vectorize this process, instead of using groupby and apply?
I’m also not looking for multiprocessing libraries such as pandarallel, swifter or dask, because I’ve tried those and they didn’t help.
Sample df:
df = pd.DataFrame(
    columns=['user_id', 'category', 'date', 'item_id',
             'value_1_diff', 'value_2_diff', 'value_3_diff'],
    data=[[4, 8, '10/5/2019', 'hi', 5, 13, 21],
          [5, 10, '1/12/2019', 'fine', 6, 15, 11],
          [4, 8, '10/5/2019', 'hello', 5, 13, 3],
          [5, 10, '1/12/2019', 'hi', 7, 8, 9],
          [3, 8, '10/5/2019', 'ok', 5, 7, 13]])

   user_id  category       date item_id  value_1_diff  value_2_diff  value_3_diff
0        4         8  10/5/2019      hi             5            13            21
1        5        10  1/12/2019    fine             6            15            11
2        4         8  10/5/2019   hello             5            13             3
3        5        10  1/12/2019      hi             7             8             9
4        3         8  10/5/2019      ok             5             7            13
Answer
Not sure about performance, but you can use GroupBy.transform here, which broadcasts each group’s aggregate back to the original rows (note the aggregation is 'mean', to match the formula in the question):

val_cols = ['value_1_diff', 'value_2_diff', 'value_3_diff']
g = df.groupby(['user_id', 'category', 'date'])[val_cols]
df[val_cols] = df[val_cols].sub(g.transform('mean')).div(g.transform('std') + 0.01)
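If the two transform calls are still too slow, a sketch of an alternative is to aggregate each group only once with agg and broadcast the statistics back with a left merge. Whether this actually beats transform depends on your data; also note that single-row groups get std of NaN (pandas uses ddof=1), which the + 0.01 does not rescue:

```python
import pandas as pd

df = pd.DataFrame(
    columns=['user_id', 'category', 'date', 'item_id',
             'value_1_diff', 'value_2_diff', 'value_3_diff'],
    data=[[4, 8, '10/5/2019', 'hi', 5, 13, 21],
          [5, 10, '1/12/2019', 'fine', 6, 15, 11],
          [4, 8, '10/5/2019', 'hello', 5, 13, 3],
          [5, 10, '1/12/2019', 'hi', 7, 8, 9],
          [3, 8, '10/5/2019', 'ok', 5, 7, 13]])

keys = ['user_id', 'category', 'date']
val_cols = ['value_1_diff', 'value_2_diff', 'value_3_diff']

# One pass over the groups: compute mean and std per group.
agg = df.groupby(keys)[val_cols].agg(['mean', 'std'])
# Flatten the (column, stat) MultiIndex into e.g. 'value_1_diff_mean'.
agg.columns = [f'{col}_{stat}' for col, stat in agg.columns]

# Left merge keeps the original row order and index.
merged = df.merge(agg, left_on=keys, right_index=True, how='left')
for c in val_cols:
    df[c] = (merged[c] - merged[f'{c}_mean']) / (merged[f'{c}_std'] + 0.01)
```

This trades the second transform pass for a merge on the group keys; on wide val_cols lists the column flattening keeps the lookup cheap and explicit.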