I’m trying to speed up or vectorise the following slow code:
import numpy as np

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + (s.std() * 1)
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
The intent of the code is to replace, for each customer, any value that is more than one standard deviation above the group mean (or below zero) with NaN.
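For example (toy data, made up purely for illustration): for a single customer with the values 1.0, 2.0, 4.0 and 9.0, the mean is 4.0 and the sample standard deviation is about 3.56, so 9.0 exceeds the upper limit of roughly 7.56 and should become NaN:

import numpy as np
import pandas as pd

# Toy data: one customer, float values so the NaN assignment keeps the dtype
df = pd.DataFrame({
    "CustomerName": ["A", "A", "A", "A"],
    "Difference": [1.0, 2.0, 4.0, 9.0],
})
df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
print(df)
#   CustomerName  Difference
# 0            A         1.0
# 1            A         2.0
# 2            A         4.0
# 3            A         NaN   (9.0 > 7.56, replaced with NaN)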
I’ve tried to understand the following article about vectorisation, but I can’t manage to apply it: “How to make your Pandas operation 100x faster” by Yifei Huang.
Answer
Using a custom function in .transform can be very costly because Pandas can’t use its vectorized internal functions on the full dataframe. If you have a lot of groups (i.e. many different customers), then this
upper_limit = (
    df.groupby("CustomerName")["Difference"].transform("mean")
    + df.groupby("CustomerName")["Difference"].transform("std")
)
# Rows outside [0, upper_limit] are dropped by .loc; when the shorter
# series is assigned back, index alignment fills those rows with NaN.
df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
would be significantly faster. (I’ve timed dataframes with about 1,000 – 10,000 customers and 100,000 – 1,000,000 rows.)
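If the index-alignment behaviour of the .loc assignment feels too implicit, an equivalent way to write the last line is Series.where, which keeps values where the condition is True and fills NaN elsewhere (a sketch, not separately timed):

# Series.where makes the NaN replacement explicit instead of relying
# on index alignment during assignment.
df["Difference"] = df["Difference"].where(df["Difference"].between(0, upper_limit))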
Timing setup (test_1 is the solution in the question, test_2 the one here):
import numpy as np
import pandas as pd

from random import seed, choice, randint
from timeit import timeit

def sample_df(num_customers, num_rows, s=123456):
    # Build a random dataframe with the given number of customers and rows
    seed(s)
    customers = [*map(str, range(num_customers))]
    return pd.DataFrame(
        [[choice(customers), randint(-5, 10)] for _ in range(num_rows)],
        columns=["CustomerName", "Difference"],
    )

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + s.std()
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

def test_1(df):
    df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
    return df

def test_2(df):
    upper_limit = (
        df.groupby("CustomerName")["Difference"].transform("mean")
        + df.groupby("CustomerName")["Difference"].transform("std")
    )
    df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
    return df

for num_customers, num_rows in (1_000, 100_000), (5_000, 100_000):
    print(f"Settings: {num_customers = }, {num_rows = }")
    for i in range(1, 3):
        df = sample_df(num_customers, num_rows)
        t = timeit(f"test_{i}(df)", globals=globals(), number=10)
        print(f"test_{i}: {t:.3f} seconds")
Results:
Settings: num_customers = 1000, num_rows = 100000
test_1: 36.102 seconds
test_2: 0.316 seconds
Settings: num_customers = 5000, num_rows = 100000
test_1: 138.889 seconds
test_2: 0.412 seconds
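As a further tweak (a sketch, not included in the timings above): the two .transform calls in test_2 each rebuild the grouping; constructing the groupby object once and reusing it avoids that duplicate work:

def test_2b(df):
    # Reuse one groupby object so the group mapping is computed only once
    g = df.groupby("CustomerName")["Difference"]
    upper_limit = g.transform("mean") + g.transform("std")
    df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
    return df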