I’m trying to speed up or vectorise the following slow code:
import numpy as np

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + (s.std() * 1)
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
The intent of the code is to remove any values that lie more than one standard deviation above the group mean and replace them with NaN.
I’ve tried to understand the following article about vectorisation, but I can’t manage to apply it: “How to make your Pandas operation 100x faster” by Yifei Huang.
Answer
Using a custom function in .transform can be very costly because Pandas can’t use its vectorized internal functions on the full dataframe; the Python function is called once per group instead. If you have a lot of groups (i.e. many different customers), then this
upper_limit = (
    df.groupby("CustomerName")["Difference"].transform("mean")
    + df.groupby("CustomerName")["Difference"].transform("std")
)
df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
would be significantly faster. (I’ve timed dataframes with about 1,000 – 10,000 customers and 100,000 – 1,000,000 rows.)
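In case the last line looks odd: assigning the filtered selection back to the column relies on pandas’ index alignment, so every row that falls outside the selection ends up as NaN. A toy illustration with made-up data:

import pandas as pd

toy = pd.DataFrame({"x": [1, 5, 2]})
# Only rows with x < 3 survive the selection; on assignment, pandas aligns
# on the index, so the unselected row becomes NaN (and the dtype turns float).
toy["x"] = toy.loc[toy["x"] < 3, "x"]
print(toy)  # x: 1.0, NaN, 2.0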
Timing setup (test_1 is the solution in the question, test_2 the one here):
import numpy as np
import pandas as pd
from random import seed, choice, randint
from timeit import timeit

def sample_df(num_customers, num_rows, s=123456):
    seed(s)
    customers = [*map(str, range(num_customers))]
    return pd.DataFrame(
        [[choice(customers), randint(-5, 10)] for _ in range(num_rows)],
        columns=["CustomerName", "Difference"]
    )

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + s.std()
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

def test_1(df):
    df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
    return df

def test_2(df):
    upper_limit = (
        df.groupby("CustomerName")["Difference"].transform("mean")
        + df.groupby("CustomerName")["Difference"].transform("std")
    )
    df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
    return df

for num_customers, num_rows in (1_000, 100_000), (5_000, 100_000):
    print(f"Settings: {num_customers = }, {num_rows = }")
    for i in range(1, 3):
        df = sample_df(num_customers, num_rows)
        t = timeit(f"test_{i}(df)", globals=globals(), number=10)
        print(f"test_{i}: {t:.3f} seconds")
Results:
Settings: num_customers = 1000, num_rows = 100000
test_1: 36.102 seconds
test_2: 0.316 seconds
Settings: num_customers = 5000, num_rows = 100000
test_1: 138.889 seconds
test_2: 0.412 seconds
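One more tweak I haven’t timed (my own variation, not part of the benchmark above): the groupby can be computed once and reused for both aggregations, and Series.where does the same NaN-masking without the .loc assignment:

g = df.groupby("CustomerName")["Difference"]
upper_limit = g.transform("mean") + g.transform("std")
# where() keeps values where the condition holds and fills in NaN elsewhere
df["Difference"] = df["Difference"].where(df["Difference"].between(0, upper_limit))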