
How to speed up pandas transform function?

I’m trying to speed up or vectorise the following slow code:

    def is_outlier(s):
        lower_limit = 0
        upper_limit = s.mean() + (s.std() * 1)
        valid = s.between(lower_limit, upper_limit)
        s[~valid] = np.nan
        return s

    df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)

The intent of the code is to find, per customer, any values more than one standard deviation above the mean (or below zero) and replace them with NaN.

I’ve tried to understand the following article about vectorisation, but I can’t manage to apply it here: *How to make your Pandas operation 100x faster* by Yifei Huang.


Answer

Using a custom function in `.transform` can be very costly because Pandas can’t use its vectorized internal implementations and instead has to call your Python function once per group. If you have a lot of groups (i.e. different customers), then this

upper_limit = (
    df.groupby("CustomerName")["Difference"].transform("mean")
    + df.groupby("CustomerName")["Difference"].transform("std")
)
df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]

would be significantly faster. (I’ve timed dataframes with about 1,000 – 10,000 customers and 100,000 – 1,000,000 rows.)
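As a small variant (not in the original answer), the two `groupby` calls above can share one `GroupBy` object, and `Series.where` makes the NaN replacement explicit rather than relying on index alignment during column assignment:

```python
import pandas as pd

# Small illustrative frame (hypothetical data, not the questioner's)
df = pd.DataFrame({
    "CustomerName": ["A", "A", "A", "B", "B"],
    "Difference": [1, 2, 100, 3, 4],
})

# Build the GroupBy object once and reuse it for both aggregations
g = df.groupby("CustomerName")["Difference"]
upper_limit = g.transform("mean") + g.transform("std")

# where() keeps values inside [0, upper_limit] and sets everything else to NaN
df["Difference"] = df["Difference"].where(df["Difference"].between(0, upper_limit))
```

For customer "A" the limit is roughly 34.3 + 56.9 ≈ 91.2, so the 100 becomes NaN while the other rows survive.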


Timing setup (test_1 is the solution in the question, test_2 the one here):

from random import seed, choice, randint
from timeit import timeit

import numpy as np
import pandas as pd

def sample_df(num_customers, num_rows, s=123456):
    seed(s)
    customers = [*map(str, range(num_customers))]
    return pd.DataFrame(
        [[choice(customers), randint(-5, 10)] for _ in range(num_rows)],
        columns=["CustomerName", "Difference"]
    )

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + s.std()
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

def test_1(df):
    df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
    return df

def test_2(df):
    upper_limit = (
        df.groupby("CustomerName")["Difference"].transform("mean")
        + df.groupby("CustomerName")["Difference"].transform("std")
    )
    df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
    return df

for num_customers, num_rows in (1_000, 100_000), (5_000, 100_000):
    print(f"Settings: {num_customers = }, {num_rows = }")
    for i in range(1, 3):
        df = sample_df(num_customers, num_rows)
        t = timeit(f"test_{i}(df)", globals=globals(), number=10)
        print(f"test_{i}: {t:.3f} seconds")

Results on my machine:

Settings: num_customers = 1000, num_rows = 100000
test_1: 36.102 seconds
test_2: 0.316 seconds
Settings: num_customers = 5000, num_rows = 100000
test_1: 138.889 seconds
test_2: 0.412 seconds
User contributions licensed under: CC BY-SA