I’m trying to speed up or vectorise the following slow code:
import numpy as np

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + (s.std() * 1)
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
The intent of the code is to remove any values that lie more than one standard deviation above the group mean and replace them with NaN.
I’ve tried to understand the following article about vectorisation, but I can’t manage to apply it: “How to make your Pandas operation 100x faster” by Yifei Huang.
Answer
Using a custom function in .transform can be very costly because Pandas can’t use its vectorized internal functions on the full dataframe; the Python function is called once per group instead. If you have a lot of groups (i.e. many different customers), then this
upper_limit = (
    df.groupby("CustomerName")["Difference"].transform("mean")
    + df.groupby("CustomerName")["Difference"].transform("std")
)
df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
would be significantly faster. (I’ve timed dataframes with about 1,000 – 10,000 customers and 100,000 – 1,000,000 rows.)
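In case the last line looks odd: assigning the filtered selection back to the column relies on pandas’ index alignment, so every row that falls outside the selection ends up as NaN. A toy illustration with made-up data:

import pandas as pd

toy = pd.DataFrame({"x": [1, 5, 2]})
# Only rows with x < 3 survive the selection; on assignment, pandas aligns
# on the index, so the unselected row becomes NaN (and the dtype turns float).
toy["x"] = toy.loc[toy["x"] < 3, "x"]
print(toy)  # x: 1.0, NaN, 2.0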
Timing setup (test_1 is the solution in the question, test_2 the one here):
import numpy as np
import pandas as pd
from random import seed, choice, randint
from timeit import timeit

def sample_df(num_customers, num_rows, s=123456):
    seed(s)
    customers = [*map(str, range(num_customers))]
    return pd.DataFrame(
        [[choice(customers), randint(-5, 10)] for _ in range(num_rows)],
        columns=["CustomerName", "Difference"]
    )

def is_outlier(s):
    lower_limit = 0
    upper_limit = s.mean() + s.std()
    valid = s.between(lower_limit, upper_limit)
    s[~valid] = np.nan
    return s

def test_1(df):
    df["Difference"] = df.groupby("CustomerName")["Difference"].transform(is_outlier)
    return df

def test_2(df):
    upper_limit = (
        df.groupby("CustomerName")["Difference"].transform("mean")
        + df.groupby("CustomerName")["Difference"].transform("std")
    )
    df["Difference"] = df.loc[df["Difference"].between(0, upper_limit), "Difference"]
    return df

for num_customers, num_rows in (1_000, 100_000), (5_000, 100_000):
    print(f"Settings: {num_customers = }, {num_rows = }")
    for i in range(1, 3):
        df = sample_df(num_customers, num_rows)
        t = timeit(f"test_{i}(df)", globals=globals(), number=10)
        print(f"test_{i}: {t:.3f} seconds")
Results:
Settings: num_customers = 1000, num_rows = 100000
test_1: 36.102 seconds
test_2: 0.316 seconds
Settings: num_customers = 5000, num_rows = 100000
test_1: 138.889 seconds
test_2: 0.412 seconds
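One more tweak I haven’t timed (my own variation, not part of the benchmark above): the groupby can be computed once and reused for both aggregations, and Series.where does the same NaN-masking without the .loc assignment:

g = df.groupby("CustomerName")["Difference"]
upper_limit = g.transform("mean") + g.transform("std")
# where() keeps values where the condition holds and fills in NaN elsewhere
df["Difference"] = df["Difference"].where(df["Difference"].between(0, upper_limit))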