```python
def __link_price_at_purchase(row: pd.Series) -> Union[None, float]:
    if (row['fund'] == 'A') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'AA']
    elif (row['fund'] == 'A') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'AB']
    elif (row['fund'] == 'B') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'BA']
    elif (row['fund'] == 'B') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'BB']
    elif (row['fund'] == 'C') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'CA']
    elif (row['fund'] == 'C') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'CB']
    else:
        return 0

df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1).values
```
`df` has 10,000+ rows, so this code is taking a long time. In addition, for each row I’m doing a `df_hist.loc` call to get the value.
I’m trying to speed up this section of code, and the only option I’ve found so far is:
```python
df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1, raw=True).values
```
But this forces me to use positional indexing on the row instead of selecting values by column name:

```python
if (row[0] == 'A') and (row[1] == 'X')
```

which reduces the readability of the code.
I’m looking for an approach that speeds up the code while keeping it readable.
Answer
In Python, there is a certain cost for every attribute or item lookup and every function call, and there is no compiler to optimize any of that away for you.
Here are some general recommendations:
- Try creating a column that combines `fund` and `share_class` without using Python functions, and then merge it with `df_hist`:
```python
# convert history from 'wide' format into 'long' format
hist = (
    df_hist.set_index("date")
    .stack()                        # index becomes (date, price column)
    .rename_axis(["date", "key"])   # name the levels so the join can align on them
    .rename("price")                # the joined Series needs a name
)

prices = (
    # create key column for join, e.g. fund 'A' + class 'X' -> 'AA'
    df.assign(key=df["fund"] + df["share_class"].replace({"X": "A", "Y": "B"}))
    .set_index(["date", "key"])
    .join(hist)  # join by index
)
```
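If you take this route, here is a rough sketch (not tested against your actual data, and assuming `date` is a regular column in both frames, as above) of carrying it through to a `price` column on `df`, using `merge` and `fillna` to reproduce the original `else: return 0` behaviour:

```python
# long-format history: one row per (date, key) pair, as a regular DataFrame
hist_long = (
    df_hist.set_index("date")
    .stack()
    .rename_axis(["date", "key"])
    .rename("price")
    .reset_index()
)

# build the same key on df and merge the prices in
df["key"] = df["fund"] + df["share_class"].replace({"X": "A", "Y": "B"})
df = df.merge(hist_long, on=["date", "key"], how="left")
df["price"] = df["price"].fillna(0)  # missing combinations fall back to 0
```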
- If it’s not trivial to create a `key` column, minimize attribute lookups inside the apply function:
```python
def __link_price(row):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    if fund == 'A' and share_class == 'X':
        return df_hist.loc[date, 'AA']
    ...
```
- Optimize the `if` conditions. For example, you need to check 6 conditions in the case where `(row['fund'] == 'C') and (row['share_class'] == 'Y')`. A lookup table reduces that number to 1:
```python
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    # ... and so on for funds B and C
}

key = fund_and_share_class_to_key.get((fund, share_class))
return df_hist.loc[date, key] if key is not None else 0
```
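As an aside, if writing the mapping out by hand feels error-prone, it can be derived from `df_hist`’s columns. This sketch assumes every price column is named with the fund letter followed by `'A'` for class X or `'B'` for class Y, as in your question:

```python
# derive {(fund, share_class): history column} from the column names themselves
class_letter_to_share_class = {"A": "X", "B": "Y"}

fund_and_share_class_to_key = {
    (col[0], class_letter_to_share_class[col[1]]): col
    for col in df_hist.columns
    if col != "date"  # assuming 'date' is a regular column, as above
}
```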
- Pandas itself is pretty slow for non-vectorized and non-arithmetic operations. In your case it’s better to use plain Python dicts for faster lookups:
```python
# small benchmark
df = pd.DataFrame({"value": [4, 5, 6]})
d = df.to_dict(orient="index")

%timeit df.loc[1, "value"]  # 8.7 µs
%timeit d[1]["value"]       # 50 ns; ~170 times faster
```
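(Note that `%timeit` is an IPython/Jupyter magic. If you want to reproduce the comparison in a plain script, something along these lines should work; the exact numbers will of course differ on your machine.)

```python
import timeit

import pandas as pd

df = pd.DataFrame({"value": [4, 5, 6]})
d = df.to_dict(orient="index")

n = 100_000
print(timeit.timeit(lambda: df.loc[1, "value"], number=n) / n)  # per-call time, DataFrame lookup
print(timeit.timeit(lambda: d[1]["value"], number=n) / n)       # per-call time, plain dict lookup
```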
```python
# convert the dataframe into a dict with the format:
# {<date>: {"AA": <value>, ...}}
history = df_hist.set_index("date").to_dict(orient="index")

def __link_price(row):
    ...
    price = history.get(date, {}).get(key, 0)
    return price
```
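To make the shape of that dict concrete, here is a tiny example; the dates and prices are invented, only the column names come from your question:

```python
import pandas as pd

df_hist = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],  # invented dates
    "AA": [10.0, 10.2],                    # invented prices
    "AB": [20.0, 19.8],
})

history = df_hist.set_index("date").to_dict(orient="index")
# {'2021-01-04': {'AA': 10.0, 'AB': 20.0},
#  '2021-01-05': {'AA': 10.2, 'AB': 19.8}}

print(history["2021-01-04"]["AA"])  # 10.0
```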
- It should be faster to pass `history` as an argument to `apply` rather than looking it up in the non-local scope. It also makes the code cleaner:
```python
def __link_price(row, history):
    ...

df.apply(__link_price, axis=1, args=(history,))
```
To summarize, a faster function would be something like this:
```python
history = df_hist.set_index("date").to_dict(orient="index")

# we don't need to create the mapping on every __link_price call
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    ("B", "X"): "BA",
    ("B", "Y"): "BB",
    ("C", "X"): "CA",
    ("C", "Y"): "CB",
}

def __link_price(row, history, fund_and_share_class_to_key):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    key = fund_and_share_class_to_key.get((fund, share_class))
    return history.get(date, {}).get(key, 0)

df["price"] = df.apply(
    __link_price, axis=1, args=(history, fund_and_share_class_to_key)
)
```
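Finally, a quick way to sanity-check the function above with a couple of invented rows. It reuses `history`, `fund_and_share_class_to_key`, and `__link_price` exactly as defined in the summary block; only the column names and fund/share-class codes come from the question:

```python
import pandas as pd

# invented data, shaped like the frames in the question
df_hist = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],
    "AA": [10.0, 10.2], "AB": [20.0, 19.8],
    "BA": [30.0, 30.3], "BB": [40.0, 40.4],
    "CA": [50.0, 50.5], "CB": [60.0, 60.6],
})
df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-01-05"],
    "fund": ["A", "B", "D"],          # "D" has no mapping, so it should get price 0
    "share_class": ["X", "Y", "X"],
})

# ... run the summary block above, then:
print(df["price"].tolist())  # expected: [10.0, 40.4, 0]
```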