```python
def __link_price_at_purchase(row: pd.Series) -> Union[None, float]:
    if (row['fund'] == 'A') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'AA']
    elif (row['fund'] == 'A') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'AB']
    elif (row['fund'] == 'B') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'BA']
    elif (row['fund'] == 'B') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'BB']
    elif (row['fund'] == 'C') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'CA']
    elif (row['fund'] == 'C') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'CB']
    else:
        return 0

df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1).values
```
`df` has 10,000+ rows, so this code is taking a long time. In addition, for each row I’m doing a `df_hist.loc` call to get the value.
I’m trying to speed up this section of code, and the only option I’ve found so far is:
```python
df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1, raw=True).values
```
But this forces me to use positional indexing on the row instead of selecting values by column name:

```python
if (row[0] == 'A') and (row[1] == 'X')
```

which reduces the readability of the code.
I’m looking for an approach that speeds up the code while keeping it readable.
Answer
In Python, there is a certain cost for every attribute or item lookup and every function call, and there is no compiler to optimize any of that away for you.
Here are some general recommendations:
- Try creating a column that combines `fund` and `share_class` without using Python functions, and then merge it with `df_hist`:
```python
# convert history from 'wide' format into 'long' format
hist = (
    df_hist.set_index("date")
    .stack()                        # index becomes (date, price column)
    .rename_axis(["date", "key"])   # name the levels so the join can align on them
    .rename("price")                # the joined Series needs a name
)

prices = (
    # create key column for join, e.g. fund 'A' + class 'X' -> 'AA'
    df.assign(key=df["fund"] + df["share_class"].replace({"X": "A", "Y": "B"}))
    .set_index(["date", "key"])
    .join(hist)  # join by index
)
```
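If you take this route, here is a rough sketch (not tested against your actual data, and assuming `date` is a regular column in both frames, as above) of carrying it through to a `price` column on `df`, using `merge` and `fillna` to reproduce the original `else: return 0` behaviour:

```python
# long-format history: one row per (date, key) pair, as a regular DataFrame
hist_long = (
    df_hist.set_index("date")
    .stack()
    .rename_axis(["date", "key"])
    .rename("price")
    .reset_index()
)

# build the same key on df and merge the prices in
df["key"] = df["fund"] + df["share_class"].replace({"X": "A", "Y": "B"})
df = df.merge(hist_long, on=["date", "key"], how="left")
df["price"] = df["price"].fillna(0)  # missing combinations fall back to 0
```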
- If it’s not trivial to create a `key` column, minimize attribute lookups inside the apply function:
```python
def __link_price(row):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    if fund == 'A' and share_class == 'X':
        return df_hist.loc[date, 'AA']
    ...
```
- Optimize the `if` conditions. For example, you need to check 6 conditions in the case where `(row['fund'] == 'C') and (row['share_class'] == 'Y')`. A lookup table reduces that number to 1:
```python
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    # ... and so on for funds B and C
}

key = fund_and_share_class_to_key.get((fund, share_class))
return df_hist.loc[date, key] if key is not None else 0
```
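As an aside, if writing the mapping out by hand feels error-prone, it can be derived from `df_hist`’s columns. This sketch assumes every price column is named with the fund letter followed by `'A'` for class X or `'B'` for class Y, as in your question:

```python
# derive {(fund, share_class): history column} from the column names themselves
class_letter_to_share_class = {"A": "X", "B": "Y"}

fund_and_share_class_to_key = {
    (col[0], class_letter_to_share_class[col[1]]): col
    for col in df_hist.columns
    if col != "date"  # assuming 'date' is a regular column, as above
}
```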
- Pandas itself is pretty slow for non-vectorized and non-arithmetic operations. In your case it’s better to use plain Python dicts for faster lookups:
```python
# small benchmark
df = pd.DataFrame({"value": [4, 5, 6]})
d = df.to_dict(orient="index")

%timeit df.loc[1, "value"]  # 8.7 µs
%timeit d[1]["value"]       # 50 ns; ~170 times faster
```
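(Note that `%timeit` is an IPython/Jupyter magic. If you want to reproduce the comparison in a plain script, something along these lines should work; the exact numbers will of course differ on your machine.)

```python
import timeit

import pandas as pd

df = pd.DataFrame({"value": [4, 5, 6]})
d = df.to_dict(orient="index")

n = 100_000
print(timeit.timeit(lambda: df.loc[1, "value"], number=n) / n)  # per-call time, DataFrame lookup
print(timeit.timeit(lambda: d[1]["value"], number=n) / n)       # per-call time, plain dict lookup
```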
```python
# convert the dataframe into a dict with the format:
# {<date>: {"AA": <value>, ...}}
history = df_hist.set_index("date").to_dict(orient="index")

def __link_price(row):
    ...
    price = history.get(date, {}).get(key, 0)
    return price
```
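To make the shape of that dict concrete, here is a tiny example; the dates and prices are invented, only the column names come from your question:

```python
import pandas as pd

df_hist = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],  # invented dates
    "AA": [10.0, 10.2],                    # invented prices
    "AB": [20.0, 19.8],
})

history = df_hist.set_index("date").to_dict(orient="index")
# {'2021-01-04': {'AA': 10.0, 'AB': 20.0},
#  '2021-01-05': {'AA': 10.2, 'AB': 19.8}}

print(history["2021-01-04"]["AA"])  # 10.0
```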
- It should be faster to pass `history` as an argument to `apply` rather than looking it up in the non-local scope. It also makes the code cleaner:
```python
def __link_price(row, history):
    ...

df.apply(__link_price, axis=1, args=(history,))
```
To summarize, a faster function would be something like this:
```python
history = df_hist.set_index("date").to_dict(orient="index")

# we don't need to create the mapping on every __link_price call
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    ("B", "X"): "BA",
    ("B", "Y"): "BB",
    ("C", "X"): "CA",
    ("C", "Y"): "CB",
}

def __link_price(row, history, fund_and_share_class_to_key):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    key = fund_and_share_class_to_key.get((fund, share_class))
    return history.get(date, {}).get(key, 0)

df["price"] = df.apply(
    __link_price, axis=1, args=(history, fund_and_share_class_to_key)
)
```
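Finally, a quick way to sanity-check the function above with a couple of invented rows. It reuses `history`, `fund_and_share_class_to_key`, and `__link_price` exactly as defined in the summary block; only the column names and fund/share-class codes come from the question:

```python
import pandas as pd

# invented data, shaped like the frames in the question
df_hist = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05"],
    "AA": [10.0, 10.2], "AB": [20.0, 19.8],
    "BA": [30.0, 30.3], "BB": [40.0, 40.4],
    "CA": [50.0, 50.5], "CB": [60.0, 60.6],
})
df = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-05", "2021-01-05"],
    "fund": ["A", "B", "D"],          # "D" has no mapping, so it should get price 0
    "share_class": ["X", "Y", "X"],
})

# ... run the summary block above, then:
print(df["price"].tolist())  # expected: [10.0, 40.4, 0]
```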