Skip to content
Advertisement

How to speed up successive pd.apply with successive pd.DataFrame.loc calls?

def __link_price(row: pd.Series) -> Union[None, float]:
    if (row['fund'] == 'A') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'AA']
    elif (row['fund'] == 'A') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'AB']
    elif (row['fund'] == 'B') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'BA']
    elif (row['fund'] == 'B') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'BB']
    elif (row['fund'] == 'C') and (row['share_class'] == 'X'):
        return df_hist.loc[row['date'], 'CA']
    elif (row['fund'] == 'C') and (row['share_class'] == 'Y'):
        return df_hist.loc[row['date'], 'CB']
    else:
        return 0

df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1).values

df has 10,000+ lines, so this code is taking a long time. In addition for each row, I’m doing a df_hist.loc call to get the value.

I’m trying to speed up this section of code and then option I’ve found so far is using:

df.loc[:, 'price'] = df.apply(__link_price_at_purchase, axis=1, raw=True).values

But this forces me to use index based selection for row instead of value selection:

if (row[0] == 'A') and (row[1] == 'X')

which reduces the readability of the code.

I’m looking for an approach that both speeds up the code and still allows for readability of the code.

Advertisement

Answer

In python, there is a certain cost for each attribute or item lookup and function call. And you don’t have a compiler that optimizes things for you.

Here are some general recommendations:

  1. Try creating a column that includes fund and share_class without using Python functions and then merge it with df_hist
# convert history from 'wide' format into 'long' format
hist = df_hist.set_index("date").stack()
prices = (
    # create key column for join
    df.assign(key=df["fund"] + df["share_class"].replace({"X": "A", "Y": "B"}))
    .set_index(["date", "key"])
    .join(hist)  # join by index
)
  1. If it’s not trivial to create a key column, minimize attribute lookups inside the apply function:
def __link_price(row):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    
    if fund == 'A' and share_class == 'X':
        return df_hist.loc[date, 'AA']
    ...
  1. Optimize if conditions. For example, you need to check 6 conditions in case where (row['fund'] == 'C') and (row['share_class'] == 'Y'). You can reduce this number to … 1.
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    ...
}

key = fund_and_share_class_to_key.get((fund, share_class))
return df_hist.loc[date, key] if key is not None else 0
  1. Pandas itself is pretty slow for non-vectorized and non-arithmetic operations. In your case it’s better to use standard python dicts for faster lookups.
# small benchmark
df = pd.DataFrame({"value": [4,5,6]})
d = df.to_dict(orient="index")
%timeit df.loc[1, "value"]  # 8.7ms
%timeit d[1]["value"]       # 50ns; ~170 times faster
# convert dataframe into the dict with format:
# {<date>: {"AA": <value>}}
history = df_hist.set_index("date").to_dict(orient="index")

def __link_price(row):
    ...
    price = history.get(date, {}).get(key, 0)
    return price
  1. It should be faster to pass history as an apply argument rather than search it in the non-local scope. It also makes the code cleaner.
def __link_price(row, history):
    ...
df.apply(__link_price, args=(history, ))

To summarize, a faster function would be something like this:

history = df_hist.set_index("date").to_dict(orient="index")
# we don't need to create a mapping on every __link_price call
fund_and_share_class_to_key = {
    ("A", "X"): "AA",
    ("A", "Y"): "AB",
    ...
}
def __link_price(row, history, fund_and_share_class_to_key):
    date, fund, share_class = row[["date", "fund", "share_class"]]
    key = fund_and_share_class_to_key.get((fund, share_class))
    return history.get(date, {}).get(key, 0)

df.apply(__link_price, args=(history, fund_and_share_class_to_key))
Advertisement