
How to speed up successive pd.apply with successive pd.DataFrame.loc calls?

df has 10,000+ rows, so the row-wise apply takes a long time. In addition, for each row I’m doing a df_hist.loc call to get the value.

I’m trying to speed up this section of code. The option I’ve found so far forces me to use index-based selection for row instead of value selection, which reduces the readability of the code.
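
One option with exactly this trade-off is `DataFrame.apply(..., raw=True)`; whether that is the option meant here is an assumption. The sketch below, on made-up data, contrasts label-based and positional row access, and shows how resolving column positions by name once with `df.columns.get_loc` might recover some readability. The sample values and the `is_c_y_*` helpers are hypothetical.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df = pd.DataFrame({"fund": ["A", "C"], "share_class": ["X", "Y"]})

def is_c_y_label(row):
    # row is a pandas Series: readable label access, but costly to build per row.
    return int(row["fund"] == "C" and row["share_class"] == "Y")

# Resolve column positions once, by name, outside the per-row function.
FUND = df.columns.get_loc("fund")
SHARE_CLASS = df.columns.get_loc("share_class")

def is_c_y_raw(row):
    # row is a plain NumPy array (raw=True): positional access only.
    return int(row[FUND] == "C" and row[SHARE_CLASS] == "Y")

by_label = df.apply(is_c_y_label, axis=1)
by_position = df.apply(is_c_y_raw, axis=1, raw=True)
```

Both calls should give the same result; the second avoids creating a Series for every row.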

I’m looking for an approach that both speeds up the code and still allows for readability of the code.


Answer

In Python, every attribute lookup, item lookup, and function call has a cost, and there is no compiler to optimize them away for you.

Here are some general recommendations:

  1. Try creating a column that combines fund and share_class without using Python functions, and then merge it with df_hist.
  2. If it’s not trivial to create a key column, minimize attribute lookups inside the apply function.
  3. Optimize the if conditions. For example, you need to check 6 conditions in the case where (row['fund'] == 'C') and (row['share_class'] == 'Y'). You can reduce this number to … 1.
  4. Pandas itself is pretty slow for non-vectorized and non-arithmetic operations. In your case it’s better to use standard Python dicts for faster lookups.
  5. It should be faster to pass history as an apply argument rather than look it up in the non-local scope. It also makes the code cleaner.
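
The first recommendation can be sketched roughly as follows. The layout of `df_hist` (columns `key`, `date`, `value`), the `"_"` separator, and all sample values are assumptions made for illustration, not details taken from the question.

```python
import pandas as pd

# Made-up sample data; the real column layout is an assumption.
df = pd.DataFrame({
    "fund": ["A", "C", "C"],
    "share_class": ["X", "Y", "X"],
    "date": ["2020-01-31", "2020-01-31", "2020-02-29"],
})
df_hist = pd.DataFrame({
    "key": ["A_X", "C_Y", "C_X"],
    "date": ["2020-01-31", "2020-01-31", "2020-02-29"],
    "value": [1.0, 2.0, 3.0],
})

# Build the key with vectorized string concatenation (no Python-level function per row).
df["key"] = df["fund"] + "_" + df["share_class"]

# A single left merge replaces one df_hist.loc call per row.
merged = df.merge(df_hist, on=["key", "date"], how="left")
```

With `how="left"`, rows of `df` that have no match in `df_hist` are kept and get `NaN` in `value`.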

To summarize, a faster function would be something like this:
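
A minimal sketch of such a function, assuming a hypothetical `df_hist` layout (`key`, `date`, `value`) and a hypothetical mapping from (fund, share_class) to history key. The names `build_history`, `KEY_BY_FUND_CLASS`, and `get_value` are illustrative only.

```python
import pandas as pd

def build_history(df_hist):
    """Convert df_hist into a plain dict keyed by (key, date) for fast lookups."""
    return dict(zip(zip(df_hist["key"], df_hist["date"]), df_hist["value"]))

# Hypothetical mapping: one dict lookup replaces a chain of if conditions.
KEY_BY_FUND_CLASS = {("A", "X"): "A_X", ("C", "Y"): "C_Y"}

def get_value(row, history, default=float("nan")):
    # history arrives as an argument, so it is a local name inside the function.
    key = KEY_BY_FUND_CLASS.get((row["fund"], row["share_class"]))
    return history.get((key, row["date"]), default)

# Made-up sample data.
df = pd.DataFrame({
    "fund": ["A", "C", "B"],
    "share_class": ["X", "Y", "Z"],
    "date": ["2020-01-31", "2020-01-31", "2020-01-31"],
})
df_hist = pd.DataFrame({
    "key": ["A_X", "C_Y"],
    "date": ["2020-01-31", "2020-01-31"],
    "value": [1.0, 2.0],
})

history = build_history(df_hist)
# args forwards history as the second positional argument of get_value.
df["value"] = df.apply(get_value, axis=1, args=(history,))
```

Rows whose fund and share class have no entry in the mapping, or no matching history record, fall back to `default`.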
