Trying to optimize a Python for loop for large data set (4 million rows)

I’m new to Python and still getting used to these really handy implicit array/list operations, so please bear with me. I’ve completed a proof-of-concept version of the code (120 combinations), but as expected it slows down significantly when run against the full dataset (4 million combinations). The current slowdown is in the following for loop:

for i in Combinations:
    PropList = function.CalculateTotalPoints(ItemDict, i)
    CombinationSets.loc[Combinations.index(i), 'Density':] = PropList

I’m trying not to make this a wall of text while still providing enough information to understand the goal.

  • Combinations – This is a dataframe that holds all the unique combinations of 4 items (4 million rows). I end up deleting it after the loop, when it is no longer needed.
  • PropList – This is the returned list of the cumulative properties of the 4 items in the specific row.
  • ItemDict – This is a dataframe that contains all the property values for the individual items.
  • CombinationSets – This is the “final” dataframe that holds the 4 unique items from Combinations with the appended PropList from the CalculateTotalPoints function.
  • CalculateTotalPoints – This function accepts the list of 4 items (i) and a reference to the dataframe with all the property values (ItemDict). It really just does a bunch of simple addition and division operations, with some logic thrown in, and returns a list of the various cumulative properties (a minimal sketch of how these pieces fit together follows below).
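
For reference, a minimal, hypothetical sketch of how these pieces might fit together. The column names, property values, and the body of CalculateTotalPoints are made up for illustration only; Combinations is shown as a plain list of 4-item tuples so that the Combinations.index(i) lookup in the loop above works.

import pandas as pd
from itertools import combinations

# Hypothetical item property table standing in for ItemDict.
ItemDict = pd.DataFrame(
    {"Item": ["A", "B", "C", "D", "E", "F"],
     "Density": [1.0, 2.0, 1.5, 0.5, 3.0, 2.5],
     "Points": [10, 20, 15, 5, 30, 25]},
).set_index("Item")

# All unique 4-item combinations (15 here, ~4 million in the real dataset).
Combinations = list(combinations(ItemDict.index, 4))

def CalculateTotalPoints(item_dict, items):
    # Hypothetical stand-in: simple addition/division over the 4 items.
    rows = item_dict.loc[list(items)]
    return [rows["Density"].mean(), rows["Points"].sum()]

# "Final" dataframe: the 4 items plus placeholder columns from 'Density' onward
# that the loop above fills in row by row.
CombinationSets = pd.DataFrame(Combinations, columns=["Item1", "Item2", "Item3", "Item4"])
CombinationSets["Density"] = 0.0
CombinationSets["Points"] = 0.0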

I’m sure there are plenty of “not great Python” things happening here, but my current thoughts are:

  • Python may have a better/faster way to implement this loop than an explicit for loop (a list comprehension?)
  • ItemDict isn’t terribly large, but maybe passing it to the function on every iteration is unnecessary overhead
  • Since each row of the CombinationSets dataframe is independent of the others, I should try multiprocessing so multiple rows can be processed at once (see the sketch after this list)
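
On the multiprocessing idea, a minimal sketch of one way it might look, assuming Combinations is an iterable of 4-item tuples, ItemDict is the dataframe described above, and CalculateTotalPoints lives in an importable module called function (as in the loop above) so it can be pickled and sent to worker processes:

from functools import partial
from multiprocessing import Pool

import function  # module providing CalculateTotalPoints, as in the loop above

if __name__ == "__main__":
    # ItemDict and Combinations are the objects described above.
    # Bind ItemDict once so each task only carries the combination itself.
    worker = partial(function.CalculateTotalPoints, ItemDict)

    with Pool() as pool:
        # chunksize batches tasks so inter-process overhead stays reasonable
        # when there are millions of combinations.
        prop_lists = pool.map(worker, Combinations, chunksize=10_000)

One caveat: because the partial carries ItemDict, it gets pickled and shipped along with each chunk of work; if ItemDict were large, a Pool initializer that loads it once per worker process would be the usual alternative.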

Edit 1: I couldn’t figure out how to isolate the code enough to make a small example problem. Not ideal for post longevity, but I’ve added the main code up to the for loop, the support functions, and sample data to a GitHub Gist (Removed).

Edit 2: From what I’ve been reading elsewhere, it seems that explicit for loops are generally not the preferred way of looping in Python. I haven’t been able to vectorize my function yet. However, I did look into the timing of this for loop, the function (which has a for loop inside), and the assignment of PropList. Whether running a small (120 combinations) or a large (4.5 million) set, the for loop and the function have similar (pretty fast) performance. In both cases, the line below takes the longest; for the large dataset it takes ~1 second.

CombinationSets.loc[Combinations.index(i), 'Density':] = PropList

Edit 3: I was able to reduce the time of the loop by building a list of the returned PropList values instead of performing the merge operation (CombinationSets.loc[...]) inside the loop. Once the for loop exits, I perform the merge operation all at once. As it is currently set up, a single run through the for loop takes ~0.004 s vs. the ~1 s it took before. There are likely additional optimizations that could be made, but I’ll save those for a separate post.
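
A sketch of that restructuring, assuming the iteration order of Combinations matches the row order of CombinationSets (as it does with the Combinations.index(i) lookup above) and using placeholder column names:

import pandas as pd

# Collect each returned PropList in a plain Python list inside the loop
# (written here as a list comprehension).
prop_lists = [function.CalculateTotalPoints(ItemDict, i) for i in Combinations]

# Build the property columns once, aligned to CombinationSets, and attach
# them in a single assignment instead of one .loc write per row.
prop_df = pd.DataFrame(prop_lists,
                       columns=["Density", "Points"],  # placeholder names
                       index=CombinationSets.index)
CombinationSets[prop_df.columns] = prop_df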


Answer

Move the list-to-dataframe merge operation outside of the for loop. See Edit 3.
