
How to iterate over consecutive chunks of Pandas dataframe efficiently

I have a large dataframe (several million rows).

I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to.

The use case: I want to apply a function to each row via a parallel map in IPython. It doesn’t matter which rows go to which back-end engine, as the function calculates a result based on one row at a time. (Conceptually at least; in reality it’s vectorized.)

I’ve come up with something like this:

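A sketch of that approach (the identifiers dataframe, max_idx, and tenths are illustrative, and the toy frame stands in for the real multi-million-row one):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame
dataframe = pd.DataFrame(np.random.rand(25, 3))

# Scale the integer index into 10 buckets...
max_idx = dataframe.index.max()
tenths = ((10 * dataframe.index) / (1 + max_idx)).astype(np.uint32)

# ...then pull each bucket out with a boolean mask
chunks = [dataframe.loc[tenths == i] for i in range(10)]
```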

But this seems very long-winded, and it doesn’t guarantee equal-sized chunks, especially if the index is sparse, non-integer, or otherwise irregular.

Any suggestions for a better way?

Thanks!


Answer

In practice, you can’t guarantee equal-sized chunks: the number of rows (N) might be prime, in which case the only equal-sized chunks would have size 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end. I tend to pass an array to groupby. Starting from:

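A minimal sketch of that setup (the exact frame is illustrative; any data works, the point is the all-zero index):

```python
import numpy as np
import pandas as pd

# 15 rows of random data; every index label is 0, so the index carries no information
df = pd.DataFrame(np.random.rand(15, 3), index=[0] * 15)
```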

where I’ve deliberately made the index uninformative by setting every label to 0, we simply pick a chunk size (here 10) and integer-divide a positional array by it:

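Continuing from the df above (a sketch: np.arange(len(df)) // 10 labels rows 0–9 with group 0 and rows 10–14 with group 1, and groupby splits on those labels):

```python
size = 10  # chosen chunk size

for _, chunk in df.groupby(np.arange(len(df)) // size):
    # the first chunk has 10 rows, the final one the remaining 5
    print(chunk.shape)
```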

Methods based on slicing the DataFrame can fail when the index isn’t compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
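For example, a purely positional version of the same chunking (a sketch reusing df and size from above, so no new imports are needed), which works no matter what the index contains:

```python
# Slice by row position, not by index label; the last chunk may be short
chunks = [df.iloc[i:i + size] for i in range(0, len(df), size)]
```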
