How to iterate over consecutive chunks of a Pandas DataFrame efficiently

Question

I have a large dataframe (several million rows). I want to be able to do a groupby operation on it, but just grouping by arbitrary consecutive (preferably equal-sized) subsets of rows, rather than using any particular property of the individual rows to decide which group they go to. The use case: I want to apply a function to each row

Accepted Answer

In practice, you can't guarantee equal-sized chunks. The number of rows (N) might be prime, in which case you could only get equal-sized chunks with a chunk size of 1 or N. Because of this, real-world chunking typically uses a fixed size and allows for a smaller chunk at the end.

I tend to pass an array to groupby. Starting from:

    >>> import numpy as np
    >>> import pandas as pd
    >>> df = pd.DataFrame(np.random.rand(15, 5), index=[0]*15)
    >>> df[0] = range(15)
    >>> df
        0         1         2         3         4
    0   0  0.746300  0.346277  0.220362  0.172680
    0   1  0.657324  0.687169  0.384196  0.214118
    0   2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  13  0.510273  0.051608  0.230402  0.756921
    0  14  0.950544  0.576539  0.642602  0.907850

    [15 rows x 5 columns]

where I've deliberately made the index uninformative by setting it to 0, we simply decide on our size (here 10) and integer-divide an array by it:

    >>> df.groupby(np.arange(len(df))//10)
    <pandas.core.groupby.DataFrameGroupBy object at 0xb208492c>
    >>> for k,g in df.groupby(np.arange(len(df))//10):
    ...     print(k,g)
    ...
    0    0         1         2         3         4
    0  0  0.746300  0.346277  0.220362  0.172680
    0  1  0.657324  0.687169  0.384196  0.214118
    0  2  0.016062  0.858784  0.236364  0.963389
    [...]
    0  8  0.241049  0.246149  0.241935  0.563428
    0  9  0.493819  0.918858  0.193236  0.266257

    [10 rows x 5 columns]
    1     0         1         2         3         4
    0  10  0.037693  0.370789  0.369117  0.401041
    0  11  0.721843  0.862295  0.671733  0.605006
    [...]
    0  14  0.950544  0.576539  0.642602  0.907850

    [5 rows x 5 columns]

Methods based on slicing the DataFrame can fail when the index isn't compatible with that, although you can always use .iloc[a:b] to ignore the index values and access data by position.
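
The positional slicing mentioned at the end of the answer can be sketched as a small generator. This is only an illustration: the helper name iter_chunks and the chunk size of 10 are assumptions made here, not part of the original answer. It rebuilds the same kind of 15-row frame with an all-zero index and checks that the .iloc slices have the same sizes as the groupby chunks.

```python
import numpy as np
import pandas as pd


def iter_chunks(df, chunk_size):
    """Yield consecutive positional slices of df; the last one may be shorter.

    iter_chunks is a hypothetical helper name used for illustration.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(df), chunk_size):
        # .iloc slices by position, so duplicate index labels do not matter.
        yield df.iloc[start:start + chunk_size]


# Same shape as the answer's example: 15 rows, every index label set to 0.
df = pd.DataFrame(np.random.rand(15, 5), index=[0] * 15)
df[0] = range(15)

# Sizes of the positional slices, and the first value of column 0 in each.
chunk_lengths = [len(chunk) for chunk in iter_chunks(df, 10)]
first_values = [int(chunk[0].iloc[0]) for chunk in iter_chunks(df, 10)]

# The groupby approach from the answer, for comparison.
grouped_lengths = [len(g) for _, g in df.groupby(np.arange(len(df)) // 10)]
```

Both approaches give one chunk of 10 rows followed by a final chunk of 5. The generator avoids building the grouping array, while groupby may be more convenient if you want to follow up with an aggregation per chunk.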

Answer