Python fast DataFrame concatenation

Question

I wrote code that concatenates parts of a DataFrame onto the same DataFrame, in order to normalize how often rows occur with respect to a certain column, and it is unbelievably slow. Is there a fast way to concatenate DataFrames without creating copies?

Accepted Answer

There are a couple of things that stand out.

To begin with, the loop

    i = 0
    while i < (max_count // int(counts[tag])):
        array = pandas.concat([array, group])
        i += 1

is going to be very slow. Pandas is not built for these dynamic concatenations: each call to concat builds a new DataFrame and copies everything accumulated so far, so I suspect the performance is quadratic for what you're doing. Instead, perhaps you could try

    pandas.concat([group] * (max_count // int(counts[tag])))

which just creates a list first, and then calls concat once on the entire list. This should bring the complexity down to linear, and I suspect it will have lower constants in any case.

Another thing that would reduce these small concats is calling groupby-apply. Instead of iterating over the result of groupby, write the loop body as a function and call apply on it. Let Pandas figure out how best to concat all of the results into a single DataFrame.

However, even if you prefer to keep the loop, I'd just append things to a list and concat everything at the end:

    stuff = []
    for tag, group in data.groupby(expectation, sort=False):
        # Call stuff.append for any DataFrame you were going to concat,
        # e.g. the group itself.
        stuff.append(group)
    pandas.concat(stuff)
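The question's full code is not shown, so the following is only a minimal sketch of the collect-in-a-list, concat-once pattern described above. The sample data, the column names "tag" and "value", and the way counts and max_count are derived are all assumptions made for illustration.

```python
import pandas

# Hypothetical sample data; the column names are assumptions, not the asker's.
data = pandas.DataFrame({
    "tag": ["a", "a", "a", "a", "b", "b", "c"],
    "value": [1, 2, 3, 4, 5, 6, 7],
})

# Assumed definitions: number of rows per tag, and the largest such count.
counts = data["tag"].value_counts()
max_count = int(counts.max())

pieces = []
for tag, group in data.groupby("tag", sort=False):
    # Repeat each group so that rarer tags appear about as often as the
    # most common one. Only references to the group are added to the list
    # here; no DataFrame is copied inside the loop.
    repeats = max_count // int(counts[tag])
    pieces.extend([group] * repeats)

# A single concat at the end copies each row once, instead of re-copying
# the growing result on every iteration.
balanced = pandas.concat(pieces, ignore_index=True)
balanced_counts = balanced["tag"].value_counts().to_dict()
```

With ignore_index=True the result gets a fresh 0..n-1 index; drop that argument if the original row labels should be kept (they will then contain duplicates).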
