I wrote some code that concatenates parts of a DataFrame back onto the same DataFrame, in order to equalize the number of occurrences of rows per value of a certain column.
```python
import random

import pandas


def normalize(data, expectation):
    """Normalize data by duplicating existing rows"""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    for tag, group in data.groupby(expectation, sort=False):
        array = pandas.DataFrame(columns=data.columns.values)
        i = 0
        while i < (max_count // int(counts[tag])):
            array = pandas.concat([array, group])
            i += 1
        i = max_count % counts[tag]
        if i > 0:
            array = pandas.concat([array, group.loc[random.sample(list(group.index), i)]])
        data = pandas.concat([data, array])
    return data
```
and it is unbelievably slow. Is there a way to concatenate DataFrames quickly, without creating a copy each time?
Answer
There are a couple of things that stand out.
To begin with, the loop
```python
i = 0
while i < (max_count // int(counts[tag])):
    array = pandas.concat([array, group])
    i += 1
```
is going to be very slow. Pandas is not built for these incremental concatenations: each `concat` copies every row accumulated so far, so the total work grows quadratically with the number of rows.
Instead, perhaps you could try
```python
pandas.concat([group] * (max_count // int(counts[tag])))
```
which builds the list first, and then calls `concat` once on the entire list. This should bring the complexity down to linear, and I suspect it will have lower constant factors as well.
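As a toy illustration of the pattern (the frame here is made up):

```python
import pandas

# Toy frame standing in for one `group` from the question (hypothetical data).
group = pandas.DataFrame({"tag": ["a", "a"], "value": [1, 2]})

# One concat call on a prebuilt list, instead of growing a frame in a loop.
repeated = pandas.concat([group] * 3)
print(len(repeated))  # 6 rows
```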
Another thing that would reduce these small concats is using `groupby`-`apply`. Instead of iterating over the result of `groupby`, write the loop body as a function and call `apply` with it; let Pandas figure out how best to concat all of the results into a single DataFrame.
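For illustration, here is a minimal sketch of that shape. `oversample` is a hypothetical helper name, and the toy `data`/`expectation` stand in for the question's variables:

```python
import random

import pandas

# Toy stand-ins for the question's `data` and `expectation` (hypothetical).
data = pandas.DataFrame({"tag": ["a", "a", "a", "b"], "value": [1, 2, 3, 4]})
expectation = "tag"


def oversample(group, max_count):
    """Hypothetical helper: grow one group to max_count rows, mirroring
    the question's loop body (whole copies plus a random remainder)."""
    n = len(group)
    pieces = [group] * (max_count // n)
    remainder = max_count % n
    if remainder > 0:
        pieces.append(group.loc[random.sample(list(group.index), remainder)])
    return pandas.concat(pieces)


max_count = int(data[expectation].value_counts().max())
# apply calls oversample once per group and concatenates the results itself.
extra = data.groupby(expectation, sort=False, group_keys=False).apply(
    oversample, max_count=max_count
)
result = pandas.concat([data, extra])
```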
However, even if you prefer to keep the loop, I'd just append the pieces to a list, and `concat` everything once at the end:
```python
stuff = []
for tag, group in data.groupby(expectation, sort=False):
    # Call stuff.append for any DataFrame you were going to concat.
    ...
result = pandas.concat(stuff)
```
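Filled in with the duplication logic from the question, that pattern might look something like this (a sketch following the question's semantics, not a tested drop-in):

```python
import random

import pandas


def normalize(data, expectation):
    """Same duplication logic as the question's normalize(), but with a
    single concat at the end instead of one per loop iteration."""
    counts = data[expectation].value_counts()
    max_count = int(counts.max())
    stuff = [data]
    for tag, group in data.groupby(expectation, sort=False):
        n = int(counts[tag])
        # Whole copies of the group, built as a list and concatenated once.
        stuff.append(pandas.concat([group] * (max_count // n)))
        remainder = max_count % n
        if remainder > 0:
            # Top up with a random sample of rows for the remainder.
            stuff.append(group.loc[random.sample(list(group.index), remainder)])
    return pandas.concat(stuff)
```

The list only holds references to the group frames, so the rows are copied once, in the final `concat`, rather than on every pass through the loop.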