
Adding an increment to duplicates within a python dataframe

I’m looking to concatenate two columns in a data frame and, where there are duplicates, append an integer at the end. The wrinkle is that I will keep receiving feeds of data, and the increment needs to be aware of values generated for earlier feeds so that it does not reuse them.

I’ve been trying to do this with an apply function, but I run into problems when there are duplicates within a single received data set. I can’t see a way to do this without iterating through the data frame (which is generally frowned upon).

I’ve gotten this far:


Output:


Note that BlueVolkswagon1 and BlueVolkswagon2 were used in previous data feeds, so numbering has to start from 3 here. The real issue is that this data set itself contains duplicate BlueVolkswagon values. Because I can’t update the history in the middle of applying a function to the entire data set, every one of those rows sees the same history and BlueVolkswagon3 is generated more than once.
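The behaviour described above can be reproduced with a small sketch. The original snippet is not shown, so the body of gen_summary, the (color, make) rows, and the history values are assumptions for illustration, and a plain list comprehension stands in for df.apply(..., axis=1).

```python
# History from earlier feeds (assumed values, taken from the example above).
blacklist = ["BlueVolkswagon1", "BlueVolkswagon2"]


def gen_summary(color, make, blacklist):
    """Return the first '<color><make><n>' not in blacklist. Does NOT record it."""
    n = 1
    while f"{color}{make}{n}" in blacklist:
        n += 1
    return f"{color}{make}{n}"


# Hypothetical feed that contains the same combination twice.
rows = [("Blue", "Volkswagon"), ("Red", "Toyota"), ("Blue", "Volkswagon")]

# Stand-in for df.apply(..., axis=1): the history never changes between rows.
summaries = [gen_summary(color, make, blacklist) for color, make in rows]
print(summaries)  # ['BlueVolkswagon3', 'RedToyota1', 'BlueVolkswagon3']
```

Both Blue/Volkswagon rows are checked against the same unchanged history, which is why the suffix 3 appears twice.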

Is there some elegant pythonic way to do this that I can’t wrap my head around or is this a scenario where iterating through the data frame actually does make sense?


Answer

I’m not completely sure what you want to achieve, but you can update blacklist as you go. blacklist is just a reference to the actual list data, so anything appended inside the function is visible to every later call. If you slightly modify gen_summary by adding blacklist.append(summary) before the return statement,

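A minimal sketch of that change, assuming gen_summary scans the history with a while loop and that every value gets a numeric suffix. The function body, the sample rows, and the history values are assumptions, not the code from the question.

```python
def gen_summary(color, make, blacklist):
    """Return an unused '<color><make><n>' and record it in blacklist."""
    n = 1
    while f"{color}{make}{n}" in blacklist:
        n += 1
    summary = f"{color}{make}{n}"
    blacklist.append(summary)  # the list is shared, so the next call sees this value
    return summary


blacklist = ["BlueVolkswagon1", "BlueVolkswagon2"]  # assumed history
rows = [("Blue", "Volkswagon"), ("Red", "Toyota"), ("Blue", "Volkswagon")]

# Stand-in for df.apply(..., axis=1) over hypothetical color/make columns.
summaries = [gen_summary(color, make, blacklist) for color, make in rows]
print(summaries)  # ['BlueVolkswagon3', 'RedToyota1', 'BlueVolkswagon4']
```

With a real data frame the call might look like df.apply(lambda r: gen_summary(r["color"], r["make"], blacklist), axis=1), where the column names are placeholders.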

you will get the following result:


Grouping would be a bit more efficient. This should produce the same result:

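One way the grouped variant could look, as a rough sketch rather than the original code: the scan through the history restarts once per distinct value instead of once per row. The column names color and make and the sample data are hypothetical.

```python
import pandas as pd

blacklist = ["BlueVolkswagon1", "BlueVolkswagon2"]  # assumed history

# Hypothetical feed; the column names are illustrative only.
df = pd.DataFrame({
    "color": ["Blue", "Red", "Blue"],
    "make": ["Volkswagon", "Toyota", "Volkswagon"],
})

key = df["color"] + df["make"]  # concatenated base value per row
df["summary"] = ""

# .groups maps each distinct base value to the index labels of its rows.
for base, idx in key.groupby(key).groups.items():
    n = 1
    for label in idx:
        while f"{base}{n}" in blacklist:  # skip numbers that are already used
            n += 1
        df.loc[label, "summary"] = f"{base}{n}"
        blacklist.append(f"{base}{n}")
        n += 1

print(df["summary"].tolist())  # ['BlueVolkswagon3', 'RedToyota1', 'BlueVolkswagon4']
```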

Is that the result you are looking for? If yes, I’d like to add a suggestion for optimising your approach: Use a dictionary instead of a list for blacklist:


or, with grouping:


Both should produce the same result without the while loop and with a much faster lookup.
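A sketch of the dictionary idea, assuming the history is stored as the last number used per base value rather than as a list of full strings. The name counters and the sample data are made up for illustration.

```python
from collections import defaultdict

# Last number used per base value (assumed history: suffixes 1 and 2 already exist).
counters = defaultdict(int, {"BlueVolkswagon": 2})


def gen_summary(color, make, counters):
    """Bump the counter for this base value and return the numbered string."""
    base = f"{color}{make}"
    counters[base] += 1  # constant-time lookup and update, no scanning
    return f"{base}{counters[base]}"


rows = [("Blue", "Volkswagon"), ("Red", "Toyota"), ("Blue", "Volkswagon")]

# Stand-in for df.apply(..., axis=1) over hypothetical color/make columns.
summaries = [gen_summary(color, make, counters) for color, make in rows]
print(summaries)  # ['BlueVolkswagon3', 'RedToyota1', 'BlueVolkswagon4']
```

Storing only the counter keeps the history small between feeds. For a grouped variant, key.groupby(key).cumcount() gives each row's position within its group, which could be added to the stored counter instead of calling a function per row.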

User contributions licensed under: CC BY-SA