I’m looking to concatenate two columns in data frame and, where there are duplicates, append an integer number at the end. The wrinkle here is that I will keep receiving feeds of data and the increment needs to be aware of historical values that were generated and not reuse them.
I’ve been trying to do this with an apply function but I’m having issues when there are duplicates within a single received data set and I just can’t wrap my head around a way to do this without iterating through the data frame (which is generally frowned upon).
I’ve gotten this far:
import pandas as pd def gen_summary(color, car, blacklist): exists = True increment = 0 summary = color + car while exists: if summary in blacklist: increment += 1 summary = color + car + str(increment) # Append increment if in burn list else: exists = False # Exit this loop return summary def main(): blacklist = ['RedToyota', 'BlueVolkswagon', 'BlueVolkswagon1', 'BlueVolkswagon2'] df = pd.DataFrame( {'color': ['Red', 'Blue', 'Blue', 'Green'], 'car': ['Toyota', 'Volkswagon', 'Volkswagon', 'Hyundai'], 'summary': ['', '', '', '']} ) #print(df) df["summary"] = df.apply(lambda x: gen_summary(x['color'], x['car'], blacklist), axis=1) print(df) if __name__ == "__main__": main()
Output:
color car summary 0 Red Toyota RedToyota1 1 Blue Volkswagon BlueVolkswagon3 2 Blue Volkswagon BlueVolkswagon3 3 Green Hyundai GreenHyundai
Note that BlueVolkswagon1 and BlueVolkswagon2 were used in previous data feeds so it has to start from 3 here. The real issue is that there are duplicate BlueVolkswagon values in just this data set so it doesn’t increment properly and duplicates BlueVolkswagon3 because I can’t update the history in the middle of applying a function to the entire data set.
Is there some elegant pythonic way to do this that I can’t wrap my head around or is this a scenario where iterating through the data frame actually does make sense?
Advertisement
Answer
I’m not completely sure what you want to achieve, but you can update blacklist
in the process. blacklist
is just a pointer to the actual list data. If you slightly modify gen_summary
by adding blacklist.append(summary)
before the return
statement
def gen_summary(color, car, blacklist): ... exists = False # Exit this loop blacklist.append(summary) return summary
you will get following result
color car summary 0 Red Toyota RedToyota1 1 Blue Volkswagon BlueVolkswagon3 2 Blue Volkswagon BlueVolkswagon4 3 Green Hyundai GreenHyundai
Grouping would be a bit more efficient. This should produce the same result:
def gen_summary(ser, blacklist): color_car = ser.iat[0] summary = color_car increment = 0 exists = True while exists: if summary in blacklist: increment += 1 summary = color_car + str(increment) # Append increment if in burn list else: exists = False # Exit this loop return ([color_car + ('' if increment == 0 else str(increment))] + [color_car + str(i + increment) for i in range(1, len(ser))]) df['summary'] = df['color'] + df['car'] df['summary'] = df.groupby(['color', 'car']).transform(gen_summary, blacklist)
Is that the result you are looking for? If yes, I’d like to add a suggestion for optimising your approach: Use a dictionary instead of a list for blacklist
:
def gen_summary(color, car, blacklist): key = color + car num = blacklist.get(key, -1) + 1 blacklist[key] = num return key if num == 0 else f'{key}{num}' blacklist = {'RedToyota': 0, 'BlueVolkswagon': 2}
or with grouping
def gen_summary(ser, blacklist): key = ser.iat[0] num = blacklist.get(key, -1) + 1 return ([f'{key}{"" if num == 0 else num}'] + [f'{key}{i + num}' for i in range(1, len(ser))]) blacklist = {'RedToyota': 0, 'BlueVolkswagon': 2} df['summary'] = df['color'] + df['car'] df['summary'] = df.groupby(['color', 'car']).transform(gen_summary, blacklist)
should produce the same result without the while
-loop and a much faster lookup.