Skip to content
Advertisement

Adding an increment to duplicates within a python dataframe

I’m looking to concatenate two columns in data frame and, where there are duplicates, append an integer number at the end. The wrinkle here is that I will keep receiving feeds of data and the increment needs to be aware of historical values that were generated and not reuse them.

I’ve been trying to do this with an apply function but I’m having issues when there are duplicates within a single received data set and I just can’t wrap my head around a way to do this without iterating through the data frame (which is generally frowned upon).

I’ve gotten this far:

import pandas as pd

def gen_summary(color, car, blacklist):
    exists = True
    increment = 0
    summary = color + car
    while exists:
        if summary in blacklist:
            increment += 1
            summary = color + car + str(increment)  # Append increment if in burn list
        else:
            exists = False  # Exit this loop
    return summary


def main():
    blacklist = ['RedToyota', 'BlueVolkswagon', 'BlueVolkswagon1', 'BlueVolkswagon2']
    df = pd.DataFrame(
        {'color': ['Red', 'Blue', 'Blue', 'Green'],
         'car': ['Toyota', 'Volkswagon', 'Volkswagon', 'Hyundai'],
         'summary': ['', '', '', '']}
    )
    #print(df)
    df["summary"] = df.apply(lambda x: gen_summary(x['color'], x['car'], blacklist), axis=1)
    print(df)


if __name__ == "__main__":
    main()

Output:

   color         car          summary
0    Red      Toyota       RedToyota1
1   Blue  Volkswagon  BlueVolkswagon3
2   Blue  Volkswagon  BlueVolkswagon3
3  Green     Hyundai     GreenHyundai

Note that BlueVolkswagon1 and BlueVolkswagon2 were used in previous data feeds so it has to start from 3 here. The real issue is that there are duplicate BlueVolkswagon values in just this data set so it doesn’t increment properly and duplicates BlueVolkswagon3 because I can’t update the history in the middle of applying a function to the entire data set.

Is there some elegant pythonic way to do this that I can’t wrap my head around or is this a scenario where iterating through the data frame actually does make sense?

Advertisement

Answer

I’m not completely sure what you want to achieve, but you can update blacklist in the process. blacklist is just a pointer to the actual list data. If you slightly modify gen_summary by adding blacklist.append(summary) before the return statement

def gen_summary(color, car, blacklist):
    ...
            exists = False  # Exit this loop
    blacklist.append(summary)
    return summary

you will get following result

   color         car          summary
0    Red      Toyota       RedToyota1
1   Blue  Volkswagon  BlueVolkswagon3
2   Blue  Volkswagon  BlueVolkswagon4
3  Green     Hyundai     GreenHyundai

Grouping would be a bit more efficient. This should produce the same result:

def gen_summary(ser, blacklist):
    color_car = ser.iat[0]
    summary = color_car
    increment = 0
    exists = True
    while exists:
        if summary in blacklist:
            increment += 1
            summary = color_car + str(increment)  # Append increment if in burn list
        else:
            exists = False  # Exit this loop
    return ([color_car + ('' if increment == 0 else str(increment))]
            + [color_car + str(i + increment) for i in range(1, len(ser))])

df['summary'] = df['color'] + df['car']
df['summary'] = df.groupby(['color', 'car']).transform(gen_summary, blacklist)

Is that the result you are looking for? If yes, I’d like to add a suggestion for optimising your approach: Use a dictionary instead of a list for blacklist:

def gen_summary(color, car, blacklist):
    key = color + car
    num = blacklist.get(key, -1) + 1
    blacklist[key] = num
    return key if num == 0 else f'{key}{num}'

blacklist = {'RedToyota': 0, 'BlueVolkswagon': 2}

or with grouping

def gen_summary(ser, blacklist):
    key = ser.iat[0]
    num = blacklist.get(key, -1) + 1
    return ([f'{key}{"" if num == 0 else num}']
            + [f'{key}{i + num}' for i in range(1, len(ser))])

blacklist = {'RedToyota': 0, 'BlueVolkswagon': 2}
df['summary'] = df['color'] + df['car']
df['summary'] = df.groupby(['color', 'car']).transform(gen_summary, blacklist)

should produce the same result without the while-loop and a much faster lookup.

User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement