
Pandas, drop duplicates but merge certain columns

I’m looking for a way to drop duplicate rows based on a certain column subset, while merging the data from some of the other columns so it does not get removed.

import pandas as pd

# Example Dataframe
data = {
    "Parcel": ['001', '002', '003', '003'],
    "Res"   : ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill"  : ['4,100', '2,300', '5,200', '4,000'],
    "Year"  : ['1995', '1990', '2008', '2008']
}
df = pd.DataFrame.from_dict(data=data)
print(df)

  Parcel    Res   Bill  Year
0    001  Henry  4,100  1995
1    002   Nick  2,300  1990
2    003   Paul  5,200  2008
3    003   Bill  4,000  2008

Some pseudocode of what I want would look something like this:

# keep_data is not a real drop_duplicates parameter -- it just illustrates the desired behavior
df = df.drop_duplicates(subset='Parcel', keep_data=['Res', 'Bill'])

  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008

I’m not sure where to begin with this, but any tips as to where to look would be appreciated.


Answer

You can use .groupby with .agg:

df = (
    df.groupby("Parcel")
    # concatenate Res and Bill across duplicate Parcels, keep the first Year
    .agg({"Res": ", ".join, "Bill": ", ".join, "Year": "first"})
    .reset_index()
)
print(df)

Prints:

  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
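
One caveat: ", ".join works here only because every column in the example holds strings. If Bill were numeric, a minimal sketch (df_num is a hypothetical numeric variant of the example frame) is to cast to str inside the aggregation:

import pandas as pd

df_num = pd.DataFrame({
    "Parcel": ["001", "002", "003", "003"],
    "Res": ["Henry", "Nick", "Paul", "Bill"],
    "Bill": [4100, 2300, 5200, 4000],  # numeric this time
    "Year": [1995, 1990, 2008, 2008],
})

out = (
    df_num.groupby("Parcel")
    .agg({
        "Res": ", ".join,
        # cast each group's values to str before joining
        "Bill": lambda s: ", ".join(s.astype(str)),
        "Year": "first",
    })
    .reset_index()
)
print(out)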

EDIT: If you have many columns, you can aggregate all of them with “first” and then update just the joined columns:

g = df.groupby("Parcel")
x = g.agg("first")  # take the first value of every column
x.update(g.agg({"Res": ", ".join, "Bill": ", ".join}))  # overwrite the joined columns
print(x.reset_index())

  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
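
If you would rather do it in a single pass, a sketch (assuming join_cols names the columns to concatenate and every other column should keep its first value) builds the aggregation dict programmatically:

join_cols = ["Res", "Bill"]
agg_map = {
    col: (", ".join if col in join_cols else "first")
    for col in df.columns
    if col != "Parcel"
}
print(df.groupby("Parcel").agg(agg_map).reset_index())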