I’m looking for a way to drop duplicate rows based on a certain column subset, but merge some of the data so it doesn’t get removed.
```python
import pandas as pd

# Example DataFrame
data = {
    "Parcel": ['001', '002', '003', '003'],
    "Res": ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill": ['4,100', '2,300', '5,200', '4,000'],
    "Year": ['1995', '1990', '2008', '2008'],
}
df = pd.DataFrame.from_dict(data=data)
```
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul | 5,200 | 2008 |
003 | Bill | 4,000 | 2008 |
Some pseudocode would look something like this:

```python
df = df.drop_duplicates(subset='Parcel', keep_data=['Res', 'Bill'])
```
Parcel | Res | Bill | Year |
---|---|---|---|
001 | Henry | 4,100 | 1995 |
002 | Nick | 2,300 | 1990 |
003 | Paul, Bill | 5,200, 4,000 | 2008 |
I’m not sure where to begin with this, but any tips as to where to look would be appreciated.
Answer
You can use `.groupby` with `.agg`:
```python
df = (
    df.groupby("Parcel")
    .agg({"Res": ", ".join, "Bill": ", ".join, "Year": "first"})
    .reset_index()
)
print(df)
```
Prints:
```
  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
```
EDIT: If you have many columns, you can aggregate every column with `"first"` and then update the result for just the columns you want joined:
```python
g = df.groupby("Parcel")
x = g.agg("first")
x.update(g.agg({"Res": ", ".join, "Bill": ", ".join}))
print(x.reset_index())
```

which prints:

```
  Parcel         Res          Bill  Year
0    001       Henry         4,100  1995
1    002        Nick         2,300  1990
2    003  Paul, Bill  5,200, 4,000  2008
```
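
If you'd rather not spell out each joined column by hand, a minimal sketch (assuming the same example `df` and that only `Res` and `Bill` need merging; `join_cols`, `agg_map`, and `out` are illustrative names) is to build the aggregation mapping programmatically:

```python
import pandas as pd

# Recreate the example frame so the snippet runs on its own.
df = pd.DataFrame({
    "Parcel": ['001', '002', '003', '003'],
    "Res": ['Henry', 'Nick', 'Paul', 'Bill'],
    "Bill": ['4,100', '2,300', '5,200', '4,000'],
    "Year": ['1995', '1990', '2008', '2008'],
})

join_cols = ["Res", "Bill"]  # assumed: columns to merge into comma-separated strings
agg_map = {c: ", ".join for c in join_cols}
# every other non-key column just keeps its first value per group
agg_map.update({c: "first" for c in df.columns if c not in join_cols + ["Parcel"]})

out = df.groupby("Parcel", as_index=False).agg(agg_map)
print(out)
```

Here `as_index=False` keeps `Parcel` as an ordinary column, so the separate `reset_index()` call isn't needed.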