I’m new to pandas and I have a question.
I have a dataframe like
Code Keywords A Real estate, loan, building, office, land, warehouse B Real Estate Lease , Real Estate, building, Office, Warehouse, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent C Transport Air freight, shift, cargo, truck, insurance, Transport Insurance, Transport D Transport, shift, cargo, truck, insurance, Transport Insurance
and I should remove duplicates on “Keywords” column, no matter if the duplicates are on the same row or on 3 different rows. No matter if it is written “warehouse” or “Warehouse” Everything value duplicated is removed
The result should look like this:
Code Keywords A loan, land B Real Estate Lease, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent C Transport Air freight D
For instance, column “D” will not have keywords at all, because all of them have duplicates on other rows
Thank you
Advertisement
Answer
One way using pandas.Series.str.split
with explode
:
m = df["Keywords"].str.split("s*,s*").explode() m = m[~m.str.lower().duplicated(False)] df["Keywords"] = m.groupby(m.index).apply(", ".join) df = df.fillna("")
Output:
Code Keywords 0 A loan, land 1 B rental, Tenant, broker advisor, Lease and rent 2 C Transport Air freight 3 D