I’m new to pandas and I have a question.
I have a dataframe like
    Code  Keywords
    A     Real estate, loan, building, office, land, warehouse
    B     Real Estate Lease , Real Estate, building, Office, Warehouse, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
    C     Transport Air freight, shift, cargo, truck, insurance, Transport Insurance, Transport
    D     Transport, shift, cargo, truck, insurance, Transport Insurance
and I need to remove duplicates in the “Keywords” column, whether the duplicates are in the same row or spread across different rows, and regardless of case (“warehouse” and “Warehouse” count as the same keyword). Every value that is duplicated should be removed entirely.
The result should look like this:
    Code  Keywords
    A     loan, land
    B     Real Estate Lease, rental, Tenant, broker advisor, Real Estate Lease , Lease and rent
    C     Transport Air freight
    D
For instance, row “D” will have no keywords at all, because all of them are duplicated in other rows.
Thank you
Answer
One way using pandas.Series.str.split with explode:
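    # Split on commas (ignoring surrounding whitespace), one keyword per row
    m = df["Keywords"].str.split(r"\s*,\s*").explode()
    # Drop every keyword that occurs more than once, ignoring case
    m = m[~m.str.lower().duplicated(keep=False)]
    # Join the surviving keywords back together per original row
    df["Keywords"] = m.groupby(m.index).apply(", ".join)
    # Rows left with no keywords become NaN; replace them with empty strings
    df = df.fillna("")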
Output:
      Code  Keywords
    0 A     loan, land
    1 B     rental, Tenant, broker advisor, Lease and rent
    2 C     Transport Air freight
    3 D
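For anyone who wants to try this end to end, here is a minimal self-contained sketch that rebuilds the sample dataframe from the question and applies the same split / explode / duplicated steps. The helper name `drop_repeated_keywords` is made up for illustration, and the sketch assumes the dataframe has a unique index; it also uses `reindex` instead of `fillna` so that only the keyword column is touched.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Code": ["A", "B", "C", "D"],
    "Keywords": [
        "Real estate, loan, building, office, land, warehouse",
        "Real Estate Lease , Real Estate, building, Office, Warehouse, rental, "
        "Tenant, broker advisor, Real Estate Lease , Lease and rent",
        "Transport Air freight, shift, cargo, truck, insurance, "
        "Transport Insurance, Transport",
        "Transport, shift, cargo, truck, insurance, Transport Insurance",
    ],
})


def drop_repeated_keywords(frame, column="Keywords"):
    """Hypothetical helper: remove every keyword that occurs more than once
    anywhere in `column`, comparing case-insensitively.
    Assumes `frame` has a unique index."""
    # One keyword per row; the original row label is repeated in the index.
    tokens = frame[column].str.strip().str.split(r"\s*,\s*").explode()
    # keep=False marks *all* occurrences of a repeated value, not just the later ones.
    unique = tokens[~tokens.str.lower().duplicated(keep=False)]
    # Re-join the surviving keywords per original row.
    joined = unique.groupby(level=0).agg(", ".join)
    out = frame.copy()
    # Rows that lost every keyword are absent from `joined`; give them "".
    out[column] = joined.reindex(frame.index, fill_value="")
    return out


result = drop_repeated_keywords(df)
print(result)
```

The key detail is `duplicated(keep=False)`: the default `keep="first"` would keep one copy of each repeated keyword, whereas the question asks for every repeated value to disappear completely.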