I had a simple Excel task to do today that I figured I'd use some Python to clean up, which led me to pandas and NumPy. Here is what I'd like to know, if it's possible:
I have these columns and about 5k rows:
First Name | Last Name | Email | Address | City
I want to remove duplicates that fall within Address & City, BUT not all rows have an Email or a Last Name. So I want to look at the duplicated rows and delete the one that doesn't contain an email address, keeping the one that does.
However, I have some duplicate rows that have, say, the same Last Name but no e-mail on either, so I'd want to make sure I keep at least one of those rows, or insert NaN or something into the Email field so that at least one of the rows gets kept.
I guess in pseudocode it’d be this:
1. If Last Name & Address & City is a duplicate and the Email Address on both rows is blank, insert a placeholder into one of the rows' Email Address fields.
2. If Address & City is a duplicate, remove the row that does not have an e-mail address assigned to it.
I've gotten it to work by manually going in and doing step 1, which as you can guess is not fun lol. So I'm wondering if it's possible to do it with pandas at all.
Here is sample Data:
df = pd.DataFrame({
    "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
    "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
    "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com", "", "k@zecke.com"],
    "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
    "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
})
The expected output:
First Name    | Last Name | Email               | Address    | City
Ken           | Arnold    |                     | 123 Street | Boston
Bobs Business |           | bb@bobsbusiness.com | 123 Street | Boston
Daniel        | Amigo     | amigo@amigo.com     | 5 Street   | Concord
Kyle          | Zecke     | k@zecke.com         | 5 Street   | Denver
Thank you for any help or pointing me in the right direction! :)
Answer
First of all, thanks for providing example data, so we can easily test code on it. I think you have to do two things:
- sort your data (sort_values)
- delete the duplicated rows (drop_duplicates)
You have to check whether you are using None values or empty strings, because they behave differently when sorting; with None/NaN you may have to change keep to "first".
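For example, if you convert the blank emails to NaN, they sort to the end of each group instead of the beginning, so you would keep the first row of each group rather than the last. A minimal sketch of that variant, using the sample data from the question (it assumes a reasonably recent pandas):

import numpy as np
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
    "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
    "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com", "", "k@zecke.com"],
    "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
    "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
})

# Turn the empty strings into proper missing values.
df["Email"] = df["Email"].replace("", np.nan)

# Missing emails now sort to the end of each group (na_position="last" is the
# default), so within each group the row that has an email comes first ...
df = df.sort_values(by=["Last Name", "Address", "City", "Email"], na_position="last")

# ... and keep="first" keeps that row.
df = df.drop_duplicates(subset=["Last Name", "Address", "City"], keep="first")
print(df)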
With empty strings, as in your sample data, the blank emails sort to the top of each group, so keep="last" keeps the row with an email when there is one (and otherwise just one of the blank rows):

import pandas as pd

def run():
    df = pd.DataFrame({
        "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
        "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
        "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com", "", "k@zecke.com"],
        "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
        "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
    })
    print(df)
    print()
    # Sort so the row with an email ends up last within each
    # (Last Name, Address, City) group, then keep only that last row.
    df.sort_values(by=["Last Name", "Address", "City", "Email"], inplace=True)
    df.drop_duplicates(subset=["Last Name", "Address", "City"], keep="last", inplace=True)
    print(df)

if __name__ == '__main__':
    run()
output:
      First Name Last Name                Email     Address     City
0            Bob    Arnold                       123 Street   Boston
1            Ken    Arnold                       123 Street   Boston
2  Bobs Business           Bb@bobsbusiness.com   123 Street   Boston
3         Daniel     Amigo      amigo@amigo.com    5 Street  Concord
4          Wendy     Amigo                         5 Street  Concord
5           Kyle     Zecke          k@zecke.com    5 Street   Denver

      First Name Last Name                Email     Address     City
2  Bobs Business           Bb@bobsbusiness.com   123 Street   Boston
3         Daniel     Amigo      amigo@amigo.com    5 Street  Concord
1            Ken    Arnold                       123 Street   Boston
5           Kyle     Zecke          k@zecke.com    5 Street   Denver
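Since this started as an Excel task: here is a rough sketch of the full round trip for your real 5k-row file. The file name, sheet name and output name are hypothetical placeholders, and it assumes openpyxl is installed for .xlsx support.

import pandas as pd

# Hypothetical file and sheet names -- adjust to your actual workbook.
df = pd.read_excel("contacts.xlsx", sheet_name="Sheet1", dtype=str)

# Make the blanks empty strings so they behave like the example above.
df = df.fillna("")

# Sort so the row with an email ends up last within each
# (Last Name, Address, City) group, then keep only that last row.
df = df.sort_values(by=["Last Name", "Address", "City", "Email"])
df = df.drop_duplicates(subset=["Last Name", "Address", "City"], keep="last")

# Write the de-duplicated data back out ("contacts_deduped.xlsx" is a placeholder).
df.to_excel("contacts_deduped.xlsx", index=False)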