Skip to content
Advertisement

Delete duplicates if multiple fields follow conditions?

I had a simple Excel task to do today that I figured i’d use some Python to clean up. This led me to pandas and numpy.This is what i’d like to know if possible:

I have these columns and about 5k rows:

First Name | Last Name | Email | Address | City

I want to remove duplicates that fall within Address & City BUT, not all rows have a EMail or a Last Name. So I want to look at the row and delete the row that doesn’t contain a email address, keeping the one that does.

However, I have some duplicate rows that have perhaps the same Last Name but with no e-mail, so i’d want to make sure that I keep atleast one of those rows, or insert NAN or something into the email field so that atleast one of the rows gets kept.

I guess in pseudocode it’d be this:

1. if Last Name & Address & City is a duplicate & Email Address on both rows is blank
             insert a variable into one of the rows Email Address field
2. if Address & City is a duplicate, remove the row that does not have a e-mail address assigned to it.

I’ve gotten it to work by manually going in and doing step 1 which as you can guess, is not fun lol. So i’m wondering if it’s possible to do it with Pandas at all.

Here is sample Data:

       df = pd.DataFrame({
            "First Name": ["Bob", "Ken", "Bobs Business", "Daniel", "Wendy", "Kyle"],
            "Last Name": ["Arnold", "Arnold", "", "Amigo", "Amigo", "Zecke"],
            "Email": ["", "", "Bb@bobsbusiness.com", "amigo@amigo.com",  "", "k@zecke.com"],
            "Address": ["123 Street", "123 Street", "123 Street", "5 Street", "5 Street", "5 Street"],
            "City": ["Boston", "Boston", "Boston", "Concord", "Concord", "Denver"]
        })

The output expected:

First Name Last Name Email                Address        City
Ken         Arnold                       123 Street    Boston
Bobs Business        bb@bobsbusiness.com 123 Street  Boston
Daniel      Amigo    amigo@amigo.com     5 Street      Concord
Kyle        Zecke    k@zecke.com         5 Street      Denver

Thank you for any help or pointing me in the right direction! :)

Advertisement

Answer

First of all, you should provide example data, so we can easily test code on your data. I think you have to do 2 things:

you have to check if you use None values or emtpy string, because they befave differently on sorting, maybe you have to change keep to “first”.

import pandas as pd


def run():
    df = pd.DataFrame({
        "First Name": ["John", "", "Jane", ""],
        "Last Name": ["Last1", "Last2", "Last3", "Last3"],
        "Email": ["", "Email2", None, "Email4"],
        "Address": ["Address1", "Address1", "Address2", "Address2"],
        "City": ["City1", "City1", "City2", "City2"]
    })

    print(df)
    print()

    df.sort_values(by=["Last Name", "Address", "City", "Email"], inplace=True)
    df.drop_duplicates(subset=["Last Name", "Address", "City"], keep="last", inplace=True)

    print(df)


if __name__ == '__main__':
    run()

output:

      First Name Last Name                Email     Address     City
0            Bob    Arnold                       123 Street   Boston
1            Ken    Arnold                       123 Street   Boston
2  Bobs Business            Bb@bobsbusiness.com  123 Street   Boston
3         Daniel     Amigo      amigo@amigo.com    5 Street  Concord
4          Wendy     Amigo                         5 Street  Concord
5           Kyle     Zecke          k@zecke.com    5 Street   Denver

      First Name Last Name                Email     Address     City
2  Bobs Business            Bb@bobsbusiness.com  123 Street   Boston
3         Daniel     Amigo      amigo@amigo.com    5 Street  Concord
1            Ken    Arnold                       123 Street   Boston
5           Kyle     Zecke          k@zecke.com    5 Street   Denver
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement