Skip to content
Advertisement

Replace duplicate value with NaN using groupby

Dataset(MWE)

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                948                     12.74
AS          25-01-2021  7096                949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                950                     14.6

I am trying to replace duplicates from columns {people_vaccinated,people_fully_vaccinated,people_vaccinated_per_hundred} with NaN while using groupby() on location. I tried some solution online, but couldn’t get them working for me, so instead used the below logic

def remove(df , a):
    df['duplicate'] = df[a].shift(1)
    df[a] = df.apply(lambda x: np.nan if x[a] == x['duplicate'] 
                            else x[a], axis=1)
    df = df.drop('duplicate', axis=1)
    return df

dfn = remove(dfn,'people_vaccinated')
dfn = remove(dfn,'people_fully_vaccinated')
dfn = remove(dfn,'people_vaccinated_per_hundred')

The above logic fails when you have consecutive nulls(more than 2). I need to replace duplicates(while keeping the first instance) with NaNs. What is the best way to do this? You can observe from the above snippet that column people_fully_vaccinated has duplicate values

Sample output

location    date        people_vaccinated   people_fully_vaccinated people_vaccinated_per_hundred
AL          12-01-2021  70861               7270                    1.45
AL          13-01-2021  74792               9245                    1.53
AL          14-01-2021  80480               11366                   1.64
AL          15-01-2021  86956               13488                   1.77
AL          16-01-2021  93797               14202                   1.91
AL          17-01-2021  100638              14917                   2.05
AS          22-01-2021  5627                940                     10.1
AS          23-01-2021  5881                948                     10.56
AS          24-01-2021  7096                NaN                     12.74
AS          25-01-2021  NaN                 949                     12.98
AS          26-01-2021  7230                950                     13.23
AS          27-01-2021  8133                NaN                     14.6

Advertisement

Answer

mask + duplicated

We can define a list of column names, then for each column inside a for loop mask the duplicate values per unique location

cols = ['people_vaccinated', 'people_fully_vaccinated', 'people_vaccinated_per_hundred']
for c in cols:
    df[c] = df[c].mask(df.duplicated(['location', c]))

   location        date  people_vaccinated  people_fully_vaccinated  people_vaccinated_per_hundred
0        AL  12-01-2021            70861.0                   7270.0                           1.45
1        AL  13-01-2021            74792.0                   9245.0                           1.53
2        AL  14-01-2021            80480.0                  11366.0                           1.64
3        AL  15-01-2021            86956.0                  13488.0                           1.77
4        AL  16-01-2021            93797.0                  14202.0                           1.91
5        AL  17-01-2021           100638.0                  14917.0                           2.05
6        AS  22-01-2021             5627.0                    940.0                          10.10
7        AS  23-01-2021             5881.0                    948.0                          10.56
8        AS  24-01-2021             7096.0                      NaN                          12.74
9        AS  25-01-2021                NaN                    949.0                          12.98
10       AS  26-01-2021             7230.0                    950.0                          13.23
11       AS  27-01-2021             8133.0                      NaN                          14.60
User contributions licensed under: CC BY-SA
7 People found this is helpful
Advertisement