Dataset(MWE)
location date people_vaccinated people_fully_vaccinated people_vaccinated_per_hundred AL 12-01-2021 70861 7270 1.45 AL 13-01-2021 74792 9245 1.53 AL 14-01-2021 80480 11366 1.64 AL 15-01-2021 86956 13488 1.77 AL 16-01-2021 93797 14202 1.91 AL 17-01-2021 100638 14917 2.05 AS 22-01-2021 5627 940 10.1 AS 23-01-2021 5881 948 10.56 AS 24-01-2021 7096 948 12.74 AS 25-01-2021 7096 949 12.98 AS 26-01-2021 7230 950 13.23 AS 27-01-2021 8133 950 14.6
I am trying to replace duplicates from columns {people_vaccinated,people_fully_vaccinated,people_vaccinated_per_hundred
} with NaN while using groupby()
on location
. I tried some solution online, but couldn’t get them working for me, so instead used the below logic
def remove(df , a): df['duplicate'] = df[a].shift(1) df[a] = df.apply(lambda x: np.nan if x[a] == x['duplicate'] else x[a], axis=1) df = df.drop('duplicate', axis=1) return df dfn = remove(dfn,'people_vaccinated') dfn = remove(dfn,'people_fully_vaccinated') dfn = remove(dfn,'people_vaccinated_per_hundred')
The above logic fails when you have consecutive nulls(more than 2). I need to replace duplicates(while keeping the first instance) with NaNs. What is the best way to do this? You can observe from the above snippet that column people_fully_vaccinated
has duplicate values
Sample output
location date people_vaccinated people_fully_vaccinated people_vaccinated_per_hundred AL 12-01-2021 70861 7270 1.45 AL 13-01-2021 74792 9245 1.53 AL 14-01-2021 80480 11366 1.64 AL 15-01-2021 86956 13488 1.77 AL 16-01-2021 93797 14202 1.91 AL 17-01-2021 100638 14917 2.05 AS 22-01-2021 5627 940 10.1 AS 23-01-2021 5881 948 10.56 AS 24-01-2021 7096 NaN 12.74 AS 25-01-2021 NaN 949 12.98 AS 26-01-2021 7230 950 13.23 AS 27-01-2021 8133 NaN 14.6
Advertisement
Answer
mask
+ duplicated
We can define a list of column names, then for each column inside a for loop mask
the duplicate values per unique location
cols = ['people_vaccinated', 'people_fully_vaccinated', 'people_vaccinated_per_hundred'] for c in cols: df[c] = df[c].mask(df.duplicated(['location', c]))
location date people_vaccinated people_fully_vaccinated people_vaccinated_per_hundred 0 AL 12-01-2021 70861.0 7270.0 1.45 1 AL 13-01-2021 74792.0 9245.0 1.53 2 AL 14-01-2021 80480.0 11366.0 1.64 3 AL 15-01-2021 86956.0 13488.0 1.77 4 AL 16-01-2021 93797.0 14202.0 1.91 5 AL 17-01-2021 100638.0 14917.0 2.05 6 AS 22-01-2021 5627.0 940.0 10.10 7 AS 23-01-2021 5881.0 948.0 10.56 8 AS 24-01-2021 7096.0 NaN 12.74 9 AS 25-01-2021 NaN 949.0 12.98 10 AS 26-01-2021 7230.0 950.0 13.23 11 AS 27-01-2021 8133.0 NaN 14.60