I am cleaning and preparing my data, and the “highpoint_metres” column of my dataframe (members) has a lot of missing values. Since “peak_id” is never missing, I calculated the median height for each peak_id so that the replacement values are more accurate. I would like to do two things: 1) add a new column to my “members” dataframe that holds the median, which differs depending on the “peak_id” (the value calculated by the code below); 2) have the code check whether the value in highpoint_metres is null and, if it is, use the value from the new column instead. I hope this is clearer.
Code:
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
print(members)
mediane_peak_id = members[["peak_id", "highpoint_metres"]].groupby("peak_id", as_index=False).median()
I don’t know how to continue from there (my Python level is very low ;-))
Answer
I believe that’s what you’re looking for:
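import numpy as np
import pandas as pd
members = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/members.csv")
# transform("median") returns one value per row: the median of that row's peak_id group
median_highpoint_by_peak = members.groupby("peak_id")["highpoint_metres"].transform("median")
# Boolean mask that is True where highpoint_metres is missing
is_highpoint_missing = np.isnan(members.highpoint_metres)
# Use the per-peak median where the value is missing, otherwise keep the original value
members["highpoint_meters_imputed"] = np.where(is_highpoint_missing, median_highpoint_by_peak, members.highpoint_metres)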
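To see what the imputation does without downloading the full file, here is a minimal sketch on a small made-up dataframe. The name `toy`, its values, and the column name `highpoint_metres_imputed` are assumptions for illustration only, not part of the real members.csv. It uses `Series.fillna` with the per-peak median, which should give the same result as the `np.where` version above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real members.csv
toy = pd.DataFrame({
    "peak_id": ["AAA", "AAA", "AAA", "BBB", "BBB"],
    "highpoint_metres": [7000.0, np.nan, 7200.0, 8000.0, np.nan],
})

# One median per row, computed within that row's peak_id group
# (missing values are skipped when the median is computed)
median_by_peak = toy.groupby("peak_id")["highpoint_metres"].transform("median")

# Keep existing values and fill only the missing ones with the group median
toy["highpoint_metres_imputed"] = toy["highpoint_metres"].fillna(median_by_peak)

print(toy)
```

Note that if every row of a given peak_id is missing its height, the group median is itself NaN, so those rows would stay missing and need a separate fallback.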