I have a dataframe with missing values. for each index in a column group, i want to replace these values seperately. If all of the values in a group are missing, i want to replace the values with 1. If only some of the values are missing, i want to replace it with data from an imputed dataframe
dataframe 1
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | NaN | NaN |
2 | 3 | NaN | 3 | 3 |
dataframe 2 (the imputed one)
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | 2 | 2 |
2 | 3 | 2 | 3 | 3 |
output:
index | d0_1 | d0_2 | d1_1 | d1_2 |
---|---|---|---|---|
group | d0 | d0 | d1 | d1 |
1 | 3 | 3 | 1 | 1 |
2 | 3 | 2 | 3 | 3 |
my data is much larger and the groups are larger as well.
Ive been struggling now with this for days, i just cant seem to find a working solution
my current solution is iterating over all the groups, and using groupby.transform to replace values, but i dont know how to tell the lambda function to take the values from my second data frame, and my current lambda function also doesnt replace all the values with 1 either, instead just returning the old groups with no changes
df1 = pd.read_csv("file.txt", sep = "t", index_col = "T: Protein.Group") def group(a: pd.DataFrame): a_grouped = a.groupby(["group"] , axis=1) return a_grouped def getgroup(a: pd.DataFrame): new_idx = pd.MultiIndex.from_arrays([ a.columns, a.columns.str.extract("(dd+)_d+", expand = False) ], names=["index", "group"]) a.columns = new_idx return a df1grp = group(getgroup(df1)) for i in list(df1grg.groups.keys()) df1grp.get_group(i).transform( lambda x: 1 if x.eq(np.nan).all() else x )
Advertisement
Answer
IIUC:
df = df1.mask(df1.groupby('group', axis=1).count() == 0, 1) df = df.where(~df.isna(), df2) >>> df index d0_1 d0_2 d1_1 d1_2 group d0 d0 d1 d1 1 3 3 1 1 2 3 2 3 3
This is assuming the columns are indeed a MultiIndex
as you describe, e.g.:
>>> df1.columns MultiIndex([('d0_1', 'd0'), ('d0_2', 'd0'), ('d1_1', 'd1'), ('d1_2', 'd1')], names=['index', 'group'])
(I initially spent a few minutes making such columns as I thought you had simple Index columns and a first row with the 'group'
…)