I have a dataframe with missing values. For each index in a column group, I want to replace these values separately: if all of the values in a group are missing, replace them with 1; if only some of the values in a group are missing, replace them with the data from an imputed dataframe.
dataframe 1
| index | d0_1 | d0_2 | d1_1 | d1_2 | 
|---|---|---|---|---|
| group | d0 | d0 | d1 | d1 | 
| 1 | 3 | 3 | NaN | NaN | 
| 2 | 3 | NaN | 3 | 3 | 
dataframe 2 (the imputed one)
| index | d0_1 | d0_2 | d1_1 | d1_2 | 
|---|---|---|---|---|
| group | d0 | d0 | d1 | d1 | 
| 1 | 3 | 3 | 2 | 2 | 
| 2 | 3 | 2 | 3 | 3 | 
output:
| index | d0_1 | d0_2 | d1_1 | d1_2 | 
|---|---|---|---|---|
| group | d0 | d0 | d1 | d1 | 
| 1 | 3 | 3 | 1 | 1 | 
| 2 | 3 | 2 | 3 | 3 | 
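For reference, here is a small sketch that rebuilds the two example frames above (my real data is read from a file, so the names and values here are just the toy example):

import numpy as np
import pandas as pd

# MultiIndex columns: original column name plus its group
cols = pd.MultiIndex.from_tuples(
    [("d0_1", "d0"), ("d0_2", "d0"), ("d1_1", "d1"), ("d1_2", "d1")],
    names=["index", "group"],
)
df1 = pd.DataFrame(
    [[3, 3, np.nan, np.nan],
     [3, np.nan, 3, 3]],
    index=[1, 2], columns=cols,
)
df2 = pd.DataFrame(
    [[3, 3, 2, 2],
     [3, 2, 3, 3]],
    index=[1, 2], columns=cols,
)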
My real data is much larger and the groups are larger as well.
I've been struggling with this for days now and just can't seem to find a working solution.
My current approach iterates over all the groups and uses groupby.transform to replace values, but I don't know how to tell the lambda function to take the values from my second dataframe, and the lambda doesn't replace the all-missing groups with 1 either; it just returns the old groups unchanged:
df1 = pd.read_csv("file.txt", sep = "t", index_col = "T: Protein.Group")
def group(a: pd.DataFrame):
    a_grouped = a.groupby(["group"] , axis=1)
    return a_grouped
def getgroup(a: pd.DataFrame):
    new_idx = pd.MultiIndex.from_arrays([
        a.columns,
        a.columns.str.extract("(dd+)_d+", expand = False)
    ], names=["index", "group"])
    a.columns = new_idx
    return a
df1grp = group(getgroup(df1))
for i in list(df1grg.groups.keys())
        df1grp.get_group(i).transform(
            lambda x: 1 if x.eq(np.nan).all() else x
        )
Answer
IIUC:
df = df1.mask(df1.groupby('group', axis=1).count() == 0, 1)
df = df.where(~df.isna(), df2)
>>> df
index d0_1 d0_2 d1_1 d1_2
group   d0   d0   d1   d1
1        3    3    1    1
2        3    2    3    3
This is assuming the columns are indeed a MultiIndex as you describe, e.g.:
>>> df1.columns
MultiIndex([('d0_1', 'd0'),
            ('d0_2', 'd0'),
            ('d1_1', 'd1'),
            ('d1_2', 'd1')],
           names=['index', 'group'])
(I initially spent a few minutes building such columns myself, as I thought you had a simple Index for the columns and a first row holding the 'group'…)
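If groupby(..., axis=1) raises a deprecation warning on your pandas version, the same idea can be expressed on the transposed frame instead (a sketch under the same MultiIndex-columns assumption, with df2 aligned to df1):

# count non-NaN values per group and broadcast the count back to every column
counts = df1.T.groupby(level="group").transform("count").T
df = df1.mask(counts == 0, 1)   # groups that are entirely NaN -> 1
df = df.fillna(df2)             # remaining NaNs -> take the imputed values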