I would like to fill the empty rows of the dataframe:
df= pd.DataFrame({ 'id':['A', 'B', 'C','D','J','K','Z','Y','H','G'], 'test1':[10, 9, 8,7,np.nan,6,np.nan,np.nan,5,np.nan] })
based on the following dictionary:
dic1={'A': [['K', 'J'], 2.0], 'B': [np.nan, np.nan], 'C': [['Y'], 1.0], 'D': [['B', 'C'], 2.0], 'J': [np.nan, np.nan], 'K': [np.nan, np.nan], 'G': [['A', 'H'], 2.0], 'Y': [['Z'], 1.0], 'H': [np.nan, np.nan], 'Z': [['G'], 1.0]}
dic1 maps each id in df to the list of its kids and the number of kids. For instance, A is the parent of K and J, J has no kids, and G has A and H as kids.
The empty rows in df belong to the ids J, Y, Z, and G. These ids are collected in a list: new_list=['J', 'G', 'Y', 'Z']
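As a side note, this list does not have to be typed by hand. A minimal sketch, assuming df is defined as above, that derives it from the missing values. The ids come out in row order, so the order differs from the hand-written list, but the members are the same:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'J', 'K', 'Z', 'Y', 'H', 'G'],
    'test1': [10, 9, 8, 7, np.nan, 6, np.nan, np.nan, 5, np.nan],
})

# ids whose test1 is missing, in the row order of df
new_list = df.loc[df['test1'].isna(), 'id'].tolist()
print(new_list)  # ['J', 'Z', 'Y', 'G']
```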
I want to fill the rows for these ids according to the following rules:
1. If the id has no kids (NaN in dic1), assign zero to test1.
2. If the id has one kid and that kid is not new (not in new_list), fill test1 with the kid's test1 value.
3. If the id has more than one kid (e.g. ['A', 'H'] for id G) and none of them is new, fill test1 with the maximum test1 value over all of its kids (here, both A and H).
4. If the id has a kid that is new, first apply rules 1, 2 and 3 to that kid and then to the id in question. If that kid in turn has a new kid that has not been filled in df yet, fill that one first.
So far I have managed to implement rules 1, 2 and 3, but I do not know how to handle rule 4 or how to integrate it into my code. The correct output should look like this:
df= pd.DataFrame({ 'id':['A', 'B', 'C','D','J','K','Z','Y','H','G'], 'test1':[10, 9, 8,7,0,6,10,10,5,10] })
My code so far is:
new_list=['J', 'G', 'Y', 'Z']
dic_df=dict(zip(df.task_id, df.test1))
act_aa={}
def test_newcase(i):
    if str(dic1[i][0])=='nan':
        df.loc[df['task_id'] == i, ['test1']] = 0
    else:
        if any(x not in new_list for x in dic1[i][0]):
            if dic1[i][1]==1.0:
                for k in dic1[i][0]:
                    df.loc[df['task_id'] == i, ['test1','test2']] = dic_df[k][0]
            else:
                for k in dic1[i][0]:
                    act_aa[k]=str(dic_df[k])[0]
                if act_aa :
                    df.loc[df['task_id'] == i, ['test1']] = max(act_aa.values())
for i in new_list:
    test_newcase(i)
df
The ids are updated as the conditions are processed: if an id has no kids (NaN), df is updated right away and the next id in new_list is examined. However, the algorithm can start from any id in new_list, so it should check whether the id has kids that are in new_list and, if it does, fill in those kids first and then come back to the id in question. For instance, if we look at Y first, we should check its kid, which is Z. If the value of Z in df has not been filled yet, we should fill it first and then return to Y. To fill Z, we in turn check its kids and whether they are filled, and so on.
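The "fill the kid first, then come back" order described here can be sketched as a depth-first recursion. This is only an illustration under stated assumptions: `values` and `resolve` are hypothetical names, df and dic1 are the ones from the question, and rule 2 is treated as the one-kid case of rule 3 (the maximum over a single kid is that kid's value). The `visiting` set is an added safeguard against ids that depend on each other in a cycle, which the question does not mention.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'J', 'K', 'Z', 'Y', 'H', 'G'],
    'test1': [10, 9, 8, 7, nan, 6, nan, nan, 5, nan],
})
dic1 = {'A': [['K', 'J'], 2.0], 'B': [nan, nan], 'C': [['Y'], 1.0],
        'D': [['B', 'C'], 2.0], 'J': [nan, nan], 'K': [nan, nan],
        'G': [['A', 'H'], 2.0], 'Y': [['Z'], 1.0], 'H': [nan, nan],
        'Z': [['G'], 1.0]}

# Plain dict of id -> value; NaN marks "not filled yet".
values = dict(zip(df['id'], df['test1']))

def resolve(node, visiting=frozenset()):
    """Return the value of `node`, filling unfilled kids first (hypothetical helper)."""
    if not pd.isna(values[node]):
        return values[node]                  # already filled
    if node in visiting:
        raise ValueError(f"cycle detected at {node!r}")
    kids = dic1[node][0]
    if not isinstance(kids, list):           # rule 1: no kids
        values[node] = 0.0
    else:                                    # rules 2-4: resolve kids, take the max
        values[node] = max(resolve(k, visiting | {node}) for k in kids)
    return values[node]

for node in df['id']:
    resolve(node)

df['test1'] = df['id'].map(values)
print(df)
```

With this approach the starting id does not matter: starting at Y triggers Z, which triggers G, before Y itself is filled.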
Any help would be appreciated.
Answer
The code below works, but let me explain what I did to make it work.
First of all, the structure is similar to yours: I use a for loop over the ids in new_list. For each id I check whether rule 1 applies; otherwise rule 2 or rule 3 may apply. For rule 2, I check that there is only one kid and that the kid is not in new_list. For rule 3, there must be more than one kid and ALL of them must NOT be in new_list.
The trick for rule 4 is a while loop: apply rules 1, 2 and 3, then recompute new_list from the rows that are still empty. As long as this list is not empty, we apply rules 1, 2 and 3 again, updating the dataframe each time.
In the first pass, J and G get their values.
In the second pass, Z gets its value.
Finally, in the third pass, Y gets its value.
new_list = list(df.id[df.test1.isnull()])
while len(new_list) > 0:
    for i in new_list:
        if str(dic1[i][0]) == 'nan':
            # rule 1
            df.loc[df.id == i, 'test1'] = 0
        else:
            if dic1[i][1] == 1:
                kid = dic1[i][0][0]
                if kid not in new_list:
                    # rule 2
                    df.loc[df.id == i, 'test1'] = df.test1[df.id == kid].values[0]
            else:
                kids = dic1[i][0]
                if all(kid not in new_list for kid in kids):
                    # rule 3
                    max_value = df.test1[df.id.isin(kids)].max()
                    df.loc[df.id == i, 'test1'] = max_value
    new_list = list(df.id[df.test1.isnull()])
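One caveat with this kind of loop: if the remaining ids ever depended on each other in a cycle, no pass would fill anything and the loop would not terminate. The following is a self-contained sketch of the same pass-based idea with a progress check added. The function name `fill_in_passes` is hypothetical, rules 2 and 3 are merged into a single maximum over the kids, and the returned list of passes is only there to make the order of filling visible.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    'id': ['A', 'B', 'C', 'D', 'J', 'K', 'Z', 'Y', 'H', 'G'],
    'test1': [10, 9, 8, 7, nan, 6, nan, nan, 5, nan],
})
dic1 = {'A': [['K', 'J'], 2.0], 'B': [nan, nan], 'C': [['Y'], 1.0],
        'D': [['B', 'C'], 2.0], 'J': [nan, nan], 'K': [nan, nan],
        'G': [['A', 'H'], 2.0], 'Y': [['Z'], 1.0], 'H': [nan, nan],
        'Z': [['G'], 1.0]}

def fill_in_passes(df, dic1):
    """Apply rules 1-3 in repeated passes; return the ids filled in each pass."""
    passes = []
    new_list = df.loc[df['test1'].isna(), 'id'].tolist()
    while new_list:
        for i in new_list:
            kids = dic1[i][0]
            if not isinstance(kids, list):
                # rule 1: no kids
                df.loc[df['id'] == i, 'test1'] = 0
            elif all(kid not in new_list for kid in kids):
                # rules 2 and 3: all kids already known, take their maximum
                df.loc[df['id'] == i, 'test1'] = df.loc[df['id'].isin(kids), 'test1'].max()
        remaining = df.loc[df['test1'].isna(), 'id'].tolist()
        filled = [i for i in new_list if i not in remaining]
        if not filled:
            # Added safeguard (not in the original loop): stop instead of spinning forever.
            raise ValueError(f"no progress, unresolved ids: {remaining}")
        passes.append(filled)
        new_list = remaining
    return passes

passes = fill_in_passes(df, dic1)
print(passes)  # [['J', 'G'], ['Z'], ['Y']]
print(df)
```

The printed passes match the order described above: J and G first, then Z, then Y.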