Defining Parent For a Dataset with Several Conditions in Pandas

Question

I have a CSV file with more than 10,000,000 rows of data, structured as shown below. Each group is identified by a unique ID.

Data Format:

The parent relationship is defined by the following conditions:

- Each group MUST have 1 Head.
- Each group MAY have a Senior, but ONLY 1.
- Each group MUST have AT LEAST one Junior.

Expected Result:
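The three conditions can be checked per group before any parents are derived. This is a minimal sketch, assuming the columns are named ID, Type and Name, as the accepted answer's output suggests. The sample rows are taken from that output, and `check_groups` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

# Sample rows taken from the accepted answer's printed output.
# The column names ID, Type and Name are assumed from that output.
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Type": ["Head", "Senior", "Junior", "Junior",
             "Head", "Senior", "Junior",
             "Head", "Junior"],
    "Name": [f"abc-{i:03d}" for i in range(1, 10)],
})


def check_groups(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-ID counts of each Type plus a 'valid' flag."""
    # Count how many rows of each Type appear in each ID group.
    counts = pd.crosstab(frame["ID"], frame["Type"])
    # Ensure all three columns exist even if a Type never occurs in the data.
    counts = counts.reindex(columns=["Head", "Senior", "Junior"], fill_value=0)
    counts["valid"] = (
        (counts["Head"] == 1)        # exactly one Head
        & (counts["Senior"] <= 1)    # at most one Senior
        & (counts["Junior"] >= 1)    # at least one Junior
    )
    return counts


report = check_groups(df)
print(report)
```

Groups where `valid` is False break at least one of the conditions and could be inspected separately.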

Accepted Answer

You could pivot the Type and Name columns, then forward-fill within each ID group. The two right-most non-NaN entries in each row then give the Name and its Parent.

Pivot and forward-fill:

    import numpy as np
    import pandas as pd

    cols = ['Head', 'Senior', 'Junior']
    # One column per Type, holding the Name on the row where that Type occurs.
    dfn = pd.concat([df[['ID', 'Type']], df.pivot(columns='Type', values='Name')], axis=1)
    # Forward-fill within each ID so later rows inherit the group's Head and Senior.
    dfn[cols] = dfn.groupby('ID')[cols].ffill()
    dfn = dfn[['ID', 'Type'] + cols]
    print(dfn)

       ID    Type     Head   Senior   Junior
    0   1    Head  abc-001      NaN      NaN
    1   1  Senior  abc-001  abc-002      NaN
    2   1  Junior  abc-001  abc-002  abc-003
    3   1  Junior  abc-001  abc-002  abc-004
    4   2    Head  abc-005      NaN      NaN
    5   2  Senior  abc-005  abc-006      NaN
    6   2  Junior  abc-005  abc-006  abc-007
    7   3    Head  abc-008      NaN      NaN
    8   3  Junior  abc-008      NaN  abc-009

A function to pull the last two non-NaN entries:

    def get_np(x):
        # Default for Head rows, which have no parent.
        rc = [np.nan, np.nan]
        if x.isna().sum() != 2:
            if x.isna().sum() == 0:
                # Junior in a group that has a Senior.
                rc = [x['Junior'], x['Senior']]
            elif pd.isna(x['Junior']):
                # Senior row: its parent is the Head.
                rc = [x['Senior'], x['Head']]
            else:
                # Junior in a group without a Senior.
                rc = [x['Junior'], x['Head']]
        return pd.concat([x[['ID', 'Type']], pd.Series(rc, index=['Name', 'Parent'])])

Apply it and drop the non-applicable rows:

    dfn.apply(get_np, axis=1).dropna()

       ID    Type     Name   Parent
    1   1  Senior  abc-002  abc-001
    2   1  Junior  abc-003  abc-002
    3   1  Junior  abc-004  abc-002
    5   2  Senior  abc-006  abc-005
    6   2  Junior  abc-007  abc-006
    8   3  Junior  abc-009  abc-008
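Row-wise `apply` can be slow on 10,000,000 rows. As an alternative (not the answerer's method), the same parent lookup can be sketched with column operations only. This assumes the columns are named ID, Type and Name, and that each ID has exactly one Head and at most one Senior, as the question requires. `add_parent` is a hypothetical helper name.

```python
import pandas as pd

# Sample rows taken from the printed output above; column names are assumed.
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Type": ["Head", "Senior", "Junior", "Junior",
             "Head", "Senior", "Junior",
             "Head", "Junior"],
    "Name": [f"abc-{i:03d}" for i in range(1, 10)],
})


def add_parent(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: attach a Parent column without row-wise apply."""
    # Lookup tables ID -> Name. They rely on one Head and at most one
    # Senior per ID; duplicates would make the lookup ambiguous.
    head = frame.loc[frame["Type"] == "Head"].set_index("ID")["Name"]
    senior = frame.loc[frame["Type"] == "Senior"].set_index("ID")["Name"]

    head_of = frame["ID"].map(head)      # Head name for every row
    senior_of = frame["ID"].map(senior)  # Senior name, NaN if the group has none

    # Juniors report to the Senior when one exists, otherwise to the Head.
    parent = senior_of.fillna(head_of)
    # Every non-Junior row (Senior or Head) gets the Head as a provisional parent.
    parent = parent.where(frame["Type"] == "Junior", head_of)
    # Heads have no parent at all.
    parent = parent.mask(frame["Type"] == "Head")

    out = frame.assign(Parent=parent).dropna(subset=["Parent"])
    return out[["ID", "Type", "Name", "Parent"]]


result = add_parent(df)
print(result)
```

On the sample data this yields the same six rows as the `apply`-based version. Unlike forward-filling, the lookup does not depend on the Head row appearing before the Senior and Junior rows within a group.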

Answer