Defining Parent For a Dataset with Several Conditions in Pandas

Question

I have a CSV file with more than 10,000,000 rows of data, in which an ID uniquely identifies each group.

To define the parent relationship, the following conditions apply: Each group MUST have 1 Head. It is OPTIONAL to have ONLY 1 Senior in each group. Each group MUST have AT LEAST one Junior. …
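The three conditions above can be checked before computing any parents. The following is a minimal sketch, not part of the original post: the helper name `invalid_groups` is hypothetical, and the column names `ID` and `Type` are assumed from the answer's output.

```python
import pandas as pd

def invalid_groups(df):
    """Return the per-type counts of every group that breaks a condition."""
    # One row per ID, one column per Type; types that never occur count as 0.
    counts = pd.crosstab(df["ID"], df["Type"]).reindex(
        columns=["Head", "Senior", "Junior"], fill_value=0
    )
    ok = (
        (counts["Head"] == 1)       # exactly one Head
        & (counts["Senior"] <= 1)   # at most one Senior
        & (counts["Junior"] >= 1)   # at least one Junior
    )
    return counts[~ok]

# Hypothetical sample: group 1 is valid, group 2 has two Heads and no Junior.
sample = pd.DataFrame({
    "ID":   [1, 1, 1, 2, 2],
    "Type": ["Head", "Senior", "Junior", "Head", "Head"],
})
bad = invalid_groups(sample)
print(bad)
```

An empty result means every group satisfies all three conditions.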

Accepted Answer

You could pivot the Type and Name columns, then forward-fill within each ID group. Then take the two right-most non-NaN entries to get the Name and the Parent.

Pivot and forward-fill:

    import numpy as np
    import pandas as pd

    # Spread Name into one column per Type (Head, Junior, Senior).
    wide = df.pivot(columns='Type', values='Name')
    # Carry the Head and Senior names down the rows of each ID group.
    dfn = pd.concat([df[['ID','Type']], wide.groupby(df['ID']).ffill()], axis=1)
    dfn = dfn[['ID','Type','Head','Senior','Junior']]
    print(dfn)

       ID    Type     Head   Senior   Junior
    0   1    Head  abc-001      NaN      NaN
    1   1  Senior  abc-001  abc-002      NaN
    2   1  Junior  abc-001  abc-002  abc-003
    3   1  Junior  abc-001  abc-002  abc-004
    4   2    Head  abc-005      NaN      NaN
    5   2  Senior  abc-005  abc-006      NaN
    6   2  Junior  abc-005  abc-006  abc-007
    7   3    Head  abc-008      NaN      NaN
    8   3  Junior  abc-008      NaN  abc-009

A function to pull the last two non-NaN entries:

    def get_np(x):
        # Head rows (two NaNs: Senior and Junior) keep NaN and are dropped later.
        rc = [np.nan, np.nan]
        if x.isna().sum() != 2:
            if x.isna().sum() == 0:
                # Junior in a group that has a Senior: parent is the Senior.
                rc = [x['Junior'], x['Senior']]
            elif pd.isna(x['Junior']):
                # Senior row: parent is the Head.
                rc = [x['Senior'], x['Head']]
            else:
                # Junior in a group without a Senior: parent is the Head.
                rc = [x['Junior'], x['Head']]
        return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])

Apply it and drop the non-applicable rows:

    dfn.apply(get_np, axis=1).dropna()

       ID    Type     Name   Parent
    1   1  Senior  abc-002  abc-001
    2   1  Junior  abc-003  abc-002
    3   1  Junior  abc-004  abc-002
    5   2  Senior  abc-006  abc-005
    6   2  Junior  abc-007  abc-006
    8   3  Junior  abc-009  abc-008
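With more than 10,000,000 rows, the row-wise `apply` step may become slow. The sketch below is an alternative, not the answer's method: it computes the same Parent column with vectorized operations only. The helper name `add_parent` is hypothetical, and the sample frame is reconstructed from the output shown in the answer, so the input layout (columns `ID`, `Type`, `Name`) is an assumption.

```python
import pandas as pd

# Sample input reconstructed from the answer's printed output (assumed layout).
df = pd.DataFrame({
    "ID":   [1, 1, 1, 1, 2, 2, 2, 3, 3],
    "Type": ["Head", "Senior", "Junior", "Junior", "Head",
             "Senior", "Junior", "Head", "Junior"],
    "Name": ["abc-001", "abc-002", "abc-003", "abc-004", "abc-005",
             "abc-006", "abc-007", "abc-008", "abc-009"],
})

def add_parent(df):
    """Senior -> Head; Junior -> Senior if the group has one, else Head."""
    is_senior = df["Type"] == "Senior"
    is_junior = df["Type"] == "Junior"
    # Broadcast each group's Head and Senior name to every row of the group.
    # "first" skips NaN, so a group without a Senior gets NaN here.
    head = df["Name"].where(df["Type"] == "Head").groupby(df["ID"]).transform("first")
    senior = df["Name"].where(is_senior).groupby(df["ID"]).transform("first")
    # Junior rows take the Senior (falling back to the Head); Senior rows
    # take the Head; Head rows stay NaN and are dropped.
    parent = senior.fillna(head).where(is_junior, head.where(is_senior))
    return df.assign(Parent=parent).dropna(subset=["Parent"])

result = add_parent(df)
print(result)
```

Unlike the forward-fill, this variant does not rely on the Head and Senior rows appearing before the Junior rows within each group.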

Answer