Skip to content
Advertisement

Defining Parent For a Dataset with Several Conditions in Pandas

I have a CSV file with more than 10,000,000 rows of data with below structures: I have an ID as my uniqueID per group:

Data Format

ID      Type        Name
1       Head        abc-001
1       Senior      abc-002
1       Junior      abc-003
1       Junior      abc-004     
2       Head        abc-005     
2       Senior      abc-006 
2       Junior      abc-007 
3       Head        abc-008     
3       Junior      abc-009     
...

For defining parent relationship below conditions exist:

  1. Each group MUST has 1 Head.
  2. It is OPTIONAL to have ONLY 1 Senior in each group.
  3. Each group MUST have AT LEAST one Junior.

EXPECTED RESULT

ID      Type        Name        Parent
1       Senior      abc-002     abc-001
1       Junior      abc-003     abc-002
1       Junior      abc-004     abc-002
2       Senior      abc-006     abc-005
2       Junior      abc-007     abc-006
3       Junior      abc-009     abc-008

Below code works when I have one Junior, I want to know if there is any way to define parent for more than one juniors:

order = ['Head', 'Senior', 'Junior']
key = pd.Series({x: i for i,x in enumerate(order)})
df2 = df.sort_values(by='Type', key=key.get)
df4=df.join(df2.groupby('IP')['Type'].shift().dropna().rename('Parent'),how='right')
print(df4)

Advertisement

Answer

You could pivot the Type and Name columns then forword fill within ID group. Then take the right-hand two non-NaN entries to get the Parent and Name.

Pivot and forward-fill:

dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1) 
    .groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']]
print(dfn)

   ID    Type     Head   Senior   Junior
0   1    Head  abc-001      NaN      NaN
1   1  Senior  abc-001  abc-002      NaN
2   1  Junior  abc-001  abc-002  abc-003
3   1  Junior  abc-001  abc-002  abc-004
4   2    Head  abc-005      NaN      NaN
5   2  Senior  abc-005  abc-006      NaN
6   2  Junior  abc-005  abc-006  abc-007
7   3    Head  abc-008      NaN      NaN
8   3  Junior  abc-008      NaN  abc-009

A function to pull the last two non-NaN entries:

def get_np(x):
    rc = [np.nan,np.nan]
    
    if x.isna().sum() != 2:
        if x.isna().sum() == 0:
            rc = [x['Junior'],x['Senior']]
        elif pd.isna(x['Junior']):
            rc = [x['Senior'],x['Head']]
        else:
            rc = [x['Junior'],x['Head']]
   
    return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])
    

Apply it and drop the non-applicable rows:

dfn.apply(get_np, axis=1).dropna()

   ID    Type     Name   Parent
1   1  Senior  abc-002  abc-001
2   1  Junior  abc-003  abc-002
3   1  Junior  abc-004  abc-002
5   2  Senior  abc-006  abc-005
6   2  Junior  abc-007  abc-006
8   3  Junior  abc-009  abc-008 
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement