I have a CSV file with more than 10,000,000 rows of data with below structures: I have an ID as my uniqueID per group:
Data Format
ID Type Name 1 Head abc-001 1 Senior abc-002 1 Junior abc-003 1 Junior abc-004 2 Head abc-005 2 Senior abc-006 2 Junior abc-007 3 Head abc-008 3 Junior abc-009 ...
For defining parent relationship below conditions exist:
- Each group MUST has 1 Head.
- It is OPTIONAL to have ONLY 1 Senior in each group.
- Each group MUST have AT LEAST one Junior.
EXPECTED RESULT
ID Type Name Parent 1 Senior abc-002 abc-001 1 Junior abc-003 abc-002 1 Junior abc-004 abc-002 2 Senior abc-006 abc-005 2 Junior abc-007 abc-006 3 Junior abc-009 abc-008
Below code works when I have one Junior, I want to know if there is any way to define parent for more than one juniors:
order = ['Head', 'Senior', 'Junior'] key = pd.Series({x: i for i,x in enumerate(order)}) df2 = df.sort_values(by='Type', key=key.get) df4=df.join(df2.groupby('IP')['Type'].shift().dropna().rename('Parent'),how='right') print(df4)
Advertisement
Answer
You could pivot the Type
and Name
columns then forword fill within ID
group. Then take the right-hand two non-NaN entries to get the Parent
and Name
.
Pivot and forward-fill:
dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1) .groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']] print(dfn) ID Type Head Senior Junior 0 1 Head abc-001 NaN NaN 1 1 Senior abc-001 abc-002 NaN 2 1 Junior abc-001 abc-002 abc-003 3 1 Junior abc-001 abc-002 abc-004 4 2 Head abc-005 NaN NaN 5 2 Senior abc-005 abc-006 NaN 6 2 Junior abc-005 abc-006 abc-007 7 3 Head abc-008 NaN NaN 8 3 Junior abc-008 NaN abc-009
A function to pull the last two non-NaN entries:
def get_np(x): rc = [np.nan,np.nan] if x.isna().sum() != 2: if x.isna().sum() == 0: rc = [x['Junior'],x['Senior']] elif pd.isna(x['Junior']): rc = [x['Senior'],x['Head']] else: rc = [x['Junior'],x['Head']] return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])
Apply it and drop the non-applicable rows:
dfn.apply(get_np, axis=1).dropna() ID Type Name Parent 1 1 Senior abc-002 abc-001 2 1 Junior abc-003 abc-002 3 1 Junior abc-004 abc-002 5 2 Senior abc-006 abc-005 6 2 Junior abc-007 abc-006 8 3 Junior abc-009 abc-008