I have a CSV file with more than 10,000,000 rows of data with below structures: I have an ID as my uniqueID per group:
Data Format
JavaScript
x
12
12
1
ID Type Name
2
1 Head abc-001
3
1 Senior abc-002
4
1 Junior abc-003
5
1 Junior abc-004
6
2 Head abc-005
7
2 Senior abc-006
8
2 Junior abc-007
9
3 Head abc-008
10
3 Junior abc-009
11
12
For defining parent relationship below conditions exist:
- Each group MUST has 1 Head.
- It is OPTIONAL to have ONLY 1 Senior in each group.
- Each group MUST have AT LEAST one Junior.
EXPECTED RESULT
JavaScript
1
8
1
ID Type Name Parent
2
1 Senior abc-002 abc-001
3
1 Junior abc-003 abc-002
4
1 Junior abc-004 abc-002
5
2 Senior abc-006 abc-005
6
2 Junior abc-007 abc-006
7
3 Junior abc-009 abc-008
8
Below code works when I have one Junior, I want to know if there is any way to define parent for more than one juniors:
JavaScript
1
6
1
order = ['Head', 'Senior', 'Junior']
2
key = pd.Series({x: i for i,x in enumerate(order)})
3
df2 = df.sort_values(by='Type', key=key.get)
4
df4=df.join(df2.groupby('IP')['Type'].shift().dropna().rename('Parent'),how='right')
5
print(df4)
6
Advertisement
Answer
You could pivot the Type
and Name
columns then forword fill within ID
group. Then take the right-hand two non-NaN entries to get the Parent
and Name
.
Pivot and forward-fill:
JavaScript
1
15
15
1
dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1)
2
.groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']]
3
print(dfn)
4
5
ID Type Head Senior Junior
6
0 1 Head abc-001 NaN NaN
7
1 1 Senior abc-001 abc-002 NaN
8
2 1 Junior abc-001 abc-002 abc-003
9
3 1 Junior abc-001 abc-002 abc-004
10
4 2 Head abc-005 NaN NaN
11
5 2 Senior abc-005 abc-006 NaN
12
6 2 Junior abc-005 abc-006 abc-007
13
7 3 Head abc-008 NaN NaN
14
8 3 Junior abc-008 NaN abc-009
15
A function to pull the last two non-NaN entries:
JavaScript
1
14
14
1
def get_np(x):
2
rc = [np.nan,np.nan]
3
4
if x.isna().sum() != 2:
5
if x.isna().sum() == 0:
6
rc = [x['Junior'],x['Senior']]
7
elif pd.isna(x['Junior']):
8
rc = [x['Senior'],x['Head']]
9
else:
10
rc = [x['Junior'],x['Head']]
11
12
return pd.concat([x[['ID','Type']], pd.Series(rc, index=['Name','Parent'])])
13
14
Apply it and drop the non-applicable rows:
JavaScript
1
10
10
1
dfn.apply(get_np, axis=1).dropna()
2
3
ID Type Name Parent
4
1 1 Senior abc-002 abc-001
5
2 1 Junior abc-003 abc-002
6
3 1 Junior abc-004 abc-002
7
5 2 Senior abc-006 abc-005
8
6 2 Junior abc-007 abc-006
9
8 3 Junior abc-009 abc-008
10