I’ve been given a few sets of MS-Excel worksheets with a lot of nested data in areas, and I have researching for a few hours looking for a way to reduce each ‘id’ row to single rows. Specifically merging ‘Step ID’, ‘Install Steps’, and ‘Expected step’ into single lines with some formatting.
Here is shortened simple of the data within the Excel sheets I need to convert.
Name | ID | Host | Step ID | Install Step | Expected step | Extra |
---|---|---|---|---|---|---|
Test1 | 4 | Cat | 1 | Move x to y | x is with y | x will protest |
2 | move x away from y | x and y are not together | y will protest | |||
Test2 | 5 | Dog | 1 | remove x from tank | y is alone | |
2 | Drop duplicate of y, y2 in tank | y1 is not alone | y1 will protest | |||
3 | Drop more duplicates of y into tank, y3 and y4 | y1 and y2 will protest | ||||
test 3 | 6 | Dog | 1 | empty tank | nothing is in tank |
And I am looking to transform this excel sheet into the following
Name | ID | Host | Install Step | Expected step | Extra |
---|---|---|---|---|---|
Test1 | 4 | Cat | 1 – Move x to y 2 – move x away from y |
1 – x is with y 2 – x and y are not together |
1 – x will protest 2 – y will protest |
Test2 | 5 | Dog | 1 – remove x from tank 2 – Drop duplicate of y, y2 in tank 3 – Drop more duplicates of y into tank, y3 and y4 |
1 – y is alone 2 – y1 is not alone |
2 – y1 will protest <br / > y1 and y2 will protest |
Test3 | 6 | Dog | 1 – empty tank | 1 – nothing is in tank |
I have testing a few of the other Stackoverflow questions and repsonses for pandas, but the few that closely match my need just fill in the empty areas with duplicate data.
Advertisement
Answer
If you melt()
the dataframe:
melt = df.melt(['Name', 'ID', 'Host', 'Step ID']).ffill() # Name ID Host Step ID variable value # 0 Test1 4.0 Cat 1 Install Step Move x to y # 1 Test1 4.0 Cat 2 Install Step move x away from y # ... # 16 Test2 5.0 Dog 3 Extra y1 will protest # 17 Test3 6.0 Dog 1 Extra y1 will protest
You can combine the Step ID
and value
columns in one shot:
melt.value = melt['Step ID'].astype(str) + ' - ' + melt.value melt = melt.drop('Step ID', axis=1) # Name ID Host variable value # 0 Test1 4.0 Cat Install Step 1 - Move x to y # 1 Test1 4.0 Cat Install Step 2 - move x away from y # ... # 16 Test2 5.0 Dog Extra 3 - y1 will protest # 17 Test3 6.0 Dog Extra 1 - y1 will protest
Then join each group’s value
list together with n
and unstack()
to pivot back to the wide table:
melt.groupby(['Name', 'ID', 'Host', 'variable']).agg('n'.join).unstack()
Name | ID | Host | Expected Step | Extra | Install Step | |
---|---|---|---|---|---|---|
0 | Test1 | 4.0 | Cat | 1 – x is with yn2 – x and y are not together | 1 – x will protestn2 – y will protest | 1 – Move x to yn2 – move x away from y |
1 | Test2 | 5.0 | Dog | 1 – y is alonen2 – y1 is not alonen3 – y1 an… | 1 – y will protestn2 – y1 will protestn3 – y… | 1 – remove x from tankn2 – Drop duplicate of … |
2 | Test3 | 6.0 | Dog | 1 – nothing is in tank | 1 – y1 will protest | 1 – empty tank |