Skip to content
Advertisement

Pandas: forcing merge from multiple rows from Excel file into a single row(s) into single lines

I’ve been given a few sets of MS-Excel worksheets with a lot of nested data in areas, and I have researching for a few hours looking for a way to reduce each ‘id’ row to single rows. Specifically merging ‘Step ID’, ‘Install Steps’, and ‘Expected step’ into single lines with some formatting.

Here is shortened simple of the data within the Excel sheets I need to convert.

Name ID Host Step ID Install Step Expected step Extra
Test1 4 Cat 1 Move x to y x is with y x will protest
2 move x away from y x and y are not together y will protest
Test2 5 Dog 1 remove x from tank y is alone
2 Drop duplicate of y, y2 in tank y1 is not alone y1 will protest
3 Drop more duplicates of y into tank, y3 and y4 y1 and y2 will protest
test 3 6 Dog 1 empty tank nothing is in tank

And I am looking to transform this excel sheet into the following

Name ID Host Install Step Expected step Extra
Test1 4 Cat 1 – Move x to y
2 – move x away from y
1 – x is with y
2 – x and y are not together
1 – x will protest
2 – y will protest
Test2 5 Dog 1 – remove x from tank
2 – Drop duplicate of y, y2 in tank
3 – Drop more duplicates of y into tank, y3 and y4
1 – y is alone
2 – y1 is not alone

2 – y1 will protest <br / > y1 and y2 will protest
Test3 6 Dog 1 – empty tank 1 – nothing is in tank

I have testing a few of the other Stackoverflow questions and repsonses for pandas, but the few that closely match my need just fill in the empty areas with duplicate data.

Advertisement

Answer

If you melt() the dataframe:

melt = df.melt(['Name', 'ID', 'Host', 'Step ID']).ffill()

#      Name   ID  Host  Step ID      variable               value
# 0   Test1  4.0   Cat        1  Install Step         Move x to y
# 1   Test1  4.0   Cat        2  Install Step  move x away from y
# ...
# 16  Test2  5.0   Dog        3         Extra     y1 will protest
# 17  Test3  6.0   Dog        1         Extra     y1 will protest

You can combine the Step ID and value columns in one shot:

melt.value = melt['Step ID'].astype(str) + ' - ' + melt.value
melt = melt.drop('Step ID', axis=1)

#      Name   ID  Host      variable                   value
# 0   Test1  4.0   Cat  Install Step         1 - Move x to y
# 1   Test1  4.0   Cat  Install Step  2 - move x away from y
# ...
# 16  Test2  5.0   Dog         Extra     3 - y1 will protest
# 17  Test3  6.0   Dog         Extra     1 - y1 will protest

Then join each group’s value list together with n and unstack() to pivot back to the wide table:

melt.groupby(['Name', 'ID', 'Host', 'variable']).agg('n'.join).unstack()
Name ID Host Expected Step Extra Install Step
0 Test1 4.0 Cat 1 – x is with yn2 – x and y are not together 1 – x will protestn2 – y will protest 1 – Move x to yn2 – move x away from y
1 Test2 5.0 Dog 1 – y is alonen2 – y1 is not alonen3 – y1 an… 1 – y will protestn2 – y1 will protestn3 – y… 1 – remove x from tankn2 – Drop duplicate of …
2 Test3 6.0 Dog 1 – nothing is in tank 1 – y1 will protest 1 – empty tank
User contributions licensed under: CC BY-SA
9 People found this is helpful
Advertisement