I have a df that looks like this :
JavaScript
x
7
1
df = pd.DataFrame(data={
2
'id': ["idx1"],
3
'test_A': ["ABC", ],
4
'test_B': ["X"],
5
'test_C': ["ABC / XYZ"],
6
'test_D': ["ABC / JKL / XYZ"]})
7
It’s an example for one row but there are thousands of rows. I want to explode each value where there are multiple values in these four “TEST” columns ie. I want each one to duplicate the row for each one of the “test” that is the same and if there are many (separed by a “/”) i want to duplicate too. I want an “X” if the duplication contains nothing.
Of course there are a lot of other columns, for which the values doesn’t matter for the duplication: it should stay the same. I’m not sure how to explain it, but here’s an example of the output I want for the input I provided:
JavaScript
1
7
1
df = pd.DataFrame(data={
2
'id': ["idx1", "idx1", "idx1"],
3
'test_A': ["ABC", "X", "X"],
4
'test_B': ["X", "X", "X"],
5
'test_C': ["ABC", "XYZ", "X"],
6
'test_D': ["ABC", "XYZ", "JKL"]})
7
Advertisement
Answer
Here is an alternative, using a helper function:
JavaScript
1
13
13
1
def split(df):
2
return (df.apply(lambda c: c.str.split(' / ')) # split cells
3
.apply(lambda x: x.explode().reset_index(drop=True)) # explode
4
.fillna({c: 'X' for c in df.filter(like='test_').columns}) # fill missing test with X
5
.ffill() # fill non-test columns
6
)
7
8
## single row
9
split(df)
10
11
## multiple rows
12
df.groupby('id').apply(split).droplevel(0)
13
output:
JavaScript
1
5
1
id test_A test_B test_C test_D
2
0 idx1 ABC X ABC ABC
3
1 idx1 X X XYZ JKL
4
2 idx1 X X X XYZ
5
output on @jezrael’s better example:
JavaScript
1
8
1
id test_A test_B test_C test_D
2
0 idx1 ABC X ABC ABC
3
1 idx1 X X XYZ JKL
4
2 idx1 X X X XYZ
5
0 idx2 SSD "ABC aa ABC
6
1 idx2 X JKL X JKL
7
2 idx2 X XYZ X X
8