I have a df that looks like this :
df = pd.DataFrame(data={ 'id': ["idx1"], 'test_A': ["ABC", ], 'test_B': ["X"], 'test_C': ["ABC / XYZ"], 'test_D': ["ABC / JKL / XYZ"]})
It’s an example for one row but there are thousands of rows. I want to explode each value where there are multiple values in these four “TEST” columns ie. I want each one to duplicate the row for each one of the “test” that is the same and if there are many (separed by a “/”) i want to duplicate too. I want an “X” if the duplication contains nothing.
Of course there are a lot of other columns, for which the values doesn’t matter for the duplication: it should stay the same. I’m not sure how to explain it, but here’s an example of the output I want for the input I provided:
df = pd.DataFrame(data={ 'id': ["idx1", "idx1", "idx1"], 'test_A': ["ABC", "X", "X"], 'test_B': ["X", "X", "X"], 'test_C': ["ABC", "XYZ", "X"], 'test_D': ["ABC", "XYZ", "JKL"]})
Advertisement
Answer
Here is an alternative, using a helper function:
def split(df): return (df.apply(lambda c: c.str.split(' / ')) # split cells .apply(lambda x: x.explode().reset_index(drop=True)) # explode .fillna({c: 'X' for c in df.filter(like='test_').columns}) # fill missing test with X .ffill() # fill non-test columns ) ## single row split(df) ## multiple rows df.groupby('id').apply(split).droplevel(0)
output:
id test_A test_B test_C test_D 0 idx1 ABC X ABC ABC 1 idx1 X X XYZ JKL 2 idx1 X X X XYZ
output on @jezrael’s better example:
id test_A test_B test_C test_D 0 idx1 ABC X ABC ABC 1 idx1 X X XYZ JKL 2 idx1 X X X XYZ 0 idx2 SSD "ABC aa ABC 1 idx2 X JKL X JKL 2 idx2 X XYZ X X