Skip to content
Advertisement

Split column to multiple columns by another column value (complicated separator)

  1. I have dataframe like:
df = pd.DataFrame(np.array([['abc 33 aaa 9g98f 333', 'aaa'],
                            ['cde aaa 95fwf', 'aaa'],
                            ['12 faf bbb 92gcs', 'bbb'],
                            ['faf bbb 7t87f', 'bbb']]),
                            columns=['column1', 'column2'])

len of column1 value may be different – from 2 to 5 words, so split with space not an option.

            column1     column2
0  abc 33 aaa 9g98f 333   aaa
1     cde aaa 95fwf       aaa
2  12 faf bbb 92gcs       bbb
3     faf bbb 7t87f       bbb

  1. Output should be like:
            column1 new_column1 new_column2      column2
0  abc 33 aaa 9g98f      abc 33       9g98f 333     aaa
1     cde aaa 95fwf         cde       95fwf         aaa
2     faf bbb 92gcs         faf       92gcs         bbb
3  12 faf bbb 7t87f      12 faf       7t87f         bbb

That topic – How to split a dataframe string column into two columns? – didn’t help coz of separator

UPD. Left “side” may have 2-5 words – and right side too.

Advertisement

Answer

option 1

Splitting on spaces is an option, if you have a single word for the last two columns. Use rsplit:

df['column1'].str.rsplit(n=2, expand=True)

output:

        0    1      2
0  abc 33  aaa  9g98f
1     cde  aaa  95fwf
2  12 faf  bbb  92gcs
3     faf  bbb  7t87f

NB. this doesn’t work with the updated example

option 2

Alternatively, to split on the provided delimiter:

df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a,b in
                                      zip(df['column1'], df['column2'])]

output:

                column1 column2 new_column1 new_column2
0  abc 33 aaa 9g98f 333     aaa      abc 33   9g98f 333
1         cde aaa 95fwf     aaa         cde       95fwf
2      12 faf bbb 92gcs     bbb      12 faf       92gcs
3         faf bbb 7t87f     bbb         faf       7t87f

option 3

Finally, if you have many time the same delimiters and many rows, it might be worth using vectorial splitting per group:

(df
 .groupby('column2')
 .apply(lambda g: g['column1'].str.split(f's*{g.name}s*', expand=True)) 
)

output:

        0          1
0  abc 33  9g98f 333
1     cde      95fwf
2  12 faf      92gcs
3     faf      7t87f
User contributions licensed under: CC BY-SA
3 People found this is helpful
Advertisement