- I have a dataframe like:
    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.array([['abc 33 aaa 9g98f 333', 'aaa'],
                                ['cde aaa 95fwf', 'aaa'],
                                ['12 faf bbb 92gcs', 'bbb'],
                                ['faf bbb 7t87f', 'bbb']]),
                      columns=['column1', 'column2'])
The number of words in column1 varies from 2 to 5, so splitting on spaces is not an option.
                    column1 column2
    0  abc 33 aaa 9g98f 333     aaa
    1         cde aaa 95fwf     aaa
    2      12 faf bbb 92gcs     bbb
    3         faf bbb 7t87f     bbb
- The output should look like:
                column1 new_column1 new_column2 column2
    0  abc 33 aaa 9g98f      abc 33   9g98f 333     aaa
    1     cde aaa 95fwf         cde       95fwf     aaa
    2     faf bbb 92gcs         faf       92gcs     bbb
    3  12 faf bbb 7t87f      12 faf       7t87f     bbb
The topic "How to split a dataframe string column into two columns?" didn't help because of the separator.
UPD: The left side may have 2 to 5 words, and so may the right side.
Answer
option 1
Splitting on spaces is an option if each of the last two columns is always a single word. Use rsplit:
    df['column1'].str.rsplit(n=2, expand=True)
output:
            0    1      2
    0  abc 33  aaa  9g98f
    1     cde  aaa  95fwf
    2  12 faf  bbb  92gcs
    3     faf  bbb  7t87f
NB. This doesn't work with the updated example, where the right side can have more than one word.
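To see why, here is a small check on two rows of the updated data. It is only a sketch; the names `parts` and `mismatch` are made up for illustration. With a two-word right side, `rsplit(n=2)` cuts in the wrong place, so the middle piece no longer equals the delimiter in column2.

```python
import pandas as pd

# Two rows from the updated example: row 0 has a two-word right side.
df = pd.DataFrame({
    'column1': ['abc 33 aaa 9g98f 333', 'cde aaa 95fwf'],
    'column2': ['aaa', 'aaa'],
})

# Split on whitespace, at most twice, starting from the right.
parts = df['column1'].str.rsplit(n=2, expand=True)

# If the split were correct, the middle piece would equal column2.
mismatch = parts[1] != df['column2']

print(parts)
print(mismatch.tolist())
```

Row 0 is flagged as a mismatch because its middle piece is `9g98f` rather than `aaa`.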
option 2
Alternatively, to split on the provided delimiter:
    df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a, b in
                                          zip(df['column1'], df['column2'])]
output:
                    column1 column2 new_column1 new_column2
    0  abc 33 aaa 9g98f 333     aaa      abc 33   9g98f 333
    1         cde aaa 95fwf     aaa         cde       95fwf
    2      12 faf bbb 92gcs     bbb      12 faf       92gcs
    3         faf bbb 7t87f     bbb         faf       7t87f
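One caveat with the list comprehension: `str.split` returns one piece when the delimiter is missing and three or more when it occurs several times, so the two-column assignment can fail on such rows. A possible variant, sketched below, uses `str.partition`, which always cuts at the first occurrence. The helper name `split_on_delimiter` and the choice to return `None` for unmatched rows are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'column1': ['abc 33 aaa 9g98f 333', 'cde aaa 95fwf',
                '12 faf bbb 92gcs', 'faf bbb 7t87f'],
    'column2': ['aaa', 'aaa', 'bbb', 'bbb'],
})


def split_on_delimiter(text, delimiter):
    """Hypothetical helper: cut text at the first ' delimiter ' occurrence."""
    left, sep, right = text.partition(f' {delimiter} ')
    if not sep:
        # Delimiter not found as a separate word: leave both sides empty.
        return None, None
    return left, right


pairs = [split_on_delimiter(a, b)
         for a, b in zip(df['column1'], df['column2'])]

# Build the new columns on the same index, then attach them to df.
new_cols = pd.DataFrame(pairs, index=df.index,
                        columns=['new_column1', 'new_column2'])
out = df.join(new_cols)
print(out)
```

On the example data this gives the same two new columns as the list comprehension.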
option 3
Finally, if the same delimiters repeat many times across many rows, it might be worth using vectorized splitting per group:
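    (df
     # group_keys=False keeps the original row index on recent pandas versions
     .groupby('column2', group_keys=False)
     # g.name is the group's delimiter; \s* also consumes the surrounding spaces
     .apply(lambda g: g['column1'].str.split(fr'\s*{g.name}\s*', expand=True))
    )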
output:
            0          1
    0  abc 33  9g98f 333
    1     cde      95fwf
    2  12 faf      92gcs
    3     faf      7t87f
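The per-group result still has to be attached to the original dataframe. The sketch below does this with an explicit loop over the groups instead of `apply`. It makes a few assumptions that go beyond the answer above: the delimiter is escaped with `re.escape` in case it contains regex metacharacters, `\s+` is used so the delimiter only matches as a separate word, and `n=1` limits the split to the first occurrence.

```python
import re

import pandas as pd

df = pd.DataFrame({
    'column1': ['abc 33 aaa 9g98f 333', 'cde aaa 95fwf',
                '12 faf bbb 92gcs', 'faf bbb 7t87f'],
    'column2': ['aaa', 'aaa', 'bbb', 'bbb'],
})

pieces = []
for delimiter, group in df.groupby('column2'):
    # One vectorized regex split per distinct delimiter.
    pattern = rf'\s+{re.escape(delimiter)}\s+'
    pieces.append(group['column1'].str.split(pattern, n=1, expand=True))

# Stack the per-group results and restore the original row order.
parts = pd.concat(pieces).reindex(df.index)
# Assumes every row matched, so there are exactly two columns.
parts.columns = ['new_column1', 'new_column2']

out = df.join(parts)
print(out)
```

The loop runs once per distinct delimiter rather than once per row, which is where the per-group approach can pay off on large frames with few delimiters.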