- I have dataframe like:
df = pd.DataFrame(np.array([['abc 33 aaa 9g98f 333', 'aaa'], ['cde aaa 95fwf', 'aaa'], ['12 faf bbb 92gcs', 'bbb'], ['faf bbb 7t87f', 'bbb']]), columns=['column1', 'column2'])
len of column1 value may be different – from 2 to 5 words, so split with space not an option.
column1 column2 0 abc 33 aaa 9g98f 333 aaa 1 cde aaa 95fwf aaa 2 12 faf bbb 92gcs bbb 3 faf bbb 7t87f bbb
- Output should be like:
column1 new_column1 new_column2 column2 0 abc 33 aaa 9g98f abc 33 9g98f 333 aaa 1 cde aaa 95fwf cde 95fwf aaa 2 faf bbb 92gcs faf 92gcs bbb 3 12 faf bbb 7t87f 12 faf 7t87f bbb
That topic – How to split a dataframe string column into two columns? – didn’t help coz of separator
UPD. Left “side” may have 2-5 words – and right side too.
Advertisement
Answer
option 1
Splitting on spaces is an option, if you have a single word for the last two columns. Use rsplit
:
df['column1'].str.rsplit(n=2, expand=True)
output:
0 1 2 0 abc 33 aaa 9g98f 1 cde aaa 95fwf 2 12 faf bbb 92gcs 3 faf bbb 7t87f
NB. this doesn’t work with the updated example
option 2
Alternatively, to split on the provided delimiter:
df[['new_column1', 'new_column2']] = [a.split(f' {b} ') for a,b in zip(df['column1'], df['column2'])]
output:
column1 column2 new_column1 new_column2 0 abc 33 aaa 9g98f 333 aaa abc 33 9g98f 333 1 cde aaa 95fwf aaa cde 95fwf 2 12 faf bbb 92gcs bbb 12 faf 92gcs 3 faf bbb 7t87f bbb faf 7t87f
option 3
Finally, if you have many time the same delimiters and many rows, it might be worth using vectorial splitting per group:
(df .groupby('column2') .apply(lambda g: g['column1'].str.split(f's*{g.name}s*', expand=True)) )
output:
0 1 0 abc 33 9g98f 333 1 cde 95fwf 2 12 faf 92gcs 3 faf 7t87f