
Splitting a column based on a delimiter

I would like to extract some information from a column in my dataframe:

Example

Col
7 points  — it is an example ...
13 points  — as above ...
some other text ...
1 point  — "what to say more?"
13 points  — ...
11 points  — 1234 ...

I was using str.contains to extract the first part (i.e., all the information before the first dash, where there is one).

m = (df['Col'].str.contains(r'(?i)^\d+\spoint | points'))
df[m]

I still get the original column back (so nothing is extracted). My expected output would consist of two columns: one with only the points information (Col1) and another one (Col2) with the extracted text.

Col1
7 points  
13 points 
# need to still keep the row, even if empty
1 point 
13 points
11 points

and

Col2       
it is an example ...
as above ...
some other text ...
"what to say more?"
...                                                   
1234 ...

It is important to split only on the first dash, where there is one, since the text might contain more dashes. It seems to be this symbol -, but it may be a longer dash. I copied and pasted from my dataset, but pasted here it looks slightly different.


Answer

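Try using str.extract with a regex. Unlike str.contains, which only returns a boolean mask for filtering rows, str.extract returns the text matched by each capture group.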

Ex:

import pandas as pd

# Group 1: optional "N point(s)" prefix; group 2: the text after the optional dash
df[['Col1', 'Col2']] = df['Col'].str.extract(r"(\d+ points?)?\s*—?\s*(.*)", expand=True)
print(df)

Output:

                                Col       Col1                  Col2
0  7 points  — it is an example ...   7 points  it is an example ...
1         13 points  — as above ...  13 points          as above ...
2               some other text ...        NaN   some other text ...
3    1 point  — "what to say more?"    1 point   "what to say more?"
4                  13 points  — ...  13 points                   ...
5             11 points  — 1234 ...  11 points              1234 ...
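
The question mentions that the dash may not always be the same character and that the text may contain further dashes. A minimal sketch of one way to handle that, assuming the separator is a hyphen, an en dash, or an em dash, and that rows without points should get an empty string in Col1 rather than NaN. The sample data is adapted from the question, with the dash types varied and an extra dash added to the last row for illustration:

```python
import pandas as pd

# Sample rows adapted from the question (dash variants are an assumption).
df = pd.DataFrame({"Col": [
    "7 points  — it is an example ...",
    "13 points  – as above ...",
    "some other text ...",
    '1 point  — "what to say more?"',
    "13 points  - ...",
    "11 points  — 1234 - 5678 ...",
]})

# Group 1: optional "N point(s)" prefix.
# Then at most one separator: hyphen, en dash (U+2013), or em dash (U+2014).
# Group 2: the rest of the line, so any later dashes stay in Col2.
pattern = r"^(\d+ points?)?\s*[-\u2013\u2014]?\s*(.*)$"
df[["Col1", "Col2"]] = df["Col"].str.extract(pattern, expand=True)

# Rows without a points prefix come back as NaN; use "" to keep them as empty.
df["Col1"] = df["Col1"].fillna("")
print(df[["Col1", "Col2"]])
```

Because the points group and the separator are both optional, a row with no points prefix falls through entirely into Col2, which matches the expected output in the question.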
User contributions licensed under: CC BY-SA