I would like to extract some information from a column in my dataframe:
Example
Col
7 points — it is an example ...
13 points — as above ...
some other text ...
1 point — "what to say more?"
13 points — ...
11 points — 1234 ...
I was using str.contains to extract the first part (i.e., all the information before the first dash, where there is one):
m = df['Col'].str.contains(r'(?i)^\d+\spoint | points')
df[m]
I am still getting the original column back (so nothing is extracted). My expected output would consist of two columns: one with the points information (Col1) and another (Col2) with the remaining text.
Col1
7 points
13 points
# need to still keep the row, even if empty
1 point
13 points
11 points
and
Col2
it is an example ...
as above ...
some other text ...
"what to say more?"
...
1234 ...
It is important to split only on the first dash, where there is one, since the text itself might contain more dashes.
It seems to be this symbol (-), but it may be a longer dash. I copied and pasted it from my dataset, but it looks slightly different here.
Answer
Try using str.extract with a regex.
Ex:
import pandas as pd

# Group 1: optional "<number> point(s)"; group 2: everything after the optional dash
df[['Col1', 'Col2']] = df['Col'].str.extract(r"(\d+ points?)?\s*—?\s*(.*)", expand=True)
print(df)
Output:
                               Col       Col1                  Col2
0  7 points — it is an example ...   7 points  it is an example ...
1         13 points — as above ...  13 points          as above ...
2              some other text ...        NaN   some other text ...
3    1 point — "what to say more?"    1 point   "what to say more?"
4                  13 points — ...  13 points                   ...
5             11 points — 1234 ...  11 points              1234 ...
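Since the question mentions that the dash might be a hyphen or a longer dash, and that the text may contain further dashes, here is a minimal sketch of a slightly more tolerant variant. The sample rows with an en dash, a plain hyphen, and a second dash inside the text are assumptions added for illustration, not taken from the original dataset. It also shows why the str.contains attempt returned the column unchanged: it only produces a boolean mask for filtering rows.

```python
import pandas as pd

# Sample data modelled on the question; the dash variants are assumed.
df = pd.DataFrame({
    "Col": [
        "7 points — it is an example ...",
        "13 points – as above ...",        # en dash (assumed variant)
        "some other text ...",
        '1 point - "what to say more?"',   # plain hyphen (assumed variant)
        "13 points — ...",
        "11 points — 1234 - 5678 ...",     # extra dash inside the text (assumed)
    ]
})

# str.contains returns True/False per row, so df[mask] can only filter rows;
# it never splits the text into parts.
mask = df["Col"].str.contains(r"(?i)^\d+\s+points?\b")

# Optional prefix: "<number> point(s)" followed by one hyphen, en dash, or em dash.
# The prefix is anchored at the start, so later dashes stay in the second group.
pattern = r"(?i)^(?:(\d+\s+points?)\s*[-\u2013\u2014]\s*)?(.*)$"
df[["Col1", "Col2"]] = df["Col"].str.extract(pattern, expand=True)

# Rows without a points prefix get NaN in Col1; replace it with an empty string
# so the row is kept but shown as empty.
df["Col1"] = df["Col1"].fillna("")

print(df[["Col1", "Col2"]])
```

If the dataset turns out to use yet another dash character, it can be added to the character class in the pattern.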