I would like to extract some information from a column in my dataframe:
Example
    Col
    7 points — it is an example
    13 points — as above
    some other text
    1 point — "what to say more?"
    13 points —
    11 points — 1234
I was using str.contains to extract the first part (i.e., all the information before the first dash, where there is one).
    m = (df['Col'].str.contains(r'(?i)^\d+\spoint | points'))
    df[m]
I am still getting the original column back (so nothing is extracted). The output I want consists of two columns: one with the points information (Col1) and another (Col2) with the remaining text.
    Col1
    7 points
    13 points
    # need to still keep the row, even if empty
    1 point
    13 points
    11 points
and
    Col2
    it is an example
    as above
    some other text
    "what to say more?"
    # (empty)
    1234
It is important to split only on the first dash, where there is one, since the text might contain more dashes.
The dash seems to be this symbol: -, but it may be a longer dash. I copied and pasted it from my dataset, but here it looks slightly different.
Answer
Try using str.extract with a regex.
Ex:
    import pandas as pd

    # Group 1: optional "<n> point(s)"; group 2: the text after the optional dash.
    df[['Col1', 'Col2']] = df['Col'].str.extract(r"(\d+ points?)?\s*—?\s*(.*)", expand=True)
    print(df)
Output:
       Col                            Col1       Col2
    0  7 points — it is an example    7 points   it is an example
    1  13 points — as above           13 points  as above
    2  some other text                NaN        some other text
    3  1 point — "what to say more?"  1 point    "what to say more?"
    4  13 points —                    13 points
    5  11 points — 1234               11 points  1234
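Since the question mentions that the separator might be a shorter or longer dash, here is a self-contained sketch of the same str.extract idea that accepts a hyphen, an en dash, or an em dash. The set of dash characters, the DASHES and pattern names, and the choice to fill the missing Col1 value with an empty string are assumptions made for illustration, not something the question confirms. Note that str.contains only returns a boolean mask, which is why filtering with it gives back the rows unchanged rather than the extracted text.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Col": [
        "7 points — it is an example",
        "13 points — as above",
        "some other text",
        '1 point — "what to say more?"',
        "13 points —",
        "11 points — 1234",
    ]
})

# Assumption: the separator may be a hyphen, an en dash, or an em dash.
DASHES = "-\u2013\u2014"

# Group 1: optional "<number> point(s)" at the start of the string.
# Then at most one dash, with optional whitespace on either side.
# Group 2: everything that remains, so later dashes stay in the text.
pattern = rf"(?i)^(\d+\s+points?)?\s*[{DASHES}]?\s*(.*)$"

parts = df["Col"].str.extract(pattern)  # DataFrame with columns 0 and 1
df["Col1"] = parts[0].fillna("")        # keep the row, with an empty string
df["Col2"] = parts[1]

print(df[["Col1", "Col2"]])
```

Because the dash is matched at most once and the second group takes the rest of the string, a value such as "3 points - a - b" should split into "3 points" and "a - b". If NaN is preferred for rows without points, drop the fillna call.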