I have this data frame with this kind of column:
JavaScript
x
3
1
hmtl
2
<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>
3
I need to clean this up and leave from “DCG_” up to where "</div>"
begins:
JavaScript
1
2
1
DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG
2
Most of the cells in this column vary where the “DCG_” is located as well as the "</div>"
, I’m trying to use the following code line for this:
JavaScript
1
2
1
df['html'] = df['html'].str[df['html'.str.find('DCG_':]
2
but it just returns all null
Advertisement
Answer
Use pd.Series.str.extract
, where you specify a regular expression and extract anything in any capture groups in the first match:
JavaScript
1
3
1
>>> df['extracted'] = df['html'].str.extract("(DCG_.*?)</div>")
2
>>> df.to_dict()
3
which gives:
JavaScript
1
3
1
{'html': {0: '<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>'},
2
'extracted': {0: 'DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG'}}
3
Regex explanation Try it online:
JavaScript
1
6
1
(DCG_.*?)</div>
2
( ) : Capturing group
3
DCG_ : Literally DCG_
4
.*? : Zero or more of any character, lazy match
5
</div> : Literally </div>
6