Skip to content
Advertisement

python substrn cells in a column dataframe

I have this data frame with this kind of column:

hmtl
<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>

I need to clean this up and leave from “DCG_” up to where "</div>" begins:

DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG

Most of the cells in this column vary where the “DCG_” is located as well as the "</div>", I’m trying to use the following code line for this:

df['html'] = df['html'].str[df['html'.str.find('DCG_':]

but it just returns all null

Advertisement

Answer

Use pd.Series.str.extract, where you specify a regular expression and extract anything in any capture groups in the first match:

>>> df['extracted'] = df['html'].str.extract("(DCG_.*?)</div>")
>>> df.to_dict()

which gives:

{'html': {0: '<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>'},
 'extracted': {0: 'DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG'}}

Regex explanation Try it online:

(DCG_.*?)</div>
(       )        : Capturing group
 DCG_            : Literally DCG_
     .*?         : Zero or more of any character, lazy match
         </div>  : Literally </div>
Advertisement