I have this data frame with this kind of column:
hmtl <div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>
I need to clean this up and leave from “DCG_” up to where "</div>"
begins:
DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG
Most of the cells in this column vary where the “DCG_” is located as well as the "</div>"
, I’m trying to use the following code line for this:
df['html'] = df['html'].str[df['html'.str.find('DCG_':]
but it just returns all null
Advertisement
Answer
Use pd.Series.str.extract
, where you specify a regular expression and extract anything in any capture groups in the first match:
>>> df['extracted'] = df['html'].str.extract("(DCG_.*?)</div>") >>> df.to_dict()
which gives:
{'html': {0: '<div data-wrapper="true" style="font-size:9pt;font-family:'Segoe UI','Helvetica Neue',sans-serif;"><div>DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG</div></div>'}, 'extracted': {0: 'DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG'}}
Regex explanation Try it online:
(DCG_.*?)</div> ( ) : Capturing group DCG_ : Literally DCG_ .*? : Zero or more of any character, lazy match </div> : Literally </div>