Python: substring cells in a DataFrame column

Question

I have a data frame with this kind of column: I need to clean it up and keep everything from “DCG_” up to where “” begins. In most cells of this column, the positions of both “DCG_” and “” vary. I’m trying to use the followi…

Accepted Answer

Use pd.Series.str.extract, where you specify a regular expression and extract the contents of any capture groups in the first match:

>>> df['extracted'] = df['html'].str.extract("(DCG_.*?)")
>>> df.to_dict()

which gives:

{'html': {0: '

DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG

'}, 'extracted': {0: 'DCG_QLKNDFALGKFNDGOIQERKNGLÑADKFNGOWQIREG'}}

Regex explanation for (DCG_.*?):

( ) : Capturing group
DCG_ : Literally DCG_
.*? : Zero or more of any character, lazy match
 : Literally
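As a minimal sketch of the lazy-capture approach described above: the text that ends the match is not shown here, so this example assumes it is a closing `</p>` tag, and the cell values are made up for illustration. Placing that literal right after the lazy group is what makes `.*?` stop at its first occurrence.

```python
import pandas as pd

# Hypothetical data: the real cell contents and the closing delimiter are
# not known, so "<p>...</p>" is an assumption used only for illustration.
df = pd.DataFrame(
    {
        "html": [
            "<p>DCG_ABC123</p>",
            "<div><p>xx DCG_Ñ99</p></div>",
            "no match here",
        ]
    }
)

# Capture from "DCG_" up to (but not including) the first "</p>".
# expand=False returns a Series rather than a one-column DataFrame.
df["extracted"] = df["html"].str.extract(r"(DCG_.*?)</p>", expand=False)

print(df["extracted"].tolist())
```

Rows where the pattern does not match come back as NaN. If a cell can contain line breaks between "DCG_" and the delimiter, passing `flags=re.DOTALL` to `str.extract` lets `.` match them as well.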

Answer