Skip to content
Advertisement

How to extract multiple strings using Regex?

I have a column in a df contains the following values:

>>> import pandas as pd
>>> df = pd.DataFrame({'Sentence':['his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm', 'I have researched the product KEY_abc_def, and KEY_blt_chm as requested', 'He got the idea from your message KEY_mno_pqr']})
>>> df
                                                Sentence
0       This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested
2            He got the idea from your message KEY_mno_pqr

I would like to use regex to extract the KEY into a new column without the actual “KEY_”. For those sentences have more than 1 KEY, they should be joined with a comma. The output should be as below:

>>> df
                                                Sentence                               KEY
0      This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm    abc_def, mno_pqr, blt_chm
1  I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def, blt_chm     
2           He got the idea from your message KEY_mno_pqr                           mno_pqr  

I tried with this code but it is not working. Any suggestions would greatly be appreciated.

The code that I currently have only worked with the first KEY, and ignored the rest. I’m new with regex so any suggestions would be highly appreciated.

df['KEY']= df.sentence.str.extract("KEY_(w+)", expand=True)

Advertisement

Answer

Use

df['KEY']= df.sentence.str.findall("KEY_(w+)").str.join(",")

The Series.str.findall finds all occurrences of the captured substring and str.join(",") joins the results into a comma-separated string value.

Pandas test:

>>> df['KEY']= df['Sentence'].str.findall("KEY_(w+)").str.join(",")
>>> df
                                                                   Sentence                      KEY
0  his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm  abc_def,mno_pqr,blt_chm
1   I have researched the product KEY_abc_def, and KEY_blt_chm as requested          abc_def,blt_chm
2                             He got the idea from your message KEY_mno_pqr                  mno_pqr

(Note in case you did not know that: I used pd.set_option('display.max_colwidth', None) to display all the data in the columns, see How to display full (non-truncated) dataframe information in html when converting from pandas dataframe to html?).

Advertisement