I have a column in a df contains the following values:
>>> import pandas as pd >>> df = pd.DataFrame({'Sentence':['his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm', 'I have researched the product KEY_abc_def, and KEY_blt_chm as requested', 'He got the idea from your message KEY_mno_pqr']}) >>> df Sentence 0 This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm 1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested 2 He got the idea from your message KEY_mno_pqr
I would like to use regex to extract the KEY into a new column without the actual “KEY_”. For those sentences have more than 1 KEY, they should be joined with a comma. The output should be as below:
>>> df Sentence KEY 0 This is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm abc_def, mno_pqr, blt_chm 1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested abc_def, blt_chm 2 He got the idea from your message KEY_mno_pqr mno_pqr
I tried with this code but it is not working. Any suggestions would greatly be appreciated.
The code that I currently have only worked with the first KEY, and ignored the rest. I’m new with regex so any suggestions would be highly appreciated.
df['KEY']= df.sentence.str.extract("KEY_(w+)", expand=True)
Advertisement
Answer
Use
df['KEY']= df.sentence.str.findall("KEY_(w+)").str.join(",")
The Series.str.findall
finds all occurrences of the captured substring and str.join(",")
joins the results into a comma-separated string value.
Pandas test:
>>> df['KEY']= df['Sentence'].str.findall("KEY_(w+)").str.join(",") >>> df Sentence KEY 0 his is the results of my experiments KEY_abc_def KEY_mno_pqr KEY_blt_chm abc_def,mno_pqr,blt_chm 1 I have researched the product KEY_abc_def, and KEY_blt_chm as requested abc_def,blt_chm 2 He got the idea from your message KEY_mno_pqr mno_pqr
(Note in case you did not know that: I used pd.set_option('display.max_colwidth', None)
to display all the data in the columns, see How to display full (non-truncated) dataframe information in html when converting from pandas dataframe to html?).