I want to create a new column in a dataframe
by matching the values in an existing column’s values with a predefined list of values. I have two approaches to this below. Both run but dont give me exactly what I want. I prefer the first approach over the second but not sure where I am going wrong with both. I want the solution to be succinct without having to write out alot of np.where
statements.
First approach:
words = [['one man ran','two men ran','three men ran'],['red balloons','white shirt','blue dress']] df3 = pd.DataFrame(words, columns = ['col1','col2','col3']) search_words1 = ['three','blue'] def columns(search_words1): for i in search_words1: return "".join(np.where((df3['col3'].str.contains(i)), i, "")) df3['col4'] = df3['col3'].apply(lambda x: columns(x)) df3
incomplete result:
col1 col2 col3 col4 0 one man ran two men ran three men ran t 1 red balloons white shirt blue dress b
Second approach:
search_words1 = ['three','blue'] def my_comments(search_words1): return "".join([i for i in search_words1 if any(i in x for x in df3['col3'])]) df3['col4'] = df3['col3'].apply(lambda x: my_comments(x)) df3
incomplete result:
col1 col2 col3 col4 0 one man ran two men ran three men ran three men ran 1 red balloons white shirt blue dress blue dress
Desired output for both approaches:
col1 col2 col3 col4 0 one man ran two men ran three men ran three 1 red balloons white shirt blue dress blue
Advertisement
Answer
Use str.extract
: create a regex pattern of your search words and try to extract the matched pattern:
pattern = fr"b({'|'.join(search_words1)})b" df3['col4'] = df3['col3'].str.extract(pattern)
Pattern:
>>> print(pattern) b(man|red)b
b
matches the empty string, but only at the beginning or end of a word. The ( )
is the capture group.