I want to create a new column in a dataframe
by matching the values in an existing column’s values with a predefined list of values. I have two approaches to this below. Both run but dont give me exactly what I want. I prefer the first approach over the second but not sure where I am going wrong with both. I want the solution to be succinct without having to write out alot of np.where
statements.
First approach:
JavaScript
x
13
13
1
words = [['one man ran','two men ran','three men ran'],['red balloons','white shirt','blue dress']]
2
df3 = pd.DataFrame(words, columns = ['col1','col2','col3'])
3
4
search_words1 = ['three','blue']
5
6
def columns(search_words1):
7
for i in search_words1:
8
return "".join(np.where((df3['col3'].str.contains(i)), i, ""))
9
10
11
df3['col4'] = df3['col3'].apply(lambda x: columns(x))
12
df3
13
incomplete result:
JavaScript
1
5
1
col1 col2 col3 col4
2
0 one man ran two men ran three men ran t
3
1 red balloons white shirt blue dress b
4
5
Second approach:
JavaScript
1
9
1
search_words1 = ['three','blue']
2
3
def my_comments(search_words1):
4
return "".join([i for i in search_words1 if any(i in x for x in df3['col3'])])
5
6
7
df3['col4'] = df3['col3'].apply(lambda x: my_comments(x))
8
df3
9
incomplete result:
JavaScript
1
4
1
col1 col2 col3 col4
2
0 one man ran two men ran three men ran three men ran
3
1 red balloons white shirt blue dress blue dress
4
Desired output for both approaches:
JavaScript
1
4
1
col1 col2 col3 col4
2
0 one man ran two men ran three men ran three
3
1 red balloons white shirt blue dress blue
4
Advertisement
Answer
Use str.extract
: create a regex pattern of your search words and try to extract the matched pattern:
JavaScript
1
3
1
pattern = fr"b({'|'.join(search_words1)})b"
2
df3['col4'] = df3['col3'].str.extract(pattern)
3
Pattern:
JavaScript
1
3
1
>>> print(pattern)
2
b(man|red)b
3
b
matches the empty string, but only at the beginning or end of a word. The ( )
is the capture group.