Let's take an example.
I have a list of known categories:
L_known_categories = ["Orange","Green","Red","Black & White"]
No string in that list is a substring of another string in the list.
And a dataframe:
import pandas as pd

df = pd.DataFrame({"Items":["green apple","blue bottle","RED APPLE","Green paper","Black & White glasses",
                            "An orange fruit"]})

                   Items
0            green apple
1            blue bottle
2              RED APPLE
3            Green paper
4  Black & White glasses
5        An orange fruit
I would like to add a column Category to this dataframe. If the string in the column Items starts with a string in L_known_categories, regardless of case, the category is that string. If no such string is found, the category is the string in column Items itself.
I could use a for loop, but that is not efficient with my real, large dataframe. How could I do this?
Expected output:
                   Items         Category
0            green apple            Green
1            blue bottle      blue bottle
2              RED APPLE              Red
3            Green paper            Green
4  Black & White glasses    Black & White
5        An orange fruit  An orange fruit
Answer
You can use a regex with pandas.Series.str.extract:
>>> # str.title() normalises case so "green" and "RED" match "Green" and "Red".
>>> # The "^" sits outside the group so that every alternative is anchored
>>> # to the start of the string, not just the first one.
>>> df['Category'] = df['Items'].str.title().str.extract(
...     '^('
...     + '|'.join(L_known_categories)
...     + ')'
... )[0].fillna(df['Items'])

>>> df
                   Items         Category
0            green apple            Green
1            blue bottle      blue bottle
2              RED APPLE              Red
3            Green paper            Green
4  Black & White glasses    Black & White
5        An orange fruit  An orange fruit
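The str.title() approach relies on every category already being title-cased and free of regex metacharacters. If that does not hold for your real data, a possible variant (a sketch only; the names pattern, canonical and matched are mine, not part of the question) escapes each category with re.escape, matches case-insensitively, and then maps the matched text back to the spelling used in L_known_categories:

```python
import re

import pandas as pd

L_known_categories = ["Orange", "Green", "Red", "Black & White"]
df = pd.DataFrame({"Items": ["green apple", "blue bottle", "RED APPLE", "Green paper",
                             "Black & White glasses", "An orange fruit"]})

# Escape each category so any regex metacharacters are matched literally,
# and anchor the whole alternation to the start of the string.
pattern = "^(" + "|".join(re.escape(c) for c in L_known_categories) + ")"

# Lookup from the lowercased match back to the category as spelled in the list.
canonical = {c.lower(): c for c in L_known_categories}

# expand=False returns a Series (one capture group); rows without a match are NaN.
matched = df["Items"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Replace each match by its canonical spelling, then fall back to the item itself.
df["Category"] = matched.str.lower().map(canonical).fillna(df["Items"])

print(df)
```

Because no category is a substring of another, the order of the alternatives in the pattern does not affect which one matches.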