Let’s take an example.
I have a list of known categories:
L_known_categories = ["Orange","Green","Red","Black & White"]
No string in that list is a substring of another string in that list.
And a dataframe:
import pandas as pd
df = pd.DataFrame({"Items": ["green apple", "blue bottle", "RED APPLE", "Green paper", "Black & White glasses", "An orange fruit"]})

                   Items
0            green apple
1            blue bottle
2              RED APPLE
3            Green paper
4  Black & White glasses
5        An orange fruit
I would like to add a column Category to this dataframe. If the string in the column Items starts with one of the strings in L_known_categories, regardless of case, the category is that string. If no match is found, the category is the string in the column Items itself.
I could use a for loop, but that is not efficient on my real, much larger dataframe. How could I do this?
Expected output:

                   Items         Category
0            green apple            Green
1            blue bottle      blue bottle
2              RED APPLE              Red
3            Green paper            Green
4  Black & White glasses    Black & White
5        An orange fruit  An orange fruit
Answer
You can use a regex with pandas.Series.str.extract:
>>> # '^(' anchors every alternative at the start of the string, not only the first one
>>> pattern = '^(' + '|'.join(L_known_categories) + ')'
>>> df['Category'] = df['Items'].str.title().str.extract(pattern)[0].fillna(df['Items'])
>>> df
                   Items         Category
0            green apple            Green
1            blue bottle      blue bottle
2              RED APPLE              Red
3            Green paper            Green
4  Black & White glasses    Black & White
5        An orange fruit  An orange fruit
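Note that str.title() only works here because every category in the sample list happens to be spelled in title case. If your real categories are capitalised differently (for example "black & white" or "RGB") or contain regex metacharacters, a variant along these lines may be more robust. It is only a sketch under those assumptions: the names pattern, canonical and matched are illustrative, not part of the original answer. It matches case-insensitively, then maps each match back to the spelling used in L_known_categories.

```python
import re
import pandas as pd

L_known_categories = ["Orange", "Green", "Red", "Black & White"]
df = pd.DataFrame({"Items": ["green apple", "blue bottle", "RED APPLE",
                             "Green paper", "Black & White glasses",
                             "An orange fruit"]})

# Escape each category so characters such as '&', '+' or '.' are taken literally,
# and anchor the whole alternation at the start of the string.
pattern = "^(" + "|".join(re.escape(c) for c in L_known_categories) + ")"

# Hypothetical helper: lower-cased category -> spelling used in the list.
canonical = {c.lower(): c for c in L_known_categories}

# Case-insensitive match; rows without a match yield NaN.
matched = df["Items"].str.extract(pattern, flags=re.IGNORECASE)[0]

# Restore the canonical spelling, then fall back to the original item.
df["Category"] = matched.str.lower().map(canonical).fillna(df["Items"])
```

Like the one-liner, this stays vectorised, so it should avoid the cost of a Python-level loop on a large dataframe.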