Apply string in list according to beginning of the strings in a pandas dataframe column

Let’s take an example.

I have a list of known categories:

L_known_categories = ["Orange","Green","Red","Black & White"]

No string in that list is a substring of another string in the list.

And a dataframe:

df = pd.DataFrame({"Items":["green apple","blue bottle","RED APPLE","Green paper","Black & White glasses",
                            "An orange fruit"]})

                   Items
0            green apple
1            blue bottle
2              RED APPLE
3            Green paper
4  Black & White glasses
5        An orange fruit

I would like to add a column Category to this dataframe. If the string in the column Items starts with a string from L_known_categories, regardless of case, the category is that string. If no match is found, the category is the string in the column Items itself.

I could use a for loop, but that is not efficient on my real, much larger dataframe. How could I do this?

Expected output:

                   Items         Category
0            green apple            Green
1            blue bottle      blue bottle
2              RED APPLE              Red
3            Green paper            Green
4  Black & White glasses    Black & White
5        An orange fruit  An orange fruit

Answer

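You can use a regex with pandas.Series.str.extract. The ^ anchor goes outside the group so that it applies to every alternative, not only the first one. Note that str.title() only lines up with categories that are themselves written in title case, as they are here:

>>> df['Category'] = df['Items'].str.title().str.extract(
        '^('
        + '|'.join(L_known_categories)
        + ')'
    )[0].fillna(df['Items'])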

>>> df
    Items                   Category
0   green apple             Green
1   blue bottle             blue bottle
2   RED APPLE               Red
3   Green paper             Green
4   Black & White glasses   Black & White
5   An orange fruit         An orange fruit
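
If the categories might contain regex metacharacters or are not all in title case, a slightly more defensive variant may help. The sketch below is an assumption-laden alternative, not part of the original answer: it escapes each category with re.escape, matches case-insensitively instead of calling str.title(), and maps the matched text back to the spelling used in L_known_categories. The names pattern, prefix and canonical are illustrative only.

```python
import re
import pandas as pd

L_known_categories = ["Orange", "Green", "Red", "Black & White"]
df = pd.DataFrame({"Items": ["green apple", "blue bottle", "RED APPLE", "Green paper",
                             "Black & White glasses", "An orange fruit"]})

# Escape each category so any regex metacharacters are matched literally,
# and anchor the whole alternation at the start of the string.
pattern = "^(" + "|".join(re.escape(c) for c in L_known_categories) + ")"

# Match case-insensitively; expand=False returns a Series rather than a DataFrame.
prefix = df["Items"].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Map the matched text (in whatever case it appeared) back to the canonical spelling.
# Assumption: no two categories differ only by case.
canonical = {c.lower(): c for c in L_known_categories}
df["Category"] = prefix.str.lower().map(canonical).fillna(df["Items"])

print(df)
```

Rows with no matching prefix come out of extract as NaN, so fillna falls back to the original Items value, as in the answer above.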
