Skip to content
Advertisement

np.select pandas dataframe based on column of prefix and values

So I have two dataframes

main_df, about 800 rows

description category
ABCD ONE
XYZ THREE
ABC
QWE

keyword_df, it is about 50 rows

keyword category
AB FIVE

What I’m trying to achieve =

main_df

description category
ABCD ONE
XYZ THREE
ABC FIVE
QWE 0

conditions = [(main_df[‘Description’].str.startswith(‘AB’)) & (main_df[‘category’).isnull()]

values = keyword_df[‘category’].tolist()

main_df[‘category’] = np.select(conditions, values)

I was able to create a list of strings

[“(main_df[‘Description’].str.startswith(‘AB’)) & (main_df[‘category’).isnull()”,
“(main_df[‘Description’].str.startswith(‘CD’)) & (main_df[‘category’).isnull()”, …]

But it needs to be a list of conditions and not a list of strings for it to work. Appreciate any help!

Advertisement

Answer

Since you only have 50 rows in the keyword frame, you could just iterate over those and update the main frame accordingly:

import numpy as np
import pandas as pd

main_df = pd.DataFrame({'description': ['ABCD', 'XYZ', 'ABC', 'QWE'],
                        'category': ['ONE', 'THREE', np.nan, np.nan]})
keyword_df = pd.DataFrame({'keyword': ['AB'],
                           'category': ['FIVE']}) 

for key in keyword_df.itertuples(index=False):
    mask = (main_df['description'].str.startswith(key[0]) 
            & main_df['category'].isnull())
    main_df.loc[mask, 'category'] = key[1] 

main_df
    description   category
0   ABCD          ONE
1   XYZ           THREE
2   ABC           FIVE
3   QWE           NaN
User contributions licensed under: CC BY-SA
10 People found this is helpful
Advertisement