So I have two dataframes
main_df, about 800 rows
| description | category | 
|---|---|
| ABCD | ONE | 
| XYZ | THREE | 
| ABC | |
| QWE | 
keyword_df, it is about 50 rows
| keyword | category | 
|---|---|
| AB | FIVE | 
What I’m trying to achieve =
main_df
| description | category | 
|---|---|
| ABCD | ONE | 
| XYZ | THREE | 
| ABC | FIVE | 
| QWE | 0 | 
conditions = [(main_df[‘Description’].str.startswith(‘AB’)) & (main_df[‘category’).isnull()]
values = keyword_df[‘category’].tolist()
main_df[‘category’] = np.select(conditions, values)
I was able to create a list of strings
[“(main_df[‘Description’].str.startswith(‘AB’)) & (main_df[‘category’).isnull()”,
“(main_df[‘Description’].str.startswith(‘CD’)) & (main_df[‘category’).isnull()”, …]
But it needs to be a list of conditions and not a list of strings for it to work. Appreciate any help!
Advertisement
Answer
Since you only have 50 rows in the keyword frame, you could just iterate over those and update the main frame accordingly:
import numpy as np
import pandas as pd
main_df = pd.DataFrame({'description': ['ABCD', 'XYZ', 'ABC', 'QWE'],
                        'category': ['ONE', 'THREE', np.nan, np.nan]})
keyword_df = pd.DataFrame({'keyword': ['AB'],
                           'category': ['FIVE']}) 
for key in keyword_df.itertuples(index=False):
    mask = (main_df['description'].str.startswith(key[0]) 
            & main_df['category'].isnull())
    main_df.loc[mask, 'category'] = key[1] 
main_df
description category 0 ABCD ONE 1 XYZ THREE 2 ABC FIVE 3 QWE NaN