
Trouble when adding values for NaN in DataFrame

I have this DataFrame:

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   NaN             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   NaN             2 owner 0 rust. Cadillac.

And I want to fill the NaN values with a keyword taken from the description. To that end I created a list with the keywords I want:

keyword = ['gmc', 'toyota', 'cadillac']

Finally, I want to loop over each row in the DataFrame, split the contents of the “description” column into words and, if a word is also in the “keyword” list, put it in the “manufacturer” column. As an example, the result would look like this:

    manufacturer    description
0   toyota          toyota, gmc 10 years old.
1   gmc             gmc, Motor runs and drives good.
2   NaN             Motor old, in pieces.
3   cadillac        2 owner 0 rust. Cadillac.

Thanks to someone in this community I could improve my code to this:

import re
keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []
for i, description in enumerate(test3['description']):
    bag_of_words = re.findall(r"""[A-Za-z-]+""", test3["description"][i])
    for word in bag_of_words:
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()

But I realized that the first row also changed values even though it was not NaN:

  manufacturer  description
0   gmc         toyota, gmc 10 years old.
1   gmc         gmc, Motor runs and drives good.
2   NaN         Motor old, in pieces.
3   cadillac    2 owner 0 rust. Cadillac.

I would like to only change the NaN values but when I try to add:

if word.lower() in keyword and test3.loc[i, 'manufacturer'] == np.nan:

It doesn’t have any effect.

Answer

np.nan == np.nan is False. A bit counter-intuitive perhaps =) It means the condition you added can never be true, so with it in place no row should be updated at all. It's not really clear from your question whether you still see the same result as before or no change at all.
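
A quick way to see this for yourself, as a minimal sketch (the names value, equal_to_nan and is_missing are only for illustration):

```python
import numpy as np
import pandas as pd

value = np.nan

# NaN never compares equal to anything, including itself.
equal_to_nan = (value == np.nan)

# pd.isna is a reliable way to test a single value for NaN.
is_missing = pd.isna(value)

print(equal_to_nan, is_missing)  # False True
```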

If you changed

for i, description in enumerate(test3['description']):

to

for i, description in zip(test3.loc[test3['manufacturer'].isna(), :].index, test3.loc[test3['manufacturer'].isna(), 'description']):

then I think it should work fine, because you only iterate over the rows in which ‘manufacturer’ is NaN, so the == np.nan check is no longer needed. Don't replace it with a plain truthiness test, though: np.nan is a non-zero float, so bool(np.nan) is True, just like a non-empty string. If you want to test a single value, use pd.isna(test3.loc[i, 'manufacturer']).
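
Put together, the loop restricted to the NaN rows could look like the sketch below. The test3 frame is rebuilt from the sample in the question so the snippet runs on its own, and the missing mask is a name introduced here for illustration. Like your original loop, it keeps the last matching word in a description.

```python
import re

import numpy as np
import pandas as pd

# Sample data rebuilt from the question so the snippet is self-contained.
test3 = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# Boolean mask of the rows whose manufacturer is missing.
missing = test3['manufacturer'].isna()

# Iterate over (index label, description) pairs of those rows only.
for i, description in test3.loc[missing, 'description'].items():
    for word in re.findall(r"[A-Za-z-]+", description):
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()

print(test3['manufacturer'].tolist())
```

Row 0 keeps 'toyota' because it is never visited.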

There are a lot of ways in which your code could look nicer ;) but focus on learning to debug and the rest will come. And as long as it does what you want it to do, who cares.

One way you could have debugged this would have been to print the truth value of each part of your conditional inside the loop.

print(bool(word.lower() in keyword))
print(bool(test3.loc[i, 'manufacturer'] == np.nan))

Best wishes!

Edit: okay, I should probably add how I would do this myself.

import numpy as np
import pandas as pd
df = pd.DataFrame({'manufacturer': ['toyota', np.nan, np.nan, np.nan],
                   'description': ['toyota, gmc 10 years old.', 'gmc, Motor runs and drives good.', 'Motor old, in pieces.', '2 owner 0 rust. Cadillac.']})
keyword = ['gmc', 'toyota', 'cadillac']
# Keywords that occur (as substrings) in the lower-cased description; take index 0, else NaN.
def first_keyword(s):
    matches = [word for word in keyword if word in s.lower()]
    return matches[0] if matches else np.nan
filler = df['description'].map(first_keyword)
# fillna only touches the rows where 'manufacturer' is NaN.
df['manufacturer'] = df['manufacturer'].fillna(filler)

I'm not sure whether you want the first or the last keyword when several appear in the string. Here I take the first one in the order of the keyword list (index 0), which is not necessarily the one that comes first in the description.
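
If what you want is the keyword that appears first in the description itself, a regex-based variant is one option. This is only a sketch under assumptions that are not part of the code above: it matches whole words (using \b boundaries) rather than substrings, and the names pattern and first_in_text are made up for illustration.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# One alternation with a single capture group: \b(gmc|toyota|cadillac)\b
pattern = r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b'

# str.extract returns the first match in each string (NaN when nothing matches).
first_in_text = (df['description']
                 .str.extract(pattern, flags=re.IGNORECASE, expand=False)
                 .str.lower())

# Only the NaN manufacturers are filled; existing values are left alone.
df['manufacturer'] = df['manufacturer'].fillna(first_in_text)
print(df['manufacturer'].tolist())
```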

User contributions licensed under: CC BY-SA