I have this DataFrame:
  manufacturer                       description
0       toyota         toyota, gmc 10 years old.
1          NaN  gmc, Motor runs and drives good.
2          NaN             Motor old, in pieces.
3          NaN         2 owner 0 rust. Cadillac.
And I want to fill the NaN values with a keyword taken from the description. To that end I created a list with the keywords I want:
keyword = ['gmc', 'toyota', 'cadillac']
Finally, I want to loop over each row in the DataFrame, split the contents of the “description” column into words and, if a word is also in the “keyword” list, write it to the “manufacturer” column. As an example, the result would look like this:
  manufacturer                       description
0       toyota         toyota, gmc 10 years old.
1          gmc  gmc, Motor runs and drives good.
2          NaN             Motor old, in pieces.
3     cadillac         2 owner 0 rust. Cadillac.
Thanks to someone in this community I could improve my code to this:
import re

keyword = ['gmc', 'toyota', 'cadillac']
bag_of_words = []
for i, description in enumerate(test3['description']):
    bag_of_words = re.findall(r"""[A-Za-z-]+""", test3["description"][i])
    for word in bag_of_words:
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()
But I realized that the first row also changed values even though it was not NaN:
  manufacturer                       description
0          gmc         toyota, gmc 10 years old.
1          gmc  gmc, Motor runs and drives good.
2          NaN             Motor old, in pieces.
3     cadillac         2 owner 0 rust. Cadillac.
I would like to only change the NaN values but when I try to add:
if word.lower() in keyword and test3.loc[i, 'manufacturer'] == np.nan:
It doesn’t have any effect.
Answer
np.nan == np.nan is False. A bit counter-intuitive perhaps =) That means the last conditional can never be true, so with it in place no assignment should happen at all. It is not really clear from your question whether you still see the same result as before or no change at all.
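To make the point concrete, here is a minimal sketch of how NaN behaves in comparisons. The variable name `value` is only for illustration; `pd.isna` is the check pandas provides for missing values.

```python
import numpy as np
import pandas as pd

value = np.nan

# Equality against NaN is always False, even NaN against itself.
print(value == np.nan)

# pd.isna is the reliable test for a missing value.
print(pd.isna(value))

# NaN is a non-zero float, so it is truthy; `if value:` cannot detect it.
print(bool(value))
```

So inside a loop, a condition such as `pd.isna(test3.loc[i, 'manufacturer'])` would behave the way the `== np.nan` comparison was meant to.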
If you changed
for i, description in enumerate(test3['description']):
to
for i, description in zip(test3.loc[test3['manufacturer'].isna(), :].index, test3.loc[test3['manufacturer'].isna(), 'description']):
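then I think it should work fine, because the loop would only visit the rows in which ‘manufacturer’ is NaN. With that filter in place you should drop the == np.nan part of your condition entirely: it is always False, so keeping it would block every assignment. If you would rather keep a check inside the loop, use pd.isna(test3.loc[i, 'manufacturer']). Relying on truthiness does not work here, because bool(np.nan) is True, just like a non-empty string.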
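Put together with the sample data from the question, the filtered loop might look like the sketch below. The `missing` mask and the `break` are my additions: the `break` assumes you want the first keyword found in a description, whereas the loop in the question keeps overwriting and ends up with the last one.

```python
import re

import numpy as np
import pandas as pd

# Sample data copied from the question.
test3 = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# Boolean mask of the rows whose manufacturer is missing.
missing = test3['manufacturer'].isna()

# Only the masked rows are visited, so existing values are never overwritten.
for i, description in test3.loc[missing, 'description'].items():
    for word in re.findall(r"[A-Za-z-]+", description):
        if word.lower() in keyword:
            test3.loc[i, 'manufacturer'] = word.lower()
            break  # assumption: stop at the first keyword found

print(test3)
```

Row 0 keeps its original value because it never enters the loop, and row 2 stays NaN because its description contains none of the keywords.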
There are a lot of ways in which your code could look nicer ;) but focus on learning to debug and the rest will come. And as long as it does what you want it to do, who cares.
One way you could have debugged this would have been to print the truth value of each part of your conditional inside the loop.
print(bool(word.lower() in keyword))
print(bool(test3.loc[i, 'manufacturer'] == np.nan))
Best wishes!
Edit: okay, I should probably add how I would do this myself.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

def first_keyword(s):
    # Keywords that occur as substrings of the lowercased description.
    matches = [word for word in keyword if word in s.lower()]
    return matches[0] if matches else np.nan

filler = df['description'].map(first_keyword)
# fillna only touches the rows where manufacturer is NaN.
df['manufacturer'] = df['manufacturer'].fillna(filler)
Not sure if you want the last or first keyword when several appear in the string tho. I set it to the first one here using index 0, where "first" means first in the order of the keyword list, not first in the description.
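If you would rather take the keyword that appears first in the description itself, one alternative (not what the code above does) is a regex with `Series.str.extract`. This is only a sketch: the names `pattern` and `found` are mine, and unlike the substring test above it matches whole words only.

```python
import re

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'manufacturer': ['toyota', np.nan, np.nan, np.nan],
    'description': ['toyota, gmc 10 years old.',
                    'gmc, Motor runs and drives good.',
                    'Motor old, in pieces.',
                    '2 owner 0 rust. Cadillac.'],
})
keyword = ['gmc', 'toyota', 'cadillac']

# One alternation pattern with a single capture group: \b(gmc|toyota|cadillac)\b
pattern = r'\b(' + '|'.join(map(re.escape, keyword)) + r')\b'

# expand=False returns a Series with the leftmost match per row (NaN if none).
found = df['description'].str.extract(pattern, flags=re.IGNORECASE, expand=False)

# Lowercase the matches and use them only where manufacturer is missing.
df['manufacturer'] = df['manufacturer'].fillna(found.str.lower())

print(df)
```

Because the regex scans each description from left to right, the order of the keyword list no longer decides which match wins.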