I would like to do some word filtering: keeping only the items in the ‘keyword’ list that also exist in ‘whitelist’.
Here is my code so far:
whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
keyword_filter = []
for word in whitelist:
    for i in range(len(keyword)):
        if word in keyword[i]:
            keyword_filter.append(word)
        else:
            pass
I want to remove every word except for ‘Cat’, ‘Dog’, and ‘Cow’ (which are in the ‘whitelist’) so that the result (‘keyword_filter’ list) will look like this:
['Cat, Cow', 'Dog', '', 'Cat']
However, the result I actually get looks like this:
['Cat', 'Cat', 'Dog', 'Cow']
I would sincerely appreciate any advice.
Answer
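Your loop iterates over the whitelist and appends a bare word each time it appears in any string, so the grouping by string is lost. Instead, iterate over the strings in keyword, split each one into words, keep only the words that are in the whitelist, and rejoin what is left: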
whitelist = {'Cat', 'Dog', 'Cow'}
filtered = []
for words in keyword:
    filtered.append(', '.join(w for w in words.split(', ') if w in whitelist))
print(filtered)
# ['Cat, Cow', 'Dog', '', 'Cat']
It is better to make whitelist a set, since membership lookups for each word are faster on a set than on a list.
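The same split, filter, and rejoin steps can be wrapped in a small helper. This is only a sketch: the function name filter_keywords is hypothetical and not part of the question, and it assumes the words in each string are separated by exactly ', '.

```python
def filter_keywords(keyword, whitelist):
    """Keep only the whitelisted words in each comma-separated string."""
    allowed = set(whitelist)  # convert once so each lookup is a fast set membership test
    return [
        ', '.join(w for w in words.split(', ') if w in allowed)
        for words in keyword
    ]


keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']
result = filter_keywords(keyword, ['Cat', 'Dog', 'Cow'])
print(result)  # ['Cat, Cow', 'Dog', '', 'Cat']
```

Converting to a set inside the function means the caller can pass a list, as in the question, without paying for a list scan on every word.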
You could also use re.findall to find every part of each string that matches a word in the whitelist, and then rejoin the matches:
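import re

# \s? allows an optional space after the optional comma
pattern = re.compile(r',?\s?Cat|,?\s?Dog|,?\s?Cow')
# lstrip removes the leading ', ' left over when the first match is not the first word
filtered = [''.join(pattern.findall(words)).lstrip(', ') for words in keyword]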
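Note that a pattern like the one above matches substrings, so Cat would also match inside a longer word such as Caterpillar. One possible variant, sketched here under the assumption that whole-word matching is what you want, builds the pattern from the whitelist and anchors it with word boundaries. The variable name filtered_re is only illustrative.

```python
import re

whitelist = ['Cat', 'Dog', 'Cow']
keyword = ['Cat, Cow, Horse', 'Bird, Whale, Dog', 'Pig, Chicken', 'Tiger, Cat']

# re.escape protects any special characters in the whitelist words;
# \b on both sides restricts matches to whole words only
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, whitelist)) + r')\b')

# findall returns just the matched words, so they can be joined with ', ' directly
filtered_re = [', '.join(pattern.findall(words)) for words in keyword]
print(filtered_re)  # ['Cat, Cow', 'Dog', '', 'Cat']
```

Because only the words themselves are captured, there is no leading separator to strip afterwards.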