I’m working on my first project that isn’t straight out of a book but I’m having trouble getting a function to work.
The function receives a list of strings and a BeautifulSoup object and attempts to find each word in the soup.text. However, the code seems unable to find any words/strings at all even when I am certain it should be finding them. I checked and confirmed that the function is definitely receiving the list properly and that the URL works and returns what I expect it to when I do something like print(urlSoup).
The relevant code:
def find_words(words_list, urlSoup):
    for word in words_list:
        words_count = 0
        if word.casefold() in urlSoup:
            # ideally it should also count the number of times the word shows up with the 'words_count' bit,
            # but I have an impression that this also won't work how I want it to.
            words_count += 1
            print("The word " + word + " was found " + str(words_count) + " times in " + url + ".")
        else:
            print("The word '" + word + "' was not found in the URL you provided.")
Things I have tried to fix the fact that the IF statement does not activate (presumably because it doesn’t find any words/strings from the list in the soup.text) include removing the .casefold() bit, changing soup.text to soup.content, and changing the IF statement to something like
if urlSoup.find_all(word):
I also changed the parser for BeautifulSoup to lxml, but that didn’t work either. At this point I’m a bit stuck and despite looking around a bit on Stack Overflow and in the bs4 documentation I haven’t managed to crack this yet. I’m sure the solution is painfully obvious but as a beginner I’m afraid that I need a bit of help here.
I hope that I have provided enough information, please feel free to ask if you need me to explain further.
Edit with info as per request by chitown88: Here’s an example of a words_list
['running', 'outdoors', 'outdoor', 'shoes', 'clothing', 'delivery']
I used this list with an appropriate website but the urlSoup is a bit large to post here so here’s a Google Drive link if that’s okay. Please let me know if this is not alright and you’d rather I do something else. https://drive.google.com/file/d/1bhLjNLxHOrNvA3BBfm2Qh8qDrk5fLYp7/view?usp=sharing
Answer
Did you use a try/except block? The problem may be with the file encoding, because I got this error when reading soup.txt:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 91868:
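An error like this typically appears when UTF-8 bytes are decoded with a legacy single-byte codec in which byte 0x81 has no mapping. The following is only a minimal sketch, assuming the default codec in play was cp1252 (common on Windows, though nothing here confirms it); `try_decode` and the sample string are hypothetical and used purely for illustration.

```python
# Hypothetical sample text: "Á" encodes to the two bytes C3 81 in UTF-8,
# so the encoded data contains the byte 0x81 from the error message.
raw = "Ágil running shoes".encode("utf-8")


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid for that codec."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_cp1252 = try_decode(raw, "cp1252")  # fails: 0x81 is unassigned in cp1252
as_utf8 = try_decode(raw, "utf-8")     # succeeds: the bytes are valid UTF-8

print(as_cp1252)
print(as_utf8)
```

Passing the encoding explicitly to `open()` avoids depending on whatever default the operating system picks.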
Also, words_count will always be 0 or 1; you need to use .count() or a regex to count how many times the substring occurs:
import re

def find_words(words_list, urlSoup):
    url = 'soup.txt'
    for word in words_list:
        words_count = len(re.findall(word, urlSoup, re.IGNORECASE))
        # remove re.IGNORECASE if you need exact casing
        # or
        # words_count = urlSoup.count(word)  # exact casing
        if words_count > 0:
            print("The word " + word + " was found " + str(words_count) + " times in " + url + ".")
        else:
            print("The word '" + word + "' was not found in the URL you provided.")

# add encoding="utf-8" to fix file read
with open('soup.txt', 'r', encoding="utf-8") as f:
    words_list = ['running', 'outdoors', 'outdoor', 'shoes', 'clothing', 'delivery']
    find_words(words_list, f.read())
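As far as I can tell, the original check `word.casefold() in urlSoup` tests membership against the soup object itself rather than its text, so searching a plain string is likely the safer route. Here is a rough sketch of that idea, with some assumptions: `count_words` is a hypothetical helper, the sample string stands in for what `urlSoup.get_text()` would return, and whole-word matching is a design choice that goes beyond the code above.

```python
import re


def count_words(words_list, text):
    """Count case-insensitive whole-word occurrences of each word in text."""
    counts = {}
    for word in words_list:
        # re.escape guards against words containing regex metacharacters;
        # \b keeps "outdoor" from also matching inside "outdoors".
        pattern = r"\b" + re.escape(word) + r"\b"
        counts[word] = len(re.findall(pattern, text, re.IGNORECASE))
    return counts


# Stand-in for urlSoup.get_text(); with a real soup, pass that string instead.
sample_text = "Running shoes for the outdoors. Outdoor clothing and more shoes."
counts = count_words(
    ["running", "outdoors", "outdoor", "shoes", "delivery"], sample_text
)
print(counts)
```

Without the `\b` anchors, a case-insensitive search for "outdoor" would count two matches in this sample, because the word also occurs inside "outdoors". Drop the anchors if substring matches are what you want.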