
Python – trying to get beautifulsoup to find words in a list, but it’s unable to find them

I’m working on my first project that isn’t straight out of a book but I’m having trouble getting a function to work.

The function receives a list of strings and a BeautifulSoup object and attempts to find each word in the soup.text. However, the code seems unable to find any words/strings at all even when I am certain it should be finding them. I checked and confirmed that the function is definitely receiving the list properly and that the URL works and returns what I expect it to when I do something like print(urlSoup).

The relevant code:

def find_words(words_list, urlSoup):
    for word in words_list:
        words_count = 0
        if word.casefold() in urlSoup:
            # ideally it should also count the number of times the word shows up with the 'words_count' bit,
            # but I have an impression that this also won't work how I want it to. 
            words_count += 1
            print("The word " + word + " was found " + str(words_count) + " times in " + url + ".")
        else:
            print("The word '" + word + "' was not found in the URL you provided.")

Things I have tried to fix the fact that the IF statement does not activate (presumably because it doesn’t find any words/strings from the list in the soup.text) include removing the .casefold() bit, changing soup.text to soup.content and changing the IF statement to something like

if urlSoup.find_all(word):

I also changed the parser for BeautifulSoup to lxml but that didn’t work either. At this point I’m a bit stuck and despite looking around a bit on Stack Overflow and in the bs4 documentation I haven’t managed to crack this yet. I’m sure the solution is painfully obvious but as a beginner I’m afraid that I need a bit of help here.

I hope that I have provided enough information, please feel free to ask if you need me to explain further.

Edit with info as per request by chitown88: Here’s an example of a words_list

['running', 'outdoors', 'outdoor', 'shoes', 'clothing', 'delivery']

I used this list with an appropriate website but the urlSoup is a bit large to post here so here’s a Google Drive link if that’s okay. Please let me know if this is not alright and you’d rather I do something else. https://drive.google.com/file/d/1bhLjNLxHOrNvA3BBfm2Qh8qDrk5fLYp7/view?usp=sharing


Answer

Did you use a try/except block? The problem may be the file encoding, because I got this error when reading soup.txt:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 91868:

Also, words_count will always be 0 or 1, because it is reset for every word and incremented at most once. To count how many times the substring occurs, use .count() or a regex:

import re

def find_words(words_list, urlSoup):
    url = 'soup.txt'
    for word in words_list:
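        # re.escape() keeps characters such as '.' or '+' in a word from being read as regex syntax
        words_count = len(re.findall(re.escape(word), urlSoup, re.IGNORECASE)) # remove re.IGNORECASE if you need exact casing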
        # or
        # words_count = urlSoup.count(word) # exact casing
        if words_count > 0:
            print("The word " + word + " was found " + str(words_count) + " times in " + url + ".")
        else:
            print("The word '" + word + "' was not found in the URL you provided.")
 
# add encoding="utf-8" to fix file read           
with open('soup.txt', 'r', encoding="utf-8") as f:
    words_list = ['running', 'outdoors', 'outdoor', 'shoes', 'clothing', 'delivery']
    find_words(words_list, f.read())
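
A note on why the original check never matched: as far as I can tell, `word.casefold() in urlSoup` tests membership against the BeautifulSoup object itself, not against the page text, so it is never a substring search. Passing the text (for example `urlSoup.get_text()`) to the function should avoid that. Below is a minimal sketch along those lines. The name `count_words`, the `whole_word` flag and the sample text are my own assumptions for illustration, not part of the original code.

```python
import re

def count_words(words_list, text, whole_word=True):
    """Return {word: number of case-insensitive matches in text}."""
    counts = {}
    for word in words_list:
        # Escape the word so it is matched literally, not as regex syntax.
        pattern = re.escape(word)
        if whole_word:
            # \b keeps 'outdoor' from also matching inside 'outdoors'.
            pattern = r"\b" + pattern + r"\b"
        counts[word] = len(re.findall(pattern, text, re.IGNORECASE))
    return counts

# Hypothetical sample text; with a real page this would be urlSoup.get_text().
sample_text = "Outdoor shoes and outdoors clothing. Free delivery on running shoes."
print(count_words(["outdoor", "shoes", "delivery", "boots"], sample_text))
```

With `whole_word=False` this behaves like the plain substring count above, so 'outdoor' is also counted inside 'outdoors'. Which of the two you want depends on whether overlapping entries in your list should be counted separately.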
User contributions licensed under: CC BY-SA