I've got a list of strings by scraping a website. I want the code to print the HTML elements from that list IF they contain "L" in them. I've managed to write code that works just fine on a "normal list" that I manually write into the code (example 1 below), but as soon as I try using that code to filter the list of HTML elements it only prints an empty "[]", even though I know there should be multiple values.
Here is the code that works:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')
span_elements = parse.find_all('span')

# a list like this works just fine
lst = ['HOLA', 'BONJOUR', 'HELLO', 'KONNICHIWA', 'SALVE', 'GUTEN DAG']
filtered_list = list(filter(lambda k: 'L' in k, lst))
print(filtered_list)

>>> ['HOLA', 'HELLO', 'SALVE']
But as soon as I use my web scraping list (span_elements) instead of a list of hellos, it prints blank:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')
span_elements = parse.find_all('span')

# a list of HTML elements doesn't work
lst = span_elements
filtered_list = list(filter(lambda k: 'L' in k, lst))
print(filtered_list)

>>> []
I've been trying for hours and got nowhere, help is appreciated! Thank you!
Answer
The elements in span_elements are not strings but bs4 element objects. If you change your filter to convert them to str before using in, the code works:
filtered_list = list(filter(lambda k: 'L' in str(k), lst))
If you want only the text inside the <span>, use .text:
lst = [x.text for x in span_elements]
filtered_list = list(filter(lambda k: 'L' in k, lst))
I have never used bs4 but the clue was in printing the original list:
print(lst)
output:
[<span>KOULURUOKA.FI</span>, <span></span>, <span>Tämä<!-- --> viikko</span>, ...
This is not a list of strings: there isn't a single quote character in sight, so each element is a Tag object, not a str.
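To see the whole thing work end to end without hitting the network, here is a minimal sketch that parses an inline HTML snippet (hypothetical data standing in for the scraped page) and applies the .text fix:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page content
html = "<span>HOLA</span><span>BONJOUR</span><span>HELLO</span>"

soup = BeautifulSoup(html, "html.parser")
span_elements = soup.find_all("span")

# span_elements holds Tag objects; .text pulls out the inner string,
# so the 'L' in k check now compares strings against strings
lst = [x.text for x in span_elements]
filtered_list = list(filter(lambda k: "L" in k, lst))
print(filtered_list)  # ['HOLA', 'HELLO']
```

The same idea applies to the real scrape: convert each Tag to its text first, then filter.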