I've got a list of strings by scraping a website. I want the code to print the HTML elements from that list IF they contain “L” in them. I've managed to write code that works just fine on a “normal” list that I manually write into the code (example 1 below), but as soon as I try using that code to filter the list of HTML elements it only prints an empty “[]”, even though I know there should be multiple matches.
Here is the code that works:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')
span_elements = parse.find_all('span')
#a list like this works just fine
lst = ['HOLA','BONJOUR','HELLO','KONNICHIWA','SALVE','GUTEN DAG']
filtered_list = list(filter(lambda k: 'L' in k, lst))
print(filtered_list)
>>>['HOLA','HELLO','SALVE']
But as soon as I use my web scraping list (span_elements) instead of a list of hellos, it prints blank:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')
span_elements = parse.find_all('span')
#a list of HTML elements doesn't work
lst = span_elements
filtered_list = list(filter(lambda k: 'L' in k, lst))
print(filtered_list)
>>>[]
I've been trying for hours and got nowhere; help is appreciated! Thank you!
Answer
The elements in span_elements are not strings but bs4 element objects. If you change your filter to convert them to str before using `in`, the code works:
filtered_list = list(filter(lambda k: 'L' in str(k), lst))
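One caveat with the str() approach: str(k) includes the tag markup itself, so an “L” inside a tag name or attribute would also count as a match. A minimal illustration (the class name here is invented for demonstration):

```python
from bs4 import BeautifulSoup

# A span whose visible text has no "L", but whose attribute does
soup = BeautifulSoup('<span class="LABEL">hei</span>', "html.parser")
tag = soup.find("span")

print(str(tag))           # includes the markup: <span class="LABEL">hei</span>
print("L" in str(tag))    # True — matches the attribute, not the text
print("L" in tag.text)    # False — .text is only the visible content
```

So if you only care about the visible text, filtering on .text (below) is the safer choice.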
If you want only the text inside the <span> elements, use .text:
lst = [ x.text for x in span_elements ]
filtered_list = list(filter(lambda k: 'L' in k, lst))
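Putting the two lines together, here is a self-contained sketch that skips the network request by parsing a small made-up HTML snippet (the span contents are invented, not taken from the real page):

```python
from bs4 import BeautifulSoup

html = "<div><span>HOLA</span><span>hei</span><span>HELLO</span></div>"
soup = BeautifulSoup(html, "html.parser")
span_elements = soup.find_all("span")

# .text gives the visible text of each tag as a plain string,
# so the original 'L' in k filter works on it
lst = [x.text for x in span_elements]
filtered_list = list(filter(lambda k: "L" in k, lst))
print(filtered_list)  # ['HOLA', 'HELLO']
```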
I have never used bs4 but the clue was in printing the original list:
print(lst)
output:
[<span>KOULURUOKA.FI</span>, <span></span>, <span>Tämä<!-- --> viikko</span>, ...
this is not a list of strings, not a single ' in sight.
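As an alternative, BeautifulSoup can also do the filtering at search time: find_all accepts a string argument that matches against a tag's own text. A sketch with the same made-up HTML as above (the `s and` guard is needed because the text is None for tags with nested content):

```python
from bs4 import BeautifulSoup

html = "<div><span>HOLA</span><span>hei</span><span>HELLO</span></div>"
soup = BeautifulSoup(html, "html.parser")

# string= filters on the tag's own text while searching
matches = soup.find_all("span", string=lambda s: s and "L" in s)
print([m.text for m in matches])  # ['HOLA', 'HELLO']
```

This returns the matching tag objects rather than plain strings, which is handy if you still need attributes or surrounding structure afterwards.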