
How do I filter HTML elements in Python?

I've got a list of strings from scraping a website. I want the code to print the HTML elements from that list if they contain "L". I've managed to write code that works fine on a "normal" list that I type into the code by hand (first example below), but as soon as I use that code to filter the list of HTML elements it prints only an empty "[]", even though I know there should be multiple matches.

Here is the code that works:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')

span_elements = parse.find_all('span')

#a list like this works just fine
lst = ['HOLA','BONJOUR','HELLO','KONNICHIWA','SALVE','GUTEN DAG']

filtered_list = list(filter(lambda k: 'L' in k, lst))

print(filtered_list)

>>>['HOLA','HELLO','SALVE']

But as soon as I use my web scraping list (span_elements) instead of a list of hellos, it prints an empty list:

import urllib.request
from bs4 import BeautifulSoup

url = 'https://kouluruoka.fi/menu/kouvola_koulujenruokalista'
request = urllib.request.Request(url)
content = urllib.request.urlopen(request)
parse = BeautifulSoup(content, 'html.parser')

span_elements = parse.find_all('span')

# a list of HTML elements doesn't work
lst = span_elements

filtered_list = list(filter(lambda k: 'L' in k, lst))

print(filtered_list)

>>>[]

I've been trying for hours and got nowhere. Help is appreciated, thank you!


Answer

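The elements in lst are not strings but bs4 element (Tag) objects. For a Tag, 'L' in k checks whether 'L' is one of the tag's child nodes, not whether it is a substring of the tag's text, so the test is False for every span. If you change your filter to convert each element to str before using in, the code works: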

filtered_list = list(filter(lambda k: 'L' in str(k), lst))

If you want only the text inside the <span>, use .text:

lst = [ x.text for x in span_elements ]
filtered_list = list(filter(lambda k: 'L' in k, lst))
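To see the difference without hitting the website, here is a minimal sketch that runs the three variants against a small inline HTML snippet. The snippet (including the "Lounas" span) is made up for illustration and is not the real page content; get_text() is used as the method form of .text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the scraped page.
html = (
    "<span>KOULURUOKA.FI</span>"
    "<span></span>"
    "<span>Tämä viikko</span>"
    "<span>Lounas</span>"
)
soup = BeautifulSoup(html, "html.parser")
span_elements = soup.find_all("span")

# `in` on a Tag looks at its child nodes, not at substrings of its text,
# so this matches nothing here.
direct = [s for s in span_elements if "L" in s]

# Filter on the text inside each span.
by_text = [s.get_text() for s in span_elements if "L" in s.get_text()]

# Filter on the full markup (tag names and attributes included).
by_markup = [s for s in span_elements if "L" in str(s)]

print(direct)
print(by_text)
print(by_markup)
```

Note that the str() variant also searches the markup itself, so a span with something like class="Lunch" would match even if its text has no "L". If only the visible text matters, filtering on the text is probably the safer choice.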

I have never used bs4, but the clue was in printing the original list:

print(lst)

output:

[<span>KOULURUOKA.FI</span>, <span></span>, <span>Tämä<!-- --> viikko</span>, ...

This is not a list of strings: there are no quote characters (') around the items.
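Besides reading the printed output, you can check the type directly. A small sketch, again using made-up inline HTML rather than the real page:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the real list comes from parse.find_all('span').
soup = BeautifulSoup("<span>KOULURUOKA.FI</span><span></span>", "html.parser")
lst = soup.find_all("span")

first = lst[0]
type_name = type(first).__name__   # class name of the first item
is_string = isinstance(first, str)
is_tag = isinstance(first, Tag)
inner_text = first.get_text()      # a plain string with the span's text

print(type_name, is_string, is_tag)
print(repr(inner_text))
```

The repr() of the extracted text shows the quotes that were missing from the printed list, which confirms it is a real string at that point.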

User contributions licensed under: CC BY-SA