HTML problem with tags and classes in a small, simple scraping script with BeautifulSoup

Question

I am new to BeautifulSoup and am trying to get it to work. I am having trouble picking out the right HTML tags and classes. I am getting closer, but something is still wrong: I am using the wrong tags and classes to scrape the title, time, link, and text of a news item. I would like to scrape all those titles in the vertical list, then

Accepted Answer

Here is a solution you can try:

    import requests
    from bs4 import BeautifulSoup

    # mock browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    }
    site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
    soup = BeautifulSoup(site.content, 'html.parser')

    # container that holds the vertical list of news items
    news = soup.find_all('div', attrs={"class": "tcc-list-news"})

    for each in news:
        for div in each.find_all("div"):
            print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
            print("-- Href ", div.find("a")['href'])
            print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
            print("-" * 30)  # separator line, as seen in the output below

Output:

    -- Time  11:36
    -- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
    -- Text  focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
    ------------------------------
    -- Time  11:24
    -- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
    -- Text  focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
    ------------------------------
    -- Time  11:15
    -- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
    -- Text  Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
    ------------------------------
    ......

EDIT: Why are the headers required here? See: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
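One thing to watch for in the loop above: div.find(...) returns None when a div has no matching span or link, and calling .text or ['href'] on None raises an error. The following is a minimal sketch of a more defensive version, not the answer author's code. It runs offline against SAMPLE_HTML, a made-up snippet that only imitates the tags and classes used in the answer (the real page's markup may differ), and parse_news is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the selectors used in the answer.
# The real page's HTML may be structured differently.
SAMPLE_HTML = """
<div class="tcc-list-news">
  <div>
    <span class="hh serif">11:36</span>
    <a href="https://example.com/news/1"><span>focus</span> <span>First headline</span></a>
  </div>
  <div>
    <a href="https://example.com/news/2"><span>Item without a time</span></a>
  </div>
  <div>
    <span class="hh serif">11:24</span>
    <a href="https://example.com/news/3"><span>Second headline</span></a>
  </div>
</div>
"""


def parse_news(html):
    """Return a list of dicts with the time, href and text of each news item."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="tcc-list-news"):
        for div in block.find_all("div"):
            time_tag = div.find("span", class_="hh")
            link_tag = div.find("a", href=True)
            # Skip divs that lack a time or a link instead of crashing on None.
            if time_tag is None or link_tag is None:
                continue
            # Join the text of the spans that are direct children of the link.
            text = " ".join(
                span.get_text(strip=True)
                for span in link_tag.find_all("span", recursive=False)
            )
            items.append(
                {
                    "time": time_tag.get_text(strip=True),
                    "href": link_tag["href"],
                    "text": text,
                }
            )
    return items


for item in parse_news(SAMPLE_HTML):
    print(item["time"], item["href"], item["text"])
```

To use it on the live page, pass site.content from the requests call in the answer to parse_news instead of SAMPLE_HTML. Returning a list of dicts rather than printing inside the loop also makes it easier to save the results later.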

Answer