
HTML problem with tags and classes in a small, simple scrape with BeautifulSoup

I am new to BeautifulSoup and am trying to get it to work. I am having trouble finding the right HTML tags and classes. I am getting closer, but something is still wrong: I am using the wrong tags and classes to scrape the title, time, link, and text of a news item.

I would like to scrape every item in the vertical news list and, for each one, get the date, title, link, and content.

Can you help me with the right HTML classes and tags, please?

I'm not getting any errors, but the Python console stays empty:

>>> 

Code

import requests
from bs4 import BeautifulSoup
    
site = requests.get('url')
beautify = BeautifulSoup(site.content,'html5lib')
    
news = beautify.find_all('div', {'class','$00'})
arti = []
    
for each in news:
  time = each.find('span', {'class','hh serif'}).text
  title = each.find('span', {'class','title'}).text
  link = each.a.get('href')
  r = requests.get(url)
  soup = BeautifulSoup(r.text,'html5lib')
  content = soup.find('div', class_ = "read__content").text.strip()
    
  print(" ")   
  print(time)
  print(title)
  print(link)
  print(" ") 
  print(content)
  print(" ") 


Answer

Here is a solution you can try:

import requests
from bs4 import BeautifulSoup

# mock browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
}
site = requests.get('https://www.tuttomercatoweb.com/atalanta/', headers=headers)
soup = BeautifulSoup(site.content, 'html.parser')

news = soup.find_all('div', attrs={"class": "tcc-list-news"})

for each in news:
    for div in each.find_all("div"):
        print("-- Time ", div.find('span', attrs={'class': 'hh serif'}).text)
        print("-- Href ", div.find("a")['href'])
        print("-- Text ", " ".join([span.text for span in div.select("a > span")]))
        # separator between items, as shown in the output below
        print("-" * 30)

-- Time  11:36
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661241
-- Text  focus Serie A, punti nel 2022: Juve prima, ma un solo punto in più rispetto a Milan e Napoli
------------------------------
-- Time  11:24
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661233
-- Text  focus Serie A, chi più in forma? Le ultime 5 gare: Sassuolo e Juve in vetta, crisi Venezia
------------------------------
-- Time  11:15
-- Href  https://www.tuttomercatoweb.com/atalanta/?action=read&idtmw=1661229
-- Text  Le pagelle di Cissé: come nelle migliori favole. Dalla seconda categoria al gol in serie A
------------------------------
...
...
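
The loop above calls `.text` directly on the result of `find`, which returns `None` when nothing matches, so a single row without a time span or a link would stop the script with an error. Below is a minimal sketch of a more defensive variant that collects the rows into a list instead of printing them. The names `extract_items` and `SAMPLE_HTML` are hypothetical, and the sample markup is invented from the class names used in the answer; it is not copied from the real site, whose structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modeled only on the class names used in the answer.
SAMPLE_HTML = """
<div class="tcc-list-news">
  <div><span class="hh serif">11:36</span>
       <a href="https://example.com/a"><span>focus</span><span>First headline</span></a></div>
  <div><a href="https://example.com/b"><span>Row without a time</span></a></div>
</div>
"""


def extract_items(html):
    """Return a list of dicts with 'time', 'href' and 'text', skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", attrs={"class": "tcc-list-news"}):
        for div in block.find_all("div"):
            time_tag = div.find("span", attrs={"class": "hh serif"})
            link_tag = div.find("a")
            # find() returns None when there is no match, so check before using it
            if time_tag is None or link_tag is None:
                continue
            text = " ".join(span.get_text(strip=True) for span in div.select("a > span"))
            items.append({
                "time": time_tag.get_text(strip=True),
                "href": link_tag.get("href"),
                "text": text,
            })
    return items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item["time"], item["href"], item["text"])
```

With the real page, `site.content` from the request above would be passed in place of `SAMPLE_HTML`. Returning a list also makes it easier to follow each `href` afterwards and fetch the article body, which the question asked for.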

EDIT:

Why are headers required here? See: How to use Python requests to fake a browser visit a.k.a and generate User Agent?
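
As a rough illustration of what the `headers` argument changes: without it, `requests` identifies itself with its own default User-Agent, which some sites treat differently from a browser. Whether this particular site does so is an assumption here, not something verified. The sketch below only prepares a request and inspects it; nothing is sent over the network.

```python
import requests

# The User-Agent that requests sends when no headers are supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Browser-like User-Agent, same value as in the answer above.
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}

# Build the request without sending it, to see which User-Agent would go out.
prepared = requests.Request(
    "GET", "https://www.tuttomercatoweb.com/atalanta/", headers=headers
).prepare()
print(prepared.headers["User-Agent"])
```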

User contributions licensed under: CC BY-SA