I'm learning BeautifulSoup and I came across a problem: scraping dd tags in HTML. In the picture below, I want to get the parameters that are in the red zone. The problem is that I do not know how to access them. I have tried this:
    kvadratura = float(nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[0])
    jedinica_mere = nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[1].strip()
    ...
But different pages sometimes have different parameters, or the same parameters in a different order, so I can't access them by index. For example:
https://www.nekretnine.rs/stambeni-objekti/stanovi/centar-zmaj-jovina-salonac-id1003/NkmUEzjEFo0/
How can I be sure that I will always scrape the parameter that I want? Each parameter goes into a list afterwards, so if some parameter does not exist, it should add '' to the list.
Answer
In such cases, select each dd by the label next to it instead of by index, since an index may lead you to the wrong dd. With the following approach, all you need to do is replace the text within :contains('') with the label you want, such as Transakcija, Vrsta stana, and so on, to get the matching dd.
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"
    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Find the block whose text contains 'Kategorija:' and take its dd child.
    Kategorija = soup.select_one(".base-inf .dl-horozontal:has(:contains('Kategorija:')) > dd")
    # Fall back to an empty string when the parameter is missing from the page.
    Kategorija = Kategorija.get_text(strip=True) if Kategorija else ""
    print(Kategorija)
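To collect several parameters into a list with '' for the missing ones, the same selector can be wrapped in a small helper. This is only a sketch: `get_param`, `SAMPLE_HTML`, and the sample values are made up for illustration, and the markup is a guess shaped to fit the selector above, so the real page may differ. It uses `:-soup-contains`, the spelling that Soup Sieve 2.1 and later prefer over the deprecated `:contains`, and `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, shaped to match the selector used in the answer;
# the structure of the real page is an assumption.
SAMPLE_HTML = """
<div class="base-inf">
  <dl class="dl-horozontal"><dt>Kategorija:</dt><dd> Stan </dd></dl>
  <dl class="dl-horozontal"><dt>Transakcija:</dt><dd>Prodaja</dd></dl>
</div>
"""


def get_param(soup, label):
    """Return the text of the dd paired with `label`, or '' if it is absent."""
    # The label is pasted into the selector, so it must not contain a single quote.
    selector = f".base-inf .dl-horozontal:has(:-soup-contains('{label}:')) > dd"
    dd = soup.select_one(selector)
    return dd.get_text(strip=True) if dd else ""


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 'Vrsta stana' is deliberately missing from the sample, so it yields ''.
labels = ["Kategorija", "Transakcija", "Vrsta stana"]
row = [get_param(soup, label) for label in labels]
print(row)
```

Because every label is looked up by its text, the order of the parameters on the page does not matter, and a page that lacks a parameter still produces a row of the same length.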