
Scraping specific ‘dd’ tags with BeautifulSoup and Python

I'm learning BeautifulSoup and I came across a problem: scraping dd tags in HTML. In the picture below, I want to get the parameters in the red zone. The problem is that I do not know how to access them. I have tried this:

    kvadratura = float(nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[0])
    jedinica_mere = nek_html.find('span', class_='d-inline-block mt-auto').text.split(' ')[1].strip()
...

[Image: screenshot of the listing page with the target parameters marked in red]

But the problem is that different pages sometimes have different parameters, or the same parameters in a different order, so I can't access them by index. Check out the links:

https://www.nekretnine.rs/stambeni-objekti/stanovi/centar-zmaj-jovina-salonac-id1003/NkmUEzjEFo0/

https://www.nekretnine.rs/stambeni-objekti/stanovi/prodajemo-stan-milica-od-macve-mirijevo-46m2-nov/NkNruPymNHy/

How can I be sure that I will always scrape the parameter that I want? Each parameter goes into a list afterwards, so if a parameter does not exist, '' should be added to the list.


Answer

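In cases like this, select each dd by its label text rather than by index, since an index can point to the wrong dd when the parameters change or are reordered. With the approach below, all you need to do is replace the text inside :contains('') with the label you want, such as Transakcija, Vrsta stana and so on. If the label is not on the page, select_one returns None, so the value falls back to an empty string.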

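    import requests
    from bs4 import BeautifulSoup

    url = "https://www.nekretnine.rs/stambeni-objekti/stanovi/zemun-krajiska-41m-bela-fasadna-cila-odlican/NkiRX4sq4Cy/"

    res = requests.get(url)
    soup = BeautifulSoup(res.text, "lxml")

    # Pick the <dd> inside the block whose text contains the label "Kategorija:".
    # Newer soupsieve releases deprecate :contains in favor of :-soup-contains.
    Kategorija = soup.select_one(".base-inf .dl-horozontal:has(:contains('Kategorija:')) > dd")
    # Fall back to an empty string when the parameter is missing from the page.
    Kategorija = Kategorija.get_text(strip=True) if Kategorija else ""
    print(Kategorija)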
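
Since every parameter ends up in a list, the same idea can be wrapped in a small helper that takes the labels you care about and returns the values in that order, with '' for anything missing. This is only a sketch: the function name extract_parameters and the sample HTML are made up for illustration, and it assumes each label sits in a dt right before its dd, which may not match the real page exactly. It matches labels in Python instead of using :contains, and uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the selector used above.
# The real page structure is an assumption and may differ.
SAMPLE_HTML = """
<div class="base-inf">
  <dl class="dl-horozontal"><dt>Kategorija:</dt><dd>Stan</dd></dl>
  <dl class="dl-horozontal"><dt>Transakcija:</dt><dd> Prodaja </dd></dl>
</div>
"""


def extract_parameters(soup, labels):
    """Return one value per label, in order, using '' for missing labels."""
    found = {}
    # Collect every label/value pair that is present on the page.
    for dt in soup.select(".base-inf dt"):
        dd = dt.find_next_sibling("dd")
        if dd is not None:
            key = dt.get_text(strip=True).rstrip(":")
            found[key] = dd.get_text(strip=True)
    # Look up the wanted labels; absent ones become empty strings.
    return [found.get(label, "") for label in labels]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
values = extract_parameters(soup, ["Kategorija", "Transakcija", "Vrsta stana"])
print(values)  # ['Stan', 'Prodaja', '']
```

Because the lookup is by label and not by position, a page that reorders or drops parameters still gives a list of the same length, so the columns stay aligned across pages.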
User contributions licensed under: CC BY-SA