I am writing a web scraper that scrapes data from a list of links, one after the other. The problem is that the website uses the same class names for up to 3 different buttons at once, with no other unique identifiers, which to my understanding makes it impossible to point to the exact button when there is more than one.
I used driver.find_element, which worked well since it just returned the first match and ignored the other buttons. However, on some pages the offers information that I am trying to scrape is missing, so the script picks up the wrong data and fills it in, even though I am not interested in that data at all.
So I came up with a solution that checks whether the scraped text contains a specific string that only appears in the piece of information I am trying to get. If the string is not found, the variable should be overwritten with an empty string so that it is obvious that the information doesn’t exist.
However, the if statement that I am trying to filter the strings with doesn’t seem to work at all. When there are no buttons on the webpage, the variable does end up empty. But once a different button appears, it is not filtered out; it gets through somehow and ruins the whole thing.
This is an example webpage that doesn’t contain the data at all:
https://reality.idnes.cz/rk/detail/nido-group-s-r-o/5a85b108a26e3a2adb4e394c/?page=185
This is an example webpage that contains 2 buttons with data, the first of which I am trying to scrape. Look for the “nemovitostí” text in the blue button; that’s what I am trying to filter on.
https://reality.idnes.cz/rk/detail/m-m-reality-holding-a-s/5a85b582a26e3a321d4f2700/
This is the problematic code:
# Offers
offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
offers = offers.text
print(offers)
# Check if scraped information contains offers else move on
if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
    pass
else:
    offers = ""
Since the if statement should look for the set of strings and, if none of them is found, execute the code under the else statement, I can’t understand how the data gets in at all. There are no errors or warnings; it just picks up the data instead of ignoring it, even when the string is different.
This is more of the code for reference:
# Open links.csv file and read it's contents
with open('links.csv') as read:
    reader = csv.reader(read)
    link_list = list(reader)
# Information search
for link in link_list:
    driver.get(', '.join(link))
    # Title
    title = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.b-annot__title.mb-5")))
    # Offers
    offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
    offers = offers.text
    print(offers)
    # Check if scraped information contains offers else move on
    if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
        None
    else:
        offers = ""
    # Address
    address = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "p.font-sm")))
    # Phone number
    # Try to obtain phone number if nonexistent move on
    try:
        phone_number = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--phone')]]")))
        phone_number = phone_number.text
    except TimeoutException:
        phone_number = ""
    # Email
    # Try to obtain email if nonexistent move on
    try:
        email = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--email')]]")))
        email = email.text
    except TimeoutException:
        email = ""
    # Print scraping results
    print(title.text, " ", offers, " ", address.text, " ", phone_number, " ", email)
    # Save results to a list
    company = [title.text, offers, address.text, phone_number, email]
    # Write results to scraped.xlsx file
    worksheet.write_row(row, 0, company)
    del title, offers, address, phone_number, email
    # Push row number lower
    row += 1
workbook.close()
driver.quit()
How is it possible that the data still gets through? Is there an error in my syntax? If you saw my mistake please let me know so I can get better next time! Thanks to anyone for any sort of help!
Answer
1. "The problem is that the website uses the same class names for up to 3 different buttons at once with no other unique identifiers used which to my understanding makes it impossible to point to the exact button if there are more"
You can actually get the element you need if you use By.XPATH instead of By.CSS_SELECTOR.
The first button would be (//span[@class='btn__text'])[1], the second (//span[@class='btn__text'])[2], and the third (//span[@class='btn__text'])[3].
Or, if you are not sure what the order will be, you can be more specific:
(//span[@class='btn__text' and contains(text(),'nemovitostí')])
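As a rough sketch of how the more specific XPath could be used in the scraper: the hypothetical helper below (get_offers_text is not part of the original script) calls find_elements, which returns an empty list when nothing matches instead of raising an exception, so a page without the button yields an empty string. It assumes Selenium 4, where By.XPATH is the plain string "xpath", and it matches on the stem 'nemovitost', which is an assumption about how the button labels are worded.

```python
# Hypothetical helper; its name and the use of find_elements are assumptions,
# not something taken from the original script.
OFFERS_XPATH = "//span[@class='btn__text' and contains(text(), 'nemovitost')]"


def get_offers_text(driver):
    """Return the text of the first matching button, or "" if there is none."""
    # "xpath" is the value of selenium.webdriver.common.by.By.XPATH.
    matches = driver.find_elements("xpath", OFFERS_XPATH)
    return matches[0].text if matches else ""
```

Inside the loop this could stand in for the wait.until call, as in offers = get_offers_text(driver). Unlike wait.until, it does not use the explicit wait, so it relies on the button already being present once driver.get returns.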
2. The second problem is the syntax of the if condition in Python.
Each string needs its own "in" test, like this:
if "nemovitostí" in offers or "nemovitosti" in offers or "nemovitost" in offers:
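To see why the original condition always passed: Python parses it as "nemovitostí" or "nemovitosti" or ("nemovitost" in offers), and a non-empty string is truthy, so the or chain stops at the first literal and never looks at offers. A small illustration, using a made-up offers value:

```python
offers = "Kontaktovat"  # made-up button text that lacks the keyword

# "in" binds tighter than "or", so only the last literal is tested against offers.
original = "nemovitostí" or "nemovitosti" or "nemovitost" in offers
print(repr(original))  # → 'nemovitostí' (a non-empty string, hence truthy)

# Testing each string explicitly gives the intended result.
fixed = "nemovitostí" in offers or "nemovitosti" in offers or "nemovitost" in offers
print(fixed)  # → False
```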
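There might be a nicer way to write this, maybe something like this:
for i in ["nemovitostí", "nemovitosti", "nemovitost"]:
    if i in offers:
        break  # a keyword matched, keep offers as it is
else:
    offers = ""  # the loop finished without a match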
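Another way to write the same check is the built-in any(). The sketch below wraps it in a hypothetical helper (keep_if_offers and KEYWORDS are illustrative names, not from the original script). Since "nemovitost" is a prefix of the other two forms, testing for it alone would match the same strings; the full list is kept here only to mirror the original.

```python
# Illustrative names; the keyword list mirrors the strings from the question.
KEYWORDS = ("nemovitostí", "nemovitosti", "nemovitost")


def keep_if_offers(text):
    """Return text unchanged if it contains any keyword, otherwise ""."""
    return text if any(word in text for word in KEYWORDS) else ""
```

In the scraper this could be called right after reading the button text, for example offers = keep_if_offers(offers.text).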