
String filtering in an if statement not working in Python

I am writing a web scraper that scrapes data from a list of links one after another. The problem is that the website uses the same class names for up to 3 different buttons at once, with no other unique identifiers, which to my understanding makes it impossible to point to the exact button if there are more.

I used driver.find_element, which worked well since it just returned the first match and effectively ignored the other buttons. However, on some pages the offers information that I am trying to scrape is missing, which results in the script picking up the wrong data and filling it in, even though I am not interested in that data at all.

So I came up with a solution that checks whether the scraped information contains a specific string that only appears in the piece of information I am trying to get. If the string is not found, the data variable should be overwritten with an empty string, so that it is obvious that the information doesn’t exist.

However, the if statement that I am trying to filter the strings with doesn’t seem to work at all. When there are no buttons on the webpage, it does manage to fill the variable with empty data. However, once a different button appears, it is not filtered out; it gets through somehow and ruins the whole thing.

This is an example webpage which doesn’t contain the data at all:

https://reality.idnes.cz/rk/detail/nido-group-s-r-o/5a85b108a26e3a2adb4e394c/?page=185

This is an example webpage that contains 2 buttons with data, the first of which I am trying to scrape. Look for the “nemovitostí” text in the blue button; that is what I am trying to filter on.

https://reality.idnes.cz/rk/detail/m-m-reality-holding-a-s/5a85b582a26e3a321d4f2700/

This is the problematic code:

# Offers
offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
offers = offers.text
print(offers)
# Check if scraped information contains offers else move on
if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
    pass
else:
    offers = ""

Since the if statement should look for the set of strings and, if none of them is found, execute the code under the else statement, I can’t understand how the data gets in at all. There are no error codes or warnings; it just picks up the data instead of ignoring it, even if the string is different.

This is more of the code for reference:

# Open links.csv file and read its contents
with open('links.csv') as read:
    reader = csv.reader(read)
    link_list = list(reader)
    # Information search
    for link in link_list:
        driver.get(', '.join(link))
        # Title
        title = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "h1.b-annot__title.mb-5")))
        # Offers
        offers = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "span.btn__text")))
        offers = offers.text
        print(offers)
        # Check if scraped information contains offers else move on
        if "nemovitostí" or "nemovitosti" or "nemovitost" in offers:
            None
        else:
            offers = ""
        # Address
        address = wait.until(ec.presence_of_element_located((By.CSS_SELECTOR, "p.font-sm")))
        # Phone number
        # Try to obtain phone number if nonexistent move on
        try:
            phone_number = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--phone')]]")))
            phone_number = phone_number.text
        except TimeoutException:
            phone_number = ""
        # Email
        # Try to obtain email if nonexistent move on
        try:
            email = wait.until(ec.presence_of_element_located((By.XPATH, "//a[./span[contains(@class, 'icon icon--email')]]")))
            email = email.text
        except TimeoutException:
            email = ""
        # Print scraping results
        print(title.text, " ", offers, " ", address.text, " ", phone_number, " ", email)
        # Save results to a list
        company = [title.text, offers, address.text, phone_number, email]
        # Write results to scraped.xlsx file
        worksheet.write_row(row, 0, company)
        del title, offers, address, phone_number, email
        # Push row number lower
        row += 1
    workbook.close()
    driver.quit()

How is it possible that the data still gets through? Is there an error in my syntax? If you saw my mistake please let me know so I can get better next time! Thanks to anyone for any sort of help!


Answer

1. The problem is that the website uses the same class names for up to 3 different buttons at once with no other unique identifiers used which to my understanding makes it impossible to point to the exact button if there are more

You can actually get the element you need if you use By.XPATH instead of By.CSS_SELECTOR. The first button would be (//span[@class='btn__text'])[1], the second (//span[@class='btn__text'])[2], and the third (//span[@class='btn__text'])[3]. Or, if you are not sure what the order will be, you can be more specific and match on the button text: (//span[@class='btn__text' and contains(text(),'nemovitostí')]).
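As a rough sketch of how those expressions could be built, the snippet below produces the positional and the text-based XPath strings. The helper names nth_button_xpath and button_with_text_xpath are made up for illustration and are not part of Selenium; the class name btn__text is taken from the question.

```python
# Base expression for the button label spans (class name from the question).
BUTTON_TEXT = "//span[@class='btn__text']"


def nth_button_xpath(position):
    """XPath for the button label at a 1-based position in document order."""
    if position < 1:
        raise ValueError("XPath positions start at 1")
    return f"({BUTTON_TEXT})[{position}]"


def button_with_text_xpath(fragment):
    """XPath for the button label whose text contains `fragment`."""
    if "'" in fragment:
        # Keep the sketch simple: no escaping of quotes inside the literal.
        raise ValueError("fragment must not contain a single quote")
    return f"//span[@class='btn__text' and contains(text(), '{fragment}')]"


print(nth_button_xpath(2))
print(button_with_text_xpath("nemovitost"))
```

Either string can be passed as the second item of the (By.XPATH, ...) tuple in place of the CSS selector. With the text-based version the wait should time out on pages that have no offers button, so it would probably make sense to wrap it in the same try/except TimeoutException pattern the question already uses for the phone number and set offers = "" in the except branch.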

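2. The second problem is the syntax of the if condition in Python

The expression "nemovitostí" or "nemovitosti" or "nemovitost" in offers does not test each string against offers. The in operator applies only to the last string, and the first operand is a non-empty string, which is always truthy, so the whole condition is always true and the else branch never runs. It should be like this:

if "nemovitostí" in offers or "nemovitosti" in offers or "nemovitost" in offers: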
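To see why the original condition lets everything through, here is a small sketch that compares the two forms side by side. The function names and the sample button labels are invented for the demonstration; only the keywords come from the question.

```python
def buggy_check(offers):
    # Parsed as: "nemovitostí" or "nemovitosti" or ("nemovitost" in offers)
    # The first operand is a non-empty string, which is truthy, so `or`
    # short-circuits and returns it without ever looking at `offers`.
    return "nemovitostí" or "nemovitosti" or "nemovitost" in offers


def fixed_check(offers):
    # Each keyword gets its own membership test.
    return ("nemovitostí" in offers
            or "nemovitosti" in offers
            or "nemovitost" in offers)


# Hypothetical button labels, not taken from the real site.
samples = ["124 nemovitostí", "Zobrazit telefon", ""]
for text in samples:
    print(repr(text), bool(buggy_check(text)), fixed_check(text))
```

The buggy version is truthy for every input, including the empty string, while the fixed version is only true when a keyword is actually present. Note also that "nemovitost" is a substring of both "nemovitostí" and "nemovitosti", so testing for "nemovitost" alone would match the same labels.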

There might be a nicer way to write this, maybe something like this:

keywords = ["nemovitostí", "nemovitosti", "nemovitost"]
if not any(keyword in offers for keyword in keywords):
    offers = ""
User contributions licensed under: CC BY-SA