
Python – Selenium – webscrape table with text in html using WebDriverWait

I am trying to scrape the names of all companies with 500 or more employees from the following website:

https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1

I wrote code that scrapes the company names on the first page; the script should then click the “Next page” button and scrape the names again. The names are saved into a list, and this repeats until the list holds a certain number of names. The list is then converted to a DataFrame and exported to an xlsx file. Unfortunately, it does not do this at the moment. Here is the code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

company_list = []

driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')

driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')

driver.find_element_by_id("cookiesNotificationConfirm").click();

while len(company_list) < 20:

    company_name = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML")
    
    for p in range(len(company_name)):
        company_list.append(company_name)
        
    driver.find_element_by_xpath("//*[@id='content']/section[3]/div/div/form/div/div[2]/div[2]/div[2]/div/button[2]").click();
              
    print(company_list)

    df = pd.DataFrame(company_list,columns =['Unternehmensname']) 

    df.to_excel("output.xlsx")  
            
    time.sleep(5)

And my Output looks like this:

['\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', '\n                        Progress-Werk Oberkirch AG\n                    ', ...]

(the same string repeats for every remaining entry in the list)

I think it's because .get_attribute() only gets one attribute, but I don't know how to get all the attributes at this point.

Thanks in advance.


Answer

Yes, .get_attribute() returns only one attribute at a time. To get all attributes of an element, you can use the JavaScript code below:

driver.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', ele)

Here ele is your WebElement.

To print all the company names, you can use the approach below:

company_names = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='zebraTable__td zebraTable__td--companyName']")))
for cn in company_names:
    print(cn.text)
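
The output in the question shows each name wrapped in newlines and padding, and the same name appended many times. As a rough, browser-free sketch of tidying such a list before it goes into a DataFrame: the helper name clean_names and the second sample name are made up for illustration, and dropping duplicates is an assumption about what is wanted.

```python
def clean_names(raw_names):
    """Strip surrounding whitespace, skip empty strings, drop duplicates (order kept)."""
    seen = set()
    cleaned = []
    for raw in raw_names:
        name = raw.strip()  # removes the leading/trailing "\n" and spaces
        if name and name not in seen:
            seen.add(name)
            cleaned.append(name)
    return cleaned


# Sample input shaped like the question's output; "Example GmbH" is a placeholder.
raw = [
    "\n                        Progress-Werk Oberkirch AG\n                    ",
    "\n                        Progress-Werk Oberkirch AG\n                    ",
    "  Example GmbH ",
]
print(clean_names(raw))
```

With Selenium, cn.text usually already returns the visible text without the surrounding padding, so the stripping step mostly matters when reading innerHTML as in the question.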

Note: This prints all the company names on the first page. If you want the names from all pages, you need to click the next-page icon on each page and run the above code in a loop.
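
The loop described in the note could look roughly like the sketch below. It is kept independent of the browser so the control flow is easy to check: read_page and go_to_next_page are hypothetical callables, not Selenium functions. In a real script, read_page would wrap the WebDriverWait call above and return [cn.text for cn in company_names], and go_to_next_page would click the next-page button and return False when there is none.

```python
def collect_names(read_page, go_to_next_page, limit):
    """Gather names page by page until `limit` names are collected or pages run out."""
    names = []
    while len(names) < limit:
        page_names = read_page()      # all names on the current page
        if not page_names:
            break                     # empty page: nothing more to read
        names.extend(page_names)      # append every name once, not once per character
        if len(names) >= limit or not go_to_next_page():
            break
    return names[:limit]


# Stand-in for the website: three fake pages with placeholder names.
pages = [["A AG", "B GmbH"], ["C SE", "D KG"], ["E AG"]]
state = {"page": 0}

def read_page():
    return pages[state["page"]]

def go_to_next_page():
    if state["page"] + 1 >= len(pages):
        return False                  # already on the last page
    state["page"] += 1
    return True

print(collect_names(read_page, go_to_next_page, limit=3))
```

Building the DataFrame and calling to_excel once after the loop, rather than inside it as in the question, avoids rewriting the file on every page.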

User contributions licensed under: CC BY-SA