I am trying to scrape the names of all companies with 500 or more employees from the following website:
I wrote a script that scrapes the company names from the first page, then clicks the "next page" button and scrapes the names again. The names are saved into a list, and this should repeat until the list contains a certain number of names. The script then converts the list into a DataFrame and exports it to an Excel (.xlsx) file. Unfortunately, it does not work as intended at the moment. Here is the code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time
from selenium.webdriver.support import expected_conditions as EC

company_list = []
driver = webdriver.Chrome('/Users/rieder/Anaconda3/chromedriver_win32/chromedriver.exe')
driver.get('https://de.statista.com/companydb/suche?idCountry=276&idBranch=0&revenueFrom=-1000000000000000000&revenueTo=1000000000000000000&employeesFrom=500&employeesTo=100000000&sortMethod=revenueDesc&p=1')
driver.find_element_by_id("cookiesNotificationConfirm").click();

while len(company_list) < 20:
    company_name = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//table[@class='zebraTable zebraTable--companies']//following::tr[2]/td[@class='zebraTable__td zebraTable__td--companyName']/a"))).get_attribute("innerHTML")
    for p in range(len(company_name)):
        company_list.append(company_name)
    driver.find_element_by_xpath("//*[@id='content']/section[3]/div/div/form/div/div[2]/div[2]/div[2]/div/button[2]").click();

print(company_list)
df = pd.DataFrame(company_list,columns =['Unternehmensname'])
df.to_excel("output.xlsx")
time.sleep(5)
My output looks like this:
['\n Progress-Werk Oberkirch AG\n ', '\n Progress-Werk Oberkirch AG\n ', '\n Progress-Werk Oberkirch AG\n ', ...]
(the same entry repeats for the rest of the list)
I think it's because .get_attribute() only gets one attribute, but I don't know how to get all the attributes at this point.
Thanks in advance.
Answer
Yes, with .get_attribute() you can only get one attribute at a time. To get all attributes of an element, you can use the JavaScript code below:
driver.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', ele)
Here, ele is your WebElement.
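A side note on the repeated output in the question: judging only from the posted code, the duplicates probably come from the for loop rather than from .get_attribute() itself. The locator returns a single element, so company_name is one string, and range(len(company_name)) then runs once per character of that string. The following is a minimal sketch of that effect in plain Python; the string literal is a stand-in for the scraped innerHTML, not real scraped data.

```python
# Stand-in for the innerHTML string returned for the single matched link.
company_name = "\n Progress-Werk Oberkirch AG\n "

company_list = []
# The loop from the question: it runs once per character of the string
# and appends the same whole string every time.
for p in range(len(company_name)):
    company_list.append(company_name)

# The list is as long as the string and holds only one distinct value.
print(len(company_list) == len(company_name))  # True
print(len(set(company_list)))  # 1

# Appending once and stripping the surrounding whitespace gives one clean entry.
clean_list = [company_name.strip()]
print(clean_list)  # ['Progress-Werk Oberkirch AG']
```

This would also explain why the while loop stops after the first page: the list passes 20 entries on the first iteration.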
To print all the company names, you can use the approach below:
company_names = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//td[@class='zebraTable__td zebraTable__td--companyName']")))
for cn in company_names:
    print(cn.text)
Note: This prints all the company names on the first page only. To get the names from every page, you need to click the next-page button on each page and run the code above in a loop.
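The loop described in the note could be sketched roughly as follows. This is only an illustration under assumptions: collect_company_names and selenium_callbacks are hypothetical helper names, the XPaths are copied from the question and the code above, and it is assumed that clicking the next-page button loads the new page before the next wait runs. The page handling is passed in as two callables so the loop logic can be tried without a browser.

```python
from typing import Callable, List, Tuple


def collect_company_names(
    get_page_names: Callable[[], List[str]],
    go_to_next_page: Callable[[], bool],
    limit: int,
) -> List[str]:
    """Gather names page by page until `limit` names are collected
    or no further page is available."""
    names: List[str] = []
    while len(names) < limit:
        # Strip surrounding whitespace and skip empty cells.
        page_names = [n.strip() for n in get_page_names() if n.strip()]
        if not page_names:
            break
        names.extend(page_names)
        if len(names) >= limit or not go_to_next_page():
            break
    return names[:limit]


def selenium_callbacks(driver) -> Tuple[Callable[[], List[str]], Callable[[], bool]]:
    """Hypothetical wiring to a live Selenium driver (not run here)."""
    from selenium.common.exceptions import NoSuchElementException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # XPaths taken from the question and the answer; they may need adjusting.
    name_xpath = "//td[@class='zebraTable__td zebraTable__td--companyName']"
    next_xpath = ("//*[@id='content']/section[3]/div/div/form/div/div[2]"
                  "/div[2]/div[2]/div/button[2]")

    def get_page_names() -> List[str]:
        cells = WebDriverWait(driver, 20).until(
            EC.visibility_of_all_elements_located((By.XPATH, name_xpath))
        )
        return [cell.text for cell in cells]

    def go_to_next_page() -> bool:
        try:
            driver.find_element(By.XPATH, next_xpath).click()
        except NoSuchElementException:
            return False
        return True

    return get_page_names, go_to_next_page


# Browser-free demonstration with made-up page contents.
pages = [[" A GmbH\n", "B AG"], ["C SE", "D KG"], ["E AG"]]
state = {"page": 0}


def fake_names() -> List[str]:
    return pages[state["page"]]


def fake_next() -> bool:
    if state["page"] + 1 >= len(pages):
        return False
    state["page"] += 1
    return True


print(collect_company_names(fake_names, fake_next, limit=3))
# ['A GmbH', 'B AG', 'C SE']
```

With a real driver, the two callables returned by selenium_callbacks(driver) would be passed in instead of the fakes, and the resulting list could then go into pd.DataFrame(...) and to_excel("output.xlsx") as in the question.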