I am currently learning Python in order to web-scrape and am running into an issue with my current script. After closing the pop-up on page 2 of Indeed and cycling through the pages, the script only writes one page of data to the CSV, even though it does print each page in my terminal. It also occasionally returns only part of the data from a page: e.g. page 2 will return info for the first 3 jobs as part of my print(df_da), but nothing for the next 12. Additionally, the script takes a very long time to run (averaging around 6 minutes and 45 seconds for the 5 pages, i.e. around 1 to 1.5 minutes per page). Any suggestions? I've attached my script and can also attach the output of print(df_da) below if needed. Thank you in advance!
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

for i in range(0, 50, 10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start=' + str(i))
    driver.implicitly_wait(5)

    jobtitles = []
    companies = []
    locations = []
    descriptions = []

    jobs = driver.find_elements_by_class_name("slider_container")
    for job in jobs:
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

    try:
        WebDriverWait(driver, 5).until(EC.visibility_of_element_located(
            (By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
    except:
        pass

    df_da = pd.DataFrame()
    df_da['JobTitle'] = jobtitles
    df_da['Company'] = companies
    df_da['Location'] = locations
    df_da['Description'] = descriptions

    print(df_da)
    df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')
Answer
You are defining df_da inside the outer for loop, so df_da will only ever contain the data from the last page.
You should define the lists outside the loops and build the DataFrame only after all of the data has been collected.
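In outline, the fix looks like this (the page and job counts here are placeholders standing in for the real Selenium calls):

```python
# Accumulators are created once, before the page loop, so rows
# collected on earlier pages are not thrown away.
rows = []

for page in range(5):        # stand-in for the driver.get(...) page loop
    for n in range(15):      # stand-in for the per-job scraping loop
        rows.append({"JobTitle": f"job {page}-{n}"})

# Only here, after all pages, build the DataFrame and write the CSV once.
print(len(rows))  # 75 rows: 5 pages x 15 jobs each
```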
Most likely you are not getting all of the jobs on the second page because of the pop-up. So, you should close it before collecting the job details on that page.
Also, you can avoid waiting for the pop-up element on every loop iteration and check for it on the second iteration only, which should also cut down the runtime.
Your code can be something like this:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = Options()
options.add_argument("window-size=1400,1400")

PATH = "C://Program Files (x86)//chromedriver.exe"
driver = webdriver.Chrome(PATH)

# The accumulators are defined once, outside the page loop
jobtitles = []
companies = []
locations = []
descriptions = []

for i in range(0, 50, 10):
    driver.get('https://www.indeed.com/jobs?q=chemical%20engineer&l=united%20states&start=' + str(i))
    driver.implicitly_wait(5)

    jobs = driver.find_elements_by_class_name("slider_container")
    for idx, job in enumerate(jobs):
        # Close the pop-up (checked on the second iteration only)
        # before collecting the rest of the job details
        if idx == 1:
            try:
                WebDriverWait(driver, 5).until(EC.visibility_of_element_located(
                    (By.CSS_SELECTOR, "button.popover-x-button-close.icl-CloseButton"))).click()
            except:
                pass
        jobtitle = job.find_element_by_class_name('jobTitle').text.replace("new", "").strip()
        jobtitles.append(jobtitle)
        company = job.find_element_by_class_name('companyName').text.replace("new", "").strip()
        companies.append(company)
        location = job.find_element_by_class_name('companyLocation').text.replace("new", "").strip()
        locations.append(location)
        description = job.find_element_by_class_name('job-snippet').text.replace("new", "").strip()
        descriptions.append(description)

# Build and save the DataFrame once, after all pages have been scraped
df_da = pd.DataFrame()
df_da['JobTitle'] = jobtitles
df_da['Company'] = companies
df_da['Location'] = locations
df_da['Description'] = descriptions

print(df_da)
df_da.to_csv('C:/Users/Dan/Desktop/AZNext/file_name1.csv')