I’m trying to loop through 2 sets of links. Starting at https://cuetracker.net/seasons, I want to click through each season link (the last 5 seasons), then click through each tournament link within each season and scrape the match data from each tournament.
Using the code below I have managed to get the list of season links I want, but when I try to collect the tournament links into a list, I only get the tournament links for the last season instead of for every season.
I’d guess it’s something to do with `driver.get` completing before the next lines of code run, and that I need to loop/iterate using indexes, but I’m a complete novice so I’m not too sure.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"

Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")

links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

for href in hrefs:
    Browser.get(href)
    links2 = Browser.find_elements_by_partial_link_text("20")
    hrefs2 = []
    for link in links2:
        hrefs2.append(link.get_attribute("href"))
```
Answer
You are pretty close, and you are right that “you just need to wait a bit”.
You could wait for the page to load: `wait_for_page_load` checks the document’s `readyState`, and once everything is loaded you are good to go. Check this thread for more. :)
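As a browser-free sketch of that polling idea (the `wait_for` helper, its parameter names, and the timeout values here are illustrative, not part of Selenium’s API):

```python
import time


def wait_for(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse."""
    start = time.time()
    while True:
        result = condition()
        if result:
            return result
        if time.time() - start > timeout:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)


# With a Selenium driver you would use it like:
# wait_for(lambda: Browser.execute_script('return document.readyState;') == 'complete')
```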
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import os
import re
import time
import pandas as pd


def wait_for_page_load():
    timer = 10
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = Browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')


Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"

Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")

links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

hrefs2 = {}
for href in hrefs:
    hrefs2[href] = []
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    for link in links2:
        hrefs2[href].append(link.get_attribute("href"))
```
A few notes if you don’t mind:

- `Browser` should be `browser` or `driver`; the same applies to `Chrome_Path`
- check out XPath, it is awesome
EDIT:
I’ve been sloppy for the first time so I’ve updated the answer to answer the question :D. Waiting for page load is still a good idea :)
The problem was that you re-defined `hrefs2` in each cycle, so it always contained only the result of the last iteration.
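The same “last iteration wins” pattern can be reproduced with plain lists, no browser involved (the season and tournament names below are made up for illustration):

```python
seasons = ['2019-2020', '2020-2021', '2021-2022']

# Buggy: the list is re-created inside the loop, so after the loop
# it only holds the links from the final season.
for season in seasons:
    links = []
    links.append(season + '/tournament-a')
    links.append(season + '/tournament-b')
print(links)  # ['2021-2022/tournament-a', '2021-2022/tournament-b']

# Fixed: create the container once, before the loop, keyed by season.
all_links = {}
for season in seasons:
    all_links[season] = [season + '/tournament-a', season + '/tournament-b']
```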
About why xpath:
If you wanted to load results from before 2000, your URL-collecting logic (matching link text that contains “20”) would break. You could do this instead:
```python
table = Browser.find_element_by_xpath('//*[@class="table table-striped"]')
all_urls = [x.get_attribute('href') for x in table.find_elements_by_xpath('.//tr/td[2]/a')]
```
Here you find the table by its class name, then collect the URLs from the second column of the table.
If you know the url pattern you can even do this:
```python
all_urls = [x.get_attribute('href')
            for x in Browser.find_elements_by_xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]')]
```
The XPath above:

- `//td` — at any depth of the document tree, find `td` tagged elements
- `//a` — within the collected `td` elements, get all children (at any depth) which are `a` tagged
- `[contains(@href, "https://cuetracker.net/tournaments")]` — from the collected `a` tagged elements, keep those whose `href` attribute contains the text `"https://cuetracker.net/tournaments"` (partial match)
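To see that selection logic without a browser, here is a rough stand-in using only the standard library on an invented HTML fragment (the table rows are made up for illustration; note that `xml.etree` does not support XPath’s `contains()`, unlike Selenium or lxml, so the partial match is done in Python):

```python
import xml.etree.ElementTree as ET

# Invented fragment mimicking a seasons/tournaments table.
fragment = """
<table class="table table-striped">
  <tr><td>1</td><td><a href="https://cuetracker.net/tournaments/uk-championship">UK Championship</a></td></tr>
  <tr><td>2</td><td><a href="https://cuetracker.net/tournaments/masters">Masters</a></td></tr>
  <tr><td>3</td><td><a href="https://cuetracker.net/players/ronnie">Player page</a></td></tr>
</table>
"""

root = ET.fromstring(fragment)

# //td//a  ->  every <a> at any depth under a <td> ...
anchors = [a for td in root.iter('td') for a in td.iter('a')]
# ... [contains(@href, "...")]  ->  keep only the tournament links.
all_urls = [a.get('href') for a in anchors
            if 'https://cuetracker.net/tournaments' in (a.get('href') or '')]
```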