
How do I run through a list of links one by one and then scrape data using Selenium (driver.get)?

I’m trying to loop through two sets of links. Starting with https://cuetracker.net/seasons, I want to click through each season link (the last 5 seasons), then click through each tournament link within each season and scrape the match data from each tournament.

Using the code below I have managed to get the list of season links I want, but when I try to grab the tournament links and put them into a list, I only end up with the last season’s tournament links instead of every season’s.

I’d guess it’s something to do with driver.get completing before the next lines of code run, and that I need to loop/iterate using indexes, but I’m a complete novice so I’m not too sure.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

Chrome_Path = r"C:UsersGeorgeDesktopchromedriver.exe"
Browser = webdriver.Chrome(Chrome_Path)

Browser.get("https://cuetracker.net/seasons")


links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs=[]
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

for href in hrefs:
    Browser.get(href)
    links2 = Browser.find_elements_by_partial_link_text("20")
    hrefs2 =[]
    for link in links2:
        hrefs2.append(link.get_attribute("href"))


Answer

You are pretty close and you are right about “you just need to wait a bit”.

You could wait for the page to load: wait_for_page_load checks the document readyState, and once everything is loaded you are good to go. Check this thread for more. :)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup

import os
import re
import time
import pandas as pd


def wait_for_page_load():
    # Poll document.readyState until the browser reports the page has
    # finished loading, or give up after `timer` seconds.
    timer = 10
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = Browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')


Chrome_Path = r"C:UsersGeorgeDesktopchromedriver.exe"
Browser = webdriver.Chrome()

Browser.get("https://cuetracker.net/seasons")


# collect every link in the seasons table
links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]  # keep only the desired season links (index 1-4)

# tournament links keyed by season URL - defined once, before the loop
hrefs2 = {}

for href in hrefs:
    hrefs2[href] = []
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    for link in links2:
        hrefs2[href].append(link.get_attribute("href"))
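
Since WebDriverWait and expected_conditions are already imported, an explicit wait for a specific element is a common alternative to polling document.readyState yourself. Below is a minimal sketch assuming the tournament links live inside the same table.table.table-striped table on each season page; the wait_for_table name and the 10-second timeout are my own choices, not part of the original answer.

def wait_for_table(browser, timeout=10):
    # Block until the striped table is present in the DOM;
    # raises a TimeoutException if it does not appear within `timeout` seconds.
    WebDriverWait(browser, timeout).until(
        expected_conditions.presence_of_element_located(
            (By.CSS_SELECTOR, "table.table.table-striped")
        )
    )

# Usage inside the loop, in place of wait_for_page_load():
#     Browser.get(href)
#     wait_for_table(Browser)
#     links2 = Browser.find_elements_by_partial_link_text("20")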

A few notes, if you don’t mind:

  • Browser should be lowercase (browser or driver); the same applies to Chrome_Path
  • check out XPath, it is awesome

EDIT:

I was sloppy the first time around, so I’ve updated the answer to actually answer the question :D. Waiting for the page to load is still a good idea :)

The problem was that you re-defined hrefs2 on every pass through the loop, so at the end it only held the results of the last iteration.
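
If you don’t actually need the links grouped by season, a flat list works as well, as long as it is created once before the loop. A quick sketch of that variant, reusing hrefs and wait_for_page_load from above:

all_tournament_urls = []

for href in hrefs:
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    # extend() keeps adding to the one shared list instead of
    # overwriting it on every pass through the loop
    all_tournament_urls.extend(link.get_attribute("href") for link in links2)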

About why XPath:

If you also wanted to load seasons from before 2000, your URL-collecting logic (matching link text that contains "20") would break. You could still do this:

table = Browser.find_element_by_xpath('//*[@class="table table-striped"]')
all_urls = [x.get_attribute('href') for x in table.find_elements_by_xpath('.//tr/td[2]/a')]

This finds the table by its class name and then collects the URLs from the second column of the table.

If you know the URL pattern, you can even do this:

all_urls = [x.get_attribute('href') for x in Browser.find_elements_by_xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]')]

The XPath above:

  • //td <- at any depth of the document tree, find td elements
  • //a <- within those td elements, get all a children (again at any depth)
  • [contains(@href, "https://cuetracker.net/tournaments")] <- from the collected a elements, keep only those whose href attribute contains "https://cuetracker.net/tournaments" (a partial match)
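
Putting the pieces together, here is a sketch of how that href-based XPath could slot into the season loop. It is my own combination of the snippets above (reusing hrefs and wait_for_page_load), not tested against the live site:

tournaments_by_season = {}

for href in hrefs:
    Browser.get(href)
    wait_for_page_load()
    # Match tournament links by their URL pattern rather than by link text,
    # so seasons from before 2000 would be collected correctly as well.
    tournaments_by_season[href] = [
        x.get_attribute('href')
        for x in Browser.find_elements_by_xpath(
            '//td//a[contains(@href, "https://cuetracker.net/tournaments")]')
    ]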