I’m trying to loop through 2 sets of links. Starting at https://cuetracker.net/seasons, I want to click through each season link (the last 5 seasons), then click through each tournament link within each season and scrape the match data from each tournament.
Using the code below I have managed to get the list of season links I want, but when I try to collect the tournament links into a list, I only get the tournament links for the last season instead of for every season.
I’d guess it’s something to do with `driver.get` completing before the next lines of code run, and that I need to loop/iterate using indexes, but I’m a complete novice so I’m not too sure.
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import re
import pandas as pd
import os

Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"

Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")

links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

for href in hrefs:
    Browser.get(href)
    links2 = Browser.find_elements_by_partial_link_text("20")
    hrefs2 = []
    for link in links2:
        hrefs2.append(link.get_attribute("href"))
```
Answer
You are pretty close, and you are right that “you just need to wait a bit”.
You could wait for the page to load: `wait_for_page_load` checks the document’s `readyState`, and once everything is loaded you are good to go. Check this thread for more. :)
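As a browser-free sketch of that polling idea (the `wait_for` helper, its parameter names, and the timeout values here are illustrative, not part of Selenium’s API):

```python
import time


def wait_for(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds elapse."""
    start = time.time()
    while True:
        result = condition()
        if result:
            return result
        if time.time() - start > timeout:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll)


# With a Selenium driver you would use it like:
# wait_for(lambda: Browser.execute_script('return document.readyState;') == 'complete')
```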
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import os
import re
import time
import pandas as pd


def wait_for_page_load():
    timer = 10
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = Browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')


Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"

Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")

links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

hrefs2 = {}
for href in hrefs:
    hrefs2[href] = []
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    for link in links2:
        hrefs2[href].append(link.get_attribute("href"))
```
A few notes if you don’t mind:

- `Browser` should be `browser` or `driver`; the same applies to `Chrome_Path`
- check out XPath, it is awesome
EDIT:
I’ve been sloppy for the first time so I’ve updated the answer to answer the question :D. Waiting for page load is still a good idea :)
The problem was that you re-defined `hrefs2` in each cycle, so it always contained only the result of the last iteration.
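The same “last iteration wins” pattern can be reproduced with plain lists, no browser involved (the season and tournament names below are made up for illustration):

```python
seasons = ['2019-2020', '2020-2021', '2021-2022']

# Buggy: the list is re-created inside the loop, so after the loop
# it only holds the links from the final season.
for season in seasons:
    links = []
    links.append(season + '/tournament-a')
    links.append(season + '/tournament-b')
print(links)  # ['2021-2022/tournament-a', '2021-2022/tournament-b']

# Fixed: create the container once, before the loop, keyed by season.
all_links = {}
for season in seasons:
    all_links[season] = [season + '/tournament-a', season + '/tournament-b']
```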
About why xpath:
If you wanted to load results from before 2000, your URL-collecting logic (matching link text that contains “20”) would break. You could do this instead:
```python
table = Browser.find_element_by_xpath('//*[@class="table table-striped"]')
all_urls = [x.get_attribute('href') for x in table.find_elements_by_xpath('.//tr/td[2]/a')]
```
Here you find the table by its class name, then collect the URLs from the second column of the table.
If you know the url pattern you can even do this:
```python
all_urls = [x.get_attribute('href')
            for x in Browser.find_elements_by_xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]')]
```
The XPath above:

- `//td` — at any depth of the document tree, find `td` tagged elements
- `//a` — within the collected `td` elements, get all children (at any depth) which are `a` tagged
- `[contains(@href, "https://cuetracker.net/tournaments")]` — from the collected `a` tagged elements, keep those whose `href` attribute contains the text `"https://cuetracker.net/tournaments"` (partial match)
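To see that selection logic without a browser, here is a rough stand-in using only the standard library on an invented HTML fragment (the table rows are made up for illustration; note that `xml.etree` does not support XPath’s `contains()`, unlike Selenium or lxml, so the partial match is done in Python):

```python
import xml.etree.ElementTree as ET

# Invented fragment mimicking a seasons/tournaments table.
fragment = """
<table class="table table-striped">
  <tr><td>1</td><td><a href="https://cuetracker.net/tournaments/uk-championship">UK Championship</a></td></tr>
  <tr><td>2</td><td><a href="https://cuetracker.net/tournaments/masters">Masters</a></td></tr>
  <tr><td>3</td><td><a href="https://cuetracker.net/players/ronnie">Player page</a></td></tr>
</table>
"""

root = ET.fromstring(fragment)

# //td//a  ->  every <a> at any depth under a <td> ...
anchors = [a for td in root.iter('td') for a in td.iter('a')]
# ... [contains(@href, "...")]  ->  keep only the tournament links.
all_urls = [a.get('href') for a in anchors
            if 'https://cuetracker.net/tournaments' in (a.get('href') or '')]
```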