I’m trying to loop through 2 sets of links. Starting with https://cuetracker.net/seasons > click through each season link (Last 5 seasons) and then click through each tournament link within each season link and scrape the match data from each tournament.
Using the code below I have managed to get the list of season links I want, but when I try to grab the tournament links and put them into a list, I only get the last season's tournament links instead of each season's.
I'd guess it's something to do with driver.get completing before the next lines of code run, and that I need to loop/iterate using indexes, but I'm a complete novice so I'm not too sure.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import re
import pandas as pd
import os
Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"
Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")
links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

for href in hrefs:
    Browser.get(href)
    links2 = Browser.find_elements_by_partial_link_text("20")
    hrefs2 = []
    for link in links2:
        hrefs2.append(link.get_attribute("href"))
Answer
You are pretty close, and you are right that "you just need to wait a bit".
You could wait for the page to load: wait_for_page_load
checks document.readyState, and once everything is loaded you are good to go. Check this thread for more. :)
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support.select import Select
from bs4 import BeautifulSoup
import os
import re
import time
import pandas as pd
def wait_for_page_load():
    timer = 10
    start_time = time.time()
    page_state = None
    while page_state != 'complete':
        time.sleep(0.5)
        page_state = Browser.execute_script('return document.readyState;')
        if time.time() - start_time > timer:
            raise Exception('Timeout :(')
Chrome_Path = r"C:\Users\George\Desktop\chromedriver.exe"
Browser = webdriver.Chrome(Chrome_Path)
Browser.get("https://cuetracker.net/seasons")
links = Browser.find_elements_by_css_selector("table.table.table-striped a")
hrefs = []
for link in links:
    hrefs.append(link.get_attribute("href"))

hrefs = hrefs[1:5]

hrefs2 = {}
for href in hrefs:
    hrefs2[href] = []
    Browser.get(href)
    wait_for_page_load()
    links2 = Browser.find_elements_by_partial_link_text("20")
    for link in links2:
        hrefs2[href].append(link.get_attribute("href"))
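If you'd rather not hard-code the readyState check, the same polling pattern can be pulled out into a generic helper. This is just a sketch (wait_until is a made-up name, not part of Selenium) that accepts any condition callable:

```python
import time

def wait_until(condition, timeout=10, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    Mirrors the wait_for_page_load logic above, but reusable for any check.
    """
    start_time = time.time()
    while True:
        if condition():
            return True
        if time.time() - start_time > timeout:
            raise TimeoutError('Timeout :(')
        time.sleep(poll)
```

With Selenium it would be called like `wait_until(lambda: Browser.execute_script('return document.readyState;') == 'complete')`.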
A few notes if you don't mind:
- Browser should be browser or driver, and the same applies to Chrome_Path
- check out XPath, it is awesome
EDIT:
I've been sloppy for the first time, so I've updated the answer to actually answer the question :D. Waiting for page load is still a good idea :)
The problem was that you re-defined hrefs2
inside the loop, so it always contained only the results of the last iteration.
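The difference is easy to see in isolation. Here is a minimal sketch (collect_buggy and collect_fixed are hypothetical names; pages stands in for the season pages and their tournament links):

```python
def collect_buggy(pages):
    # hrefs2 is re-created on every pass of the outer loop,
    # so the results of earlier seasons are thrown away
    for url, links in pages.items():
        hrefs2 = []
        for link in links:
            hrefs2.append(link)
    return hrefs2

def collect_fixed(pages):
    # hrefs2 is created once, before the loop, and keyed by season URL,
    # so every season's links survive
    hrefs2 = {}
    for url, links in pages.items():
        hrefs2[url] = []
        for link in links:
            hrefs2[url].append(link)
    return hrefs2
```

For example, with two seasons the buggy version returns only the second season's links, while the fixed version keeps both.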
About why XPath:
If you ever want to load results from before 2000, your URL-collecting logic (matching link text on "20") would break. You could do this instead:
table = Browser.find_element_by_xpath('//*[@class="table table-striped"]')
all_urls = [x.get_attribute('href') for x in table.find_elements_by_xpath('.//tr/td[2]/a')]
Where you find the table by the class name, then collect the urls from the second column of the table.
If you know the url pattern you can even do this:
all_urls = [x.get_attribute('href') for x in Browser.find_elements_by_xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]')]
The XPath above:
- //td <- at any depth of the document tree, find td-tagged elements
- //a <- within the collected td elements, get all children which are a-tagged (at any depth)
- [contains(@href, "https://cuetracker.net/tournaments")] <- from the list of collected a-tagged elements, keep those whose href attribute contains the text "https://cuetracker.net/tournaments" (partial match)
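If you want to sanity-check that XPath without launching a browser, you can run it with lxml (a third-party package, assumed installed) against a small snippet shaped like the CueTracker table:

```python
from lxml import html

# A tiny stand-in for the seasons table: one tournament link, one season link
snippet = """
<table>
  <tr><td>1</td><td><a href="https://cuetracker.net/tournaments/uk-championship">UK Championship</a></td></tr>
  <tr><td>2</td><td><a href="https://cuetracker.net/seasons/2020-2021">2020/2021</a></td></tr>
</table>
"""

tree = html.fromstring(snippet)
# Same predicate as in the answer; /@href pulls out the attribute values directly
urls = tree.xpath('//td//a[contains(@href, "https://cuetracker.net/tournaments")]/@href')
# Only the tournament link matches; the season link is filtered out
```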