I’m trying to scrape football scores from 8 pages online. For some reason my code scrapes the results from the first page twice, then scrapes the next 6 pages as it should, and leaves out the final page.
Here is my code:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
import time
import requests
import numpy as np

chrome_options = Options()
chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

scores = []
for i in range(1, 9, 1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    time.sleep(5)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    main_table = soup.find('table', class_='table-main')
    rows_of_interest = main_table.find_all('tr', class_=['odd deactivate', 'deactivate'])
    for row in rows_of_interest:
        score = row.find('td', class_='center bold table-odds table-score').text
        scores.append(score)
Help would be much appreciated
EDIT:
I fixed it by shifting the loop range up by 1:
for i in range(2,10,1):
I still have no idea why this works because the page numbers are 1-8
Answer
You should put a delay between driver.get(url) and soup = BeautifulSoup(driver.page_source, 'lxml') to let the new page load.
Without that, the first iteration reads the first page correctly, because the first driver.get() performs a full navigation and waits for the page to load before you scrape its content. The later URLs differ only in the hash fragment (#/page/2/ and so on), so the following driver.get() calls return immediately and JavaScript swaps the results in afterwards; in the second iteration you therefore read the content of the first page again, since the second page has not loaded yet.
With time.sleep(5) in its wrong location, all the following pages still get scraped, but with a lag of one iteration: each iteration picks up the page requested in the previous one, so the last page is never scraped. That is also why shifting the range to range(2, 10, 1) appears to fix it.
With the delay in the correct place it works correctly:
for i in range(1, 9, 1):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    driver.get(url)
    time.sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')
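If you want to avoid guessing a fixed sleep, you can replace time.sleep(5) with an explicit wait; you already import WebDriverWait. Below is a minimal sketch. It assumes the site detaches the old table-main element and renders a new one on every page change, which makes staleness_of a usable signal; if the site only updates rows in place, you would need a different wait condition (for example, watching the pagination state).

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()
chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)
wait = WebDriverWait(driver, 10)

scores = []
for i in range(1, 9):
    url = 'https://www.oddsportal.com/soccer/england/premier-league-2020-2021/results/#/page/' + str(i) + '/'
    # Remember the table currently in the DOM (empty list on the first iteration).
    old_tables = driver.find_elements(By.CSS_SELECTOR, 'table.table-main')
    driver.get(url)
    if old_tables:
        # Wait until the old table is detached, i.e. the page has re-rendered.
        wait.until(EC.staleness_of(old_tables[0]))
    # Wait until a results table is present again before scraping.
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, 'table.table-main')))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    main_table = soup.find('table', class_='table-main')
    for row in main_table.find_all('tr', class_=['odd deactivate', 'deactivate']):
        score = row.find('td', class_='center bold table-odds table-score').text
        scores.append(score)

The advantage over a fixed sleep is that a page that never loads raises a TimeoutException instead of silently handing you the previous page's content, which is much easier to debug than an off-by-one in the results.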