I’m a web-scraping novice and I am looking for pointers on what to do next, or potentially a working solution, to scrape the following webpage: https://www.capology.com/club/leicester/salaries/2019-2020/
I would like to extract the following for each row (player) of the table:
- Player Name i.e. Jamie Vardy
- Weekly Gross Base Salary (in GBP) i.e. £140,000
- Annual Gross Base Salary (in GBP) i.e. £7,280,000
- Position i.e. F
- Age i.e. 33
- Country i.e. England
The following code creates the ‘soup’ and pulls out the script tag that holds the JavaScript table of information I want:
import requests
from bs4 import BeautifulSoup
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}
url = 'https://www.capology.com/club/leicester/salaries/2019-2020/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')
script = soup.find_all('script')[11].string  # script tag at index 11 holds the table data
I can see that the text assigned to the script variable has all the information I need; however, I am struggling to extract that information as a pandas DataFrame.
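If the array embedded in that script tag happens to be valid JSON, a minimal sketch along the following lines could turn it into a DataFrame (the regex, the assumption of strict JSON, and the assumption that each object is one player row are guesses, not something verified against the page):

import re, json
import pandas as pd

# ASSUMPTION: the script text embeds the table rows as a JSON-style array of
# objects (e.g. [{"name": "Jamie Vardy", ...}, ...]); the regex below grabs the
# first such array and will need adjusting if the page uses a different shape.
match = re.search(r'\[\s*\{.*?\}\s*\]', script, re.DOTALL)
if match:
    rows = json.loads(match.group(0))  # fails if the array is not strict JSON (e.g. unquoted keys)
    df = pd.DataFrame(rows)            # one row per player
    print(df.head())

If json.loads complains, the embedded array is probably a JavaScript object literal rather than strict JSON, which is part of why the answer below reaches for selenium instead.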
I would subsequently like to set this up for pagination, to scrape each team in the ‘Big 5’ European Leagues (Premier League, Serie A, La Liga, Bundesliga, and Ligue 1) for the 17-18, 18-19, 19-20, and 20-21 (current) seasons. However, that’s the final-stage solution, and I am happy to go away and try to do that myself if that’s a time-consuming request.
A working solution would be fantastic, but even just some pointers so that I can go away and learn this myself as efficiently as possible would be great.
Thanks very much!
Answer
This is a task that is best suited for a tool like selenium, as the site uses the script to populate the page with the table after it loads, and it is not trivial to parse the values from the script source:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import urllib.parse, collections, re

d = webdriver.Chrome('/path/to/chromedriver')
d.get((url := 'https://www.capology.com/club/leicester/salaries/2019-2020/'))

# pull the league, team, and season menus for the 'Big 5' leagues from the rendered page
league_teams = d.execute_script("""
var results = [];
for (var i of Array.from(document.querySelectorAll('li.green-subheader + li')).slice(0, 5)){
    results.push({league:i.querySelector('.league-title').textContent,
        teams:Array.from(i.querySelectorAll('select:nth-of-type(1).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(1),
        years:Array.from(i.querySelectorAll('select:nth-of-type(2).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(2)})
}
return results;
""")

vals = collections.defaultdict(dict)
for i in league_teams:
    # prepend the current (2020-21) season, then keep the four most recent seasons
    for y, full_year in [[re.sub(r'\d{4}-\d{4}', '2020-2021', i['years'][0][0]), '2020-21'], *i['years']][:4]:
        for t, team in i['teams']:
            d.get(urllib.parse.urljoin(url, t) + (y1 := re.findall(r'/\d{4}-\d{4}/', y)[0][1:]))
            # read the header row and body rows from the rendered salary table
            hvals = [x.get_text(strip=True) for x in soup(d.page_source, 'html.parser').select('#table thead tr:nth-of-type(3) th')]
            tvals = soup(d.page_source, 'html.parser').select('#table tbody tr')
            full_table = [dict(zip(hvals, [j.get_text(strip=True) for j in k.select('td')])) for k in tvals]
            if team not in vals[i['league']]:
                vals[i['league']][team] = {full_year: None}
            vals[i['league']][team][full_year] = full_table
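Since the question ultimately wants a pandas DataFrame, one possible follow-up (a sketch that assumes the scrape above completed and that vals has the nested league -> team -> season structure built by the loop) is to flatten the result into a single frame:

import pandas as pd

# Flatten {league: {team: {season: [row dicts]}}} into one DataFrame,
# tagging every player row with its league, team, and season.
records = []
for league, teams in vals.items():
    for team, seasons in teams.items():
        for season, rows in seasons.items():
            for row in rows or []:  # rows may be None if a page failed to load
                records.append({'League': league, 'Team': team, 'Season': season, **row})

df = pd.DataFrame(records)
print(df.head())

From there the salary columns can be cleaned up (stripping the £ signs and thousands separators) and filtered down to the fields listed in the question.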