Extracting the required information from a script tag of a scraped webpage using BeautifulSoup

I’m a web-scraping novice and I am looking for pointers on what to do next, or potentially a working solution, for scraping the following webpage: https://www.capology.com/club/leicester/salaries/2019-2020/

I would like to extract the following for each row (player) of the table:

  • Player Name i.e. Jamie Vardy
  • Weekly Gross Base Salary (in GBP) i.e. £140,000
  • Annual Gross Base Salary (in GBP) i.e. £7,280,000
  • Position i.e. F
  • Age i.e. 33
  • Country i.e. England

The following code creates the ‘soup’ for the JavaScript table of information I want:

import requests
from bs4 import BeautifulSoup
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:75.0) Gecko/20100101 Firefox/75.0'}

url = 'https://www.capology.com/club/leicester/salaries/2019-2020/'

r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.content, 'html.parser')

script = soup.find_all('script')[11].string    # the script tag at index 11 (the 12th <script> on the page)

I can see that the ‘soup’ assigned to the script variable has all the information I need; however, I am struggling to extract that information into a pandas DataFrame.
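
For illustration, this is roughly the direction I have been thinking of: pull an array-of-objects literal out of the script text and load it. This is only a minimal sketch; the regex and the assumption that the embedded literal is valid JSON are guesses and would need checking against the actual script contents:

import re
import json
import pandas as pd

# Look for the largest array-of-objects literal in the script text (a guess at
# how the table data is embedded; adjust after inspecting the script source).
match = re.search(r'(\[\s*\{.*\}\s*\])', script, re.DOTALL)
if match:
    rows = json.loads(match.group(1))   # only works if the literal is valid JSON
    df = pd.DataFrame(rows)
    print(df.head())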

I would subsequently like to set this up for pagination, to scrape each team in the ‘Big 5’ European Leagues (Premier League, Serie A, La Liga, Bundesliga, and Ligue 1), for the 17-18, 18-19, 19-20, and 20-21 (current) seasons. However, that’s the final-stage solution, and I am happy to go away and try to do that myself if that’s a time-consuming request. My rough plan for building the list of pages is sketched below.
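
Judging from the URL above, the pattern appears to be .../club/<team>/salaries/<season>/, so building the full list of pages might look something like this (the team slugs here are placeholders, not a real list; they would still need to be gathered for each league):

teams = ['leicester', 'liverpool']   # placeholder slugs
seasons = ['2017-2018', '2018-2019', '2019-2020', '2020-2021']

urls = [
    f'https://www.capology.com/club/{team}/salaries/{season}/'
    for team in teams
    for season in seasons
]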

A working solution would be fantastic but just some pointers so that I can go away and learn this stuff myself as efficiently as possible would be great.

Thanks very much!


Answer

This is a task that is best suited to a tool like Selenium, as the site uses the script to populate the page with the table after it loads, and it is not trivial to parse the values from the script source:

from selenium import webdriver
from bs4 import BeautifulSoup as soup
import urllib.parse, collections, re

d = webdriver.Chrome('/path/to/chromedriver')
d.get((url := 'https://www.capology.com/club/leicester/salaries/2019-2020/'))

# Pull each of the 'Big 5' leagues with its team and season menu options
# straight from the rendered page via JavaScript.
league_teams = d.execute_script("""
    var results = [];
    for (var i of Array.from(document.querySelectorAll('li.green-subheader + li')).slice(0, 5)){
        results.push({league:i.querySelector('.league-title').textContent,
        teams:Array.from(i.querySelectorAll('select:nth-of-type(1).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(1),
        years:Array.from(i.querySelectorAll('select:nth-of-type(2).team-menu option')).map(x => [x.getAttribute('value'), x.textContent]).slice(2)})
    }
    return results;
""")

vals = collections.defaultdict(dict)
for i in league_teams:
    # Prepend the current (2020-21) season, then keep the four most recent seasons.
    for y, full_year in [[re.sub(r'\d{4}-\d{4}', '2020-2021', i['years'][0][0]), '2020-21'], *i['years']][:4]:
        for t, team in i['teams']:
            # Build the team/season URL and load it so the table is rendered.
            d.get(urllib.parse.urljoin(url, t) + (y1 := re.findall(r'/\d{4}-\d{4}/', y)[0][1:]))
            hvals = [x.get_text(strip=True) for x in soup(d.page_source, 'html.parser').select('#table thead tr:nth-of-type(3) th')]
            tvals = soup(d.page_source, 'html.parser').select('#table tbody tr')
            full_table = [dict(zip(hvals, [j.get_text(strip=True) for j in k.select('td')])) for k in tvals]
            if team not in vals[i['league']]:
                vals[i['league']][team] = {full_year: None}
            vals[i['league']][team][full_year] = full_table
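
If a single pandas DataFrame is the end goal, the nested vals dictionary built above (league → team → season → list of row dicts keyed by the table headers) can then be flattened, for example like this (a sketch; the row columns are whatever headers the table exposes):

import pandas as pd

records = []
for league, teams in vals.items():
    for team, seasons in teams.items():
        for season, table in seasons.items():
            for row in (table or []):
                records.append({'league': league, 'team': team, 'season': season, **row})

df = pd.DataFrame(records)
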
User contributions licensed under: CC BY-SA