Hi, I’m doing some web scraping of NBA data in Python on this page. Some elements of basketball-reference are easy to scrape, but this one is giving me trouble because of my limited Python knowledge.
I’m able to grab the data and column headers I want, but I end up with 2 lists of data that I need to combine by index (I think?), so that index 0 of player_injury_info lines up with index 0 of player_names, and so on. I don’t know how to do that.
Below I’ve pasted some code so you can follow along.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# 2 lists - player_injury_info and player_names. they need to be combined.
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')] for i in range(len(rows))]
player_injury_info = player_injury_info[1:]  # removing first element bc dont need it
player_names = [[th.getText() for th in rows[i].findAll('th')] for i in range(len(rows))]
player_names = player_names[1:]  # removing first element bc dont need it

### joining the lists in the correct order- the part i dont know how to do
player_list = player_names.append(player_injury_info)

### this should give me the data frame i want if i can get player_injury_info into the right format.
injury_data = pd.DataFrame(player_injury_info, columns = headers)
There might be an easier way to scrape the data into a single list or data frame? Or maybe it’s fine to just join the 2 lists together like I’m trying to do. Either way, if anybody was able to follow along and can offer a solution, I’d appreciate the help!
Answer
Let pandas parse the table for you. pd.read_html returns a list of DataFrames, one per table it finds on the page, so [0] selects the first one.
import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]
Output:
print(injury_data)
              Player  ...  Description
0     Onyeka Okongwu  ...  Out (Shoulder) - The Hawks announced that Okon...
1       Jaylen Brown  ...  Out (Wrist) - The Celtics announced that Brown...
2         Coby White  ...  Out (Shoulder) - The Bulls announced that Whit...
3     Taurean Prince  ...  Out (Ankle) - The Cavaliers announced F Taurea...
4       Jamal Murray  ...  Out (Knee) - Murray is recovering from a torn ...
5      Klay Thompson  ...  Out (Right Achilles) - Thompson is on track to...
6      James Wiseman  ...  Out (Knee) - Wiseman is on track to be ready b...
7        T.J. Warren  ...  Out (Foot) - Warren underwent foot surgery and...
8        Serge Ibaka  ...  Out (Back) - The Clippers announced Serge Ibak...
9      Kawhi Leonard  ...  Out (Knee) - The Clippers announced Kawhi Leon...
10    Victor Oladipo  ...  Out (Knee) - Oladipo could be cleared for full...
11  Donte DiVincenzo  ...  Out (Foot) - DiVincenzo suffered a tendon inju...
12    Jarrett Culver  ...  Out (Ankle) - The Timberwolves announced Culve...
13    Markelle Fultz  ...  Out (Knee) - Fultz will miss the rest of the s...
14    Jonathan Isaac  ...  Out (Knee) - Isaac is making progress with his...
15       Dario Šarić  ...  Out (Knee) - The Suns announced that Sario has...
16      Zach Collins  ...  Out (Ankle) - The Blazers announced that Colli...
17     Pascal Siakam  ...  Out (Shoulder) - The Raptors announced Pascal ...
18       Deni Avdija  ...  Out (Leg) - The Wizards announced that Avdija ...
19     Thomas Bryant  ...  Out (Left knee) - The Wizards announced that B...

[20 rows x 4 columns]
But if you were to iterate over it yourself, I’d simply get the rows (<tr> tags), then get the player name from the <a> tag, and combine it with that row’s <td> tags. Then create your dataframe from the list of those:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
# naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')

# column headers come from the <th> cells of the first row
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# skip the header row, then build one list per player row
trs = soup.findAll('tr')[1:]
rows = []
for tr in trs:
    player_name = tr.find('a').text
    data = [player_name] + [x.text for x in tr.find_all('td')]
    rows.append(data)

injury_data = pd.DataFrame(rows, columns = headers)
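If you would rather keep the two lists from the question, they can also be joined row by row with zip. Note that list.append modifies the list in place and returns None, which is why player_list in the question ends up as None. The following is only a minimal sketch: the helper name combine_rows is hypothetical, and the sample values are made-up placeholders shaped like the question's lists (each entry of player_names is a one-element list, each entry of player_injury_info holds the remaining three cells).

```python
# Placeholder data shaped like the question's two lists (values are made up,
# not scraped from the site).
player_names = [["Player A"], ["Player B"]]
player_injury_info = [
    ["Team A", "Mon, Jan 1", "Out (Knee) - placeholder note"],
    ["Team B", "Tue, Jan 2", "Out (Ankle) - placeholder note"],
]
headers = ["Player", "Team", "Update", "Description"]


def combine_rows(names, info):
    """Join the i-th name list with the i-th info list into one row.

    zip pairs the lists by position and stops at the shorter one,
    so unmatched trailing entries are dropped rather than raising.
    """
    return [name + details for name, details in zip(names, info)]


rows = combine_rows(player_names, player_injury_info)
for row in rows:
    print(row)
```

The resulting rows each have four items, so they should be usable as pd.DataFrame(rows, columns=headers), the same way the loop version above builds its dataframe.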