
Appending elements of a list into a multi-dimensional list

Hi, I’m doing some web scraping of NBA data in Python from the basketball-reference injuries page (the URL is in the code below). Some parts of basketball-reference are easy to scrape, but this one is giving me trouble because of my limited Python knowledge.

I’m able to grab the data and column headers I want, but I end up with 2 lists of data that I think I need to combine by index, so that index 0 of player_injury_info lines up with index 0 of player_names and so on. I don’t know how to do that.

I’ve pasted my code below so you can follow along.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime, timezone, timedelta

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
soup = BeautifulSoup(html)

# this correctly gives me the 4 column headers i want (Player, Team, Update, Description)
headers = [th.getText() for th in soup.findAll('tr', limit=2)[0].findAll('th')]

# 2 lists - player_injury_info and player_names.  they need to be combined.
rows = soup.findAll('tr')
player_injury_info = [[td.getText() for td in rows[i].findAll('td')]
            for i in range(len(rows))]
player_injury_info = player_injury_info[1:] # removing first element bc dont need it

player_names = [[th.getText() for th in rows[i].findAll('th')]
            for i in range(len(rows))]
player_names = player_names[1:]             # removing first element bc dont need it

### joining the lists in the correct order- the part i dont know how to do
player_list = player_names.append(player_injury_info)

### this should give me the data frame i want if i can get player_injury_info into the right format.
injury_data = pd.DataFrame(player_injury_info, columns = headers)

Is there an easier way to scrape the data straight into one list or data frame? Or is it fine to just join the 2 lists together like I’m trying to do? If anybody was able to follow along and can offer a solution, I’d appreciate the help!


Answer

Let pandas parse the table for you.

import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
injury_data = pd.read_html(url)[0]

Output:

print(injury_data)
              Player  ...                                        Description
0     Onyeka Okongwu  ...  Out (Shoulder) - The Hawks announced that Okon...
1       Jaylen Brown  ...  Out (Wrist) - The Celtics announced that Brown...
2         Coby White  ...  Out (Shoulder) - The Bulls announced that Whit...
3     Taurean Prince  ...  Out (Ankle) - The Cavaliers announced F Taurea...
4       Jamal Murray  ...  Out (Knee) - Murray is recovering from a torn ...
5      Klay Thompson  ...  Out (Right Achilles) - Thompson is on track to...
6      James Wiseman  ...  Out (Knee) - Wiseman is on track to be ready b...
7        T.J. Warren  ...  Out (Foot) - Warren underwent foot surgery and...
8        Serge Ibaka  ...  Out (Back) - The Clippers announced Serge Ibak...
9      Kawhi Leonard  ...  Out (Knee) - The Clippers announced Kawhi Leon...
10    Victor Oladipo  ...  Out (Knee) - Oladipo could be cleared for full...
11  Donte DiVincenzo  ...  Out (Foot) - DiVincenzo suffered a tendon inju...
12    Jarrett Culver  ...  Out (Ankle) - The Timberwolves announced Culve...
13    Markelle Fultz  ...  Out (Knee) - Fultz will miss the rest of the s...
14    Jonathan Isaac  ...  Out (Knee) - Isaac is making progress with his...
15       Dario Šarić  ...  Out (Knee) - The Suns announced that Sario has...
16      Zach Collins  ...  Out (Ankle) - The Blazers announced that Colli...
17     Pascal Siakam  ...  Out (Shoulder) - The Raptors announced Pascal ...
18       Deni Avdija  ...  Out (Leg) - The Wizards announced that Avdija ...
19     Thomas Bryant  ...  Out (Left knee) - The Wizards announced that B...

[20 rows x 4 columns]
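
If you want to see what read_html does without hitting the site, here is a minimal sketch on a made-up table. The HTML, the player and team names, and the variable names `tables` and `sample` are all assumptions for illustration, not the real page. Note that read_html needs an HTML parser library installed (lxml, or bs4 together with html5lib).

```python
from io import StringIO

import pandas as pd

# Made-up table in roughly the same shape as the injuries page (not real data)
html = """
<table>
  <thead>
    <tr><th>Player</th><th>Team</th><th>Update</th><th>Description</th></tr>
  </thead>
  <tbody>
    <tr><th><a href="#">Player A</a></th><td>Team X</td><td>Mon, Jan 1</td><td>Out (Knee)</td></tr>
    <tr><th><a href="#">Player B</a></th><td>Team Y</td><td>Tue, Jan 2</td><td>Out (Ankle)</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(html))
sample = tables[0]

# to_string() prints every column instead of collapsing the middle ones to "..."
print(sample.to_string())
```

The [0] in the one-liner above is just picking the first table out of that list, and the name in the row's <th> cell lands in the Player column without any extra work.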

But if you want to iterate over it yourself, I’d simply get the rows (<tr> tags), take the player name from each row’s <a> tag, and combine it with that row’s <td> cells. Then create your dataframe from the list of those rows:

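from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.basketball-reference.com/friv/injuries.fcgi"
html = urlopen(url)
# name the parser explicitly so bs4 doesn't warn or pick a different one per machine
soup = BeautifulSoup(html, "html.parser")

# the first <tr> holds the 4 column headers (Player, Team, Update, Description)
headers = [th.get_text() for th in soup.find('tr').find_all('th')]

# every <tr> after the header row is one player
trs = soup.find_all('tr')[1:]
rows = []
for tr in trs:
    # the player name is the text of the first link in the row
    player_name = tr.find('a').text
    # put the name in front of the row's <td> cells (Team, Update, Description)
    data = [player_name] + [x.text for x in tr.find_all('td')]
    rows.append(data)

injury_data = pd.DataFrame(rows, columns=headers)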
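
As for the literal question of joining the two lists by index: list.append changes the list in place and returns None, so player_list ends up as None, and the call itself adds the whole second list as a single extra element rather than merging row by row. A minimal sketch with zip instead, using made-up sample rows in place of the scraped data:

```python
# Made-up sample data in the same shape as the two scraped lists (not real data)
player_names = [["Player A"], ["Player B"]]
player_injury_info = [
    ["Team X", "Mon, Jan 1", "Out (Knee)"],
    ["Team Y", "Tue, Jan 2", "Out (Ankle)"],
]

# zip pairs index 0 with index 0, index 1 with index 1, and so on;
# + concatenates each pair of inner lists into a single row
player_list = [name + info for name, info in zip(player_names, player_injury_info)]

print(player_list[0])  # ['Player A', 'Team X', 'Mon, Jan 1', 'Out (Knee)']
```

Each row then has four items, one per header, so it can go straight into pd.DataFrame(player_list, columns=headers). Keep in mind that zip stops at the end of the shorter list, so the two lists need to be the same length for nothing to be dropped.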
User contributions licensed under: CC BY-SA