I am trying to see if I can use, and only use, Pandas’ read_html function to scrape HTML tables from the following website: https://www.baseball-reference.com/teams/ATL/2021.shtml
I can fulfil my needs using Selenium/BeautifulSoup, but I want to see if I can scrape this site’s tables with pd.read_html alone.
Currently, pd.read_html returns the first two tables, but it cannot reach any table after the second.
Here is an example of a table ‘id’ that I am trying to access: ‘the40man’
And my code, which returns ‘ValueError: No tables found’:
pd.read_html("https://www.baseball-reference.com/teams/ATL/2021.shtml", attrs = {'id': 'the40man'})
The following code returns only the first two tables (ids 'team_batting' and 'team_pitching'), and nothing more:
pd.read_html("https://www.baseball-reference.com/teams/ATL/2021.shtml")
I am asking this question out of curiosity in case I’m missing something on my end. If not, this issue is likely due to pd.read_html’s limitations.
Thank you in advance for any input/pd.read_html tips!
Answer
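The Sports Reference sites, including baseball-reference.com, place some of their tables inside HTML comments. pd.read_html only parses real table elements, so a commented-out table is invisible to it, which is why you get 'No tables found'. To pull those tables out, you first need to pull out the comments. Then you can iterate through them to get the table you want: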
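import requests
from io import StringIO
from bs4 import BeautifulSoup, Comment
import pandas as pd

url = 'https://www.baseball-reference.com/teams/ATL/2021.shtml'
result = requests.get(url).text
data = BeautifulSoup(result, 'html.parser')

# The hidden tables live inside HTML comments, so collect every comment node
comments = data.find_all(string=lambda text: isinstance(text, Comment))

tables = []
for each in comments:
    if 'table' in str(each):
        try:
            # Parse the comment body as HTML and keep only the table with the wanted id
            tables.append(pd.read_html(StringIO(str(each)), attrs={'id': 'the40man'})[0])
            break
        except ValueError:
            # read_html raises ValueError when this comment has no matching table
            continue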
Output:
print(tables[0])
    Rk Uni               Name Unnamed: 3  ...      Ht   Wt           DoB  1stYr
0    1  30        Kyle Wright      us US  ...   6' 4"  215   Oct 2, 1995   2015
1    2   0      William Woods      us US  ...   6' 3"  190  Dec 29, 1998   2018
2    3  51         Will Smith      us US  ...   6' 5"  255  Jul 10, 1989   2008
3    4  68       Tyler Matzek      us US  ...   6' 3"  230  Oct 19, 1990   2010
4    5  64    Tucker Davidson      us US  ...   6' 2"  215  Mar 25, 1996   2016
5    6  62    Touki Toussaint      us US  ...   6' 3"  215  Jun 20, 1996   2014
6    7  65    Spencer Strider      us US  ...   6' 0"  195  Oct 28, 1998   2018
7    8  15       Sean Newcomb      us US  ...   6' 5"  255  Jun 12, 1993   2012
8    9  40        Mike Soroka      ca CA  ...   6' 5"  225   Aug 4, 1997   2015
9   10  54          Max Fried      us US  ...   6' 4"  190  Jan 18, 1994   2012
10  11  77       Luke Jackson      us US  ...   6' 2"  210  Aug 24, 1991   2011
11  12  33        A.J. Minter      us US  ...   6' 0"  215   Sep 2, 1993   2013
12  13   0        Kirby Yates      us US  ...  5' 10"  205  Mar 25, 1987   2009
13  14   0        Jay Jackson      us US  ...   6' 1"  195  Oct 27, 1987   2008
14  15  71         Jacob Webb      us US  ...   6' 2"  210  Aug 15, 1993   2014
15  16  19       Huascar Ynoa      do DO  ...   6' 2"  220  May 28, 1998   2015
16  17  36       Ian Anderson      us US  ...   6' 3"  170   May 2, 1998   2016
17  18   0      Freddy Tarnok      us US  ...   6' 3"  185  Nov 24, 1998   2017
18  19  74          Dylan Lee      us US  ...   6' 3"  214   Aug 1, 1994   2015
19  20   0        Alan Rangel      mx MX  ...   6' 2"  170  Aug 21, 1997   2015
20  21   0      Brooks Wilson      us US  ...   6' 2"  205  Mar 15, 1996   2015
21  22  50     Charlie Morton      us US  ...   6' 5"  215  Nov 12, 1983   2002
22  23  14        Adam Duvall      us US  ...   6' 1"  215   Sep 4, 1988   2010
23  24  24  William Contreras      ve VE  ...   6' 0"  180  Dec 24, 1997   2015
24  25  27       Austin Riley      us US  ...   6' 3"  240   Apr 2, 1997   2015
25  26  16    Travis d'Arnaud      us US  ...   6' 2"  210  Feb 10, 1989   2007
26  27   0   Travis Demeritte      us US  ...   6' 0"  180  Sep 30, 1994   2013
27  28   0     Chadwick Tromp      aw AW  ...   5' 8"  221  Mar 21, 1995   2013
28  29  25     Cristian Pache      do DO  ...   6' 2"  215  Nov 19, 1998   2016
29  30  13   Ronald Acuna Jr.      ve VE  ...   6' 0"  205  Dec 18, 1997   2015
30  31   1       Ozzie Albies      cw CW  ...   5' 8"  165   Jan 7, 1997   2014
31  32   9      Orlando Arcia      ve VE  ...   6' 0"  187   Aug 4, 1994   2011
32  33   7     Dansby Swanson      us US  ...   6' 1"  190  Feb 11, 1994   2013
33  34   0        Drew Waters      us US  ...   6' 2"  185  Dec 30, 1998   2017
34  35  20      Marcell Ozuna      do DO  ...   6' 1"  225  Nov 12, 1990   2008
35  36   0         Manny Pina      ve VE  ...   6' 0"  222   Jun 5, 1987   2005
36  37  38  Guillermo Heredia      cu CU  ...  5' 10"  195  Jan 31, 1991   2009
37  38  66        Kyle Muller      us US  ...   6' 7"  250   Oct 7, 1997   2016
38  Rk Uni               Name        NaN  ...      Ht   Wt           DoB  1stYr

[39 rows x 14 columns]
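If you would rather stay as close to plain pd.read_html as possible, here is a rough sketch of the same idea without BeautifulSoup: pull the comment bodies out with a regular expression and hand each one to pd.read_html. The names COMMENT_RE, commented_table_html and read_commented_table are made up for this example, and a regex is not a real HTML parser, so treat this as an approximation that assumes the page's comments are well formed.

```python
import re
from io import StringIO

# Matches the body of every HTML comment, including multi-line ones.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def commented_table_html(html):
    """Return the bodies of HTML comments that contain a <table> tag."""
    return [body for body in COMMENT_RE.findall(html) if "<table" in body]


def read_commented_table(html, table_id):
    """Return the commented-out table with the given id as a DataFrame."""
    import pandas as pd  # imported lazily so the helper above has no dependencies

    for body in commented_table_html(html):
        try:
            # Same read_html call as in the question, applied to the comment body
            return pd.read_html(StringIO(body), attrs={"id": table_id})[0]
        except ValueError:
            # No table with this id inside this comment; try the next one.
            continue
    raise ValueError(f"No commented-out table with id {table_id!r}")
```

With the page source in hand (for example from requests.get(url).text), read_commented_table(html, 'the40man') should return the same DataFrame as above. pd.read_html still needs one of its parser backends (lxml, or bs4 with html5lib) to be installed.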