How can I sift through various ‘a’ tags when scraping a website?

I’m trying to scrape athletic.net, a site that stores track and field times, to get a list for a given athlete of each season, each event that they ran, and every time they got for each event.

So far I have printed the season title and the name of each event. I’m now trying to sift through a sea of a tags to find the times. I’ve tried using find_next('a') and find_next_sibling('a') but am struggling to isolate the times.

for heading in soup.find_all('h5'):
    # write season titles and event names neatly
    if "Season" in str(heading):
        text_file.write('\n\n' + str(heading.contents[0]) + '\n')
    else:
        text_file.write(str(heading.contents[0]) + '\n')

        # write every sibling that follows the event heading
        for sibling in heading.find_next_siblings():
            text_file.write(str(sibling) + '\n')

So far all I can do is print all of the siblings, which contain all of the times. For example:

<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(174827223)" 
ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance &amp; Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>

This output has all of the times for one event for this athlete in their most recent season.

How can I sift through to isolate only the times when there are various a tags that don’t contain times?

If I use find_next_sibling('a') it only prints None.

Answer

The question needs some improvement and focus, and it should provide the expected output; as written it is not quite clear.

How can I sift through to isolate only the times when there are various ‘a’ tags that don’t contain times?

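You could use CSS selectors to get all of the <a> tags that hold a time. The attribute selector [href^="/result"] matches only links whose href starts with /result, so the edit, PR and meet links are skipped: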

soup.select('tr [href^="/result"]')

or, to be more specific:

soup.select('tr td:nth-of-type(2) [href^="/result"]')

Example

from bs4 import BeautifulSoup

html = '''<table class="table table-sm table-responsive table-hover"><tbody><tr id="rID_162222827"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(162222827)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>9 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td><td class="text-nowrap" style="width: 60px;">Mar 4</td><td><a href="meet/443782#53587">Sunset Invitational</a></td><td class="text-muted text-right text-nowrap">O F</td></tr><tr id="rID_164098252"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164098252)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>60 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8" rel="nofollow"><small class="text-muted pr-text" style="font-weight:normal; margin-left: 4px;" uib-tooltip="Personal Record">PR</small></a></td><td class="text-nowrap" style="width: 60px;">Mar 19</td><td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_164212389"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(164212389)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>3 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/16ik6rkINSN6VJEHy">2:18.54</a></td><td class="text-nowrap" style="width: 60px;">Mar 26</td><td><a href="meet/459101#53587">PSAL League Meet #1</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_174827223"><td class="text-nowrap" style="width:35px;"><a href="" 
ng-click="appC.edit(174827223)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>26 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/gBivaaKIVSZRLp2UA">2:10.58</a></td><td class="text-nowrap" style="width: 60px;">Apr 9</td><td><a href="meet/443768#53587">Cupertino/De Anza Invite</a></td><td class="text-muted text-right text-nowrap">V F</td></tr><tr id="rID_168470829"><td class="text-nowrap" style="width:35px;"><a href="" ng-click="appC.edit(168470829)" ng-if="appC.params.edit=='mark' &amp;&amp; appC.params.canEdit"><i class="far fa-pencil text-primary mr-2"></i></a><i>50 </i></td><td style="width:110px;"><span class="mRight5 m-1"><i class="far fa-fw"></i></span><a href="/result/vOi3ydru3SxDoBWso">2:13.20</a></td><td class="text-nowrap" style="width: 60px;">Apr 16</td><td><a href="meet/445132#53587">Granada Distance &amp; Sprint Festival</a></td><td class="text-muted text-right text-nowrap">O F</td></tr></tbody></table>'''

soup = BeautifulSoup(html, 'html.parser')

print([t.text for t in soup.select('tr td:nth-of-type(2) [href^="/result"]')])

Output

['2:10.97', '2:05.56', '2:18.54', '2:10.58', '2:13.20']
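
To tie this back to the loop over the h5 headings, here is a rough sketch of one way to group the times by event. It assumes each event's h5 is directly followed (as a sibling) by its results table, as the question's output suggests; the real page layout may differ. The heading texts "Example Season" and "Example Event" are placeholders, and the markup is a trimmed copy of two rows from the question. It also shows why find_next_sibling('a') gives None: the <a> tags are nested inside the sibling table, they are not siblings of the h5 itself.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical markup: heading texts are placeholders and the
# h5 -> table sibling layout is an assumption about the real page.
html = '''
<div>
  <h5>Example Season</h5>
  <h5>Example Event</h5>
  <table><tbody>
    <tr><td><i>9 </i></td>
        <td><a href="/result/AWi088nH1S1rZxdSN">2:10.97</a></td>
        <td>Mar 4</td>
        <td><a href="meet/443782#53587">Sunset Invitational</a></td></tr>
    <tr><td><i>60 </i></td>
        <td><a href="/result/R3iEDYqsnSQ48l0h8">2:05.56</a><a href="/post/R3iEDYqsnSQ48l0h8"><small>PR</small></a></td>
        <td>Mar 19</td>
        <td><a href="meet/441280#53587">Dublin Distance Fiesta</a></td></tr>
  </tbody></table>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

times_by_event = {}
for heading in soup.find_all('h5'):
    title = heading.get_text(strip=True)
    if 'Season' in title:
        continue  # season titles carry no results table in this sketch

    # The links live inside the sibling <table>, so look for the table first
    table = heading.find_next_sibling('table')
    if table is None:
        continue

    # Keep only the links whose href starts with /result (the times)
    times_by_event[title] = [
        a.get_text(strip=True) for a in table.select('a[href^="/result"]')
    ]

print(times_by_event)
# {'Example Event': ['2:10.97', '2:05.56']}
```

Selecting inside the table that belongs to each heading keeps the times attached to the right event, instead of collecting every time on the page into one flat list.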
User contributions licensed under: CC BY-SA