Following links in Python

I have to write a program that will read the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name found.

I am supposed to find the link at position 18 (the first name is 1), follow that link and repeat this process 7 times. The answer is the last name that I retrieve.

Here is the code I found and it works just fine.

import urllib

from BeautifulSoup import *

url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))

names = []

while count > 0:
    print "retrieving: {0}".format(url)
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page)
    tag = soup('a')
    name = tag[position-1].string
    names.append(name)
    url = tag[position-1]['href']
    count -= 1

print names[-1]

I would really appreciate it if someone could explain to me, as you would to a 10-year-old, what's going on inside the while loop. I am new to Python and would be grateful for the guidance.

Answer

while count > 0:                         # because of `count -= 1` below,
                                         # the loop runs `count` times

    print "retrieving: {0}".format(url)  # just prints the address of the
                                         # web page you are about to fetch

    page = urllib.urlopen(url)           # URLs reference web pages (well,
                                         # many types of web content, but
                                         # we'll stick with web pages);
                                         # this opens the page at `url`

    soup = BeautifulSoup(page)           # web pages are frequently written
                                         # in HTML, which can be messy. This
                                         # package "unmessifies" it into
                                         # something you can search

    tag = soup('a')                      # in HTML you can highlight text and
                                         # reference other web pages with <a>
                                         # tags. This gets all of the <a>
                                         # tags on the page as a list

    name = tag[position-1].string        # lists count from 0, so the link at
                                         # position 18 sits at index 17. This
                                         # gets that <a> tag and then its
                                         # text value (the name)

    names.append(name)                   # this puts that value in your own
                                         # list

    url = tag[position-1]['href']        # HTML tags can have attributes. On
                                         # an <a> tag, the href="something"
                                         # attribute references another web
                                         # page. You store it in `url` so that
                                         # it's the page you grab on the next
                                         # iteration of the loop

    count -= 1                           # one link followed; count down so
                                         # the loop stops when it reaches 0
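
If you want to watch the loop work without going online, here is a minimal Python 3 sketch of the same idea. It is not the code from the question: it swaps BeautifulSoup for the standard library's `html.parser`, and the names `AnchorCollector`, `fetch_html`, `follow_links` and `FAKE_PAGES` (plus the `example.test` addresses) are made up for illustration. The `fetch` parameter is an assumption added so the loop can be fed pages from a dictionary instead of the network.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collect an (href, text) pair for every <a> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.anchors = []       # finished (href, text) pairs
        self._in_anchor = False
        self._href = None       # href of the <a> currently open
        self._text = []         # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def fetch_html(url):
    """Download a page and return its HTML as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, position, count, fetch=fetch_html):
    """Follow the link at `position` (1-based) `count` times; return the last name."""
    name = None
    for _ in range(count):                         # same job as `while count > 0`
        parser = AnchorCollector()
        parser.feed(fetch(url))                    # get the page, find its <a> tags
        href, name = parser.anchors[position - 1]  # position 1 is index 0
        url = urljoin(url, href)                   # page to visit on the next pass
    return name


# Hypothetical in-memory "web" so the example runs offline.
FAKE_PAGES = {
    "http://example.test/a.html": '<a href="b.html">Ann</a><a href="c.html">Bob</a>',
    "http://example.test/b.html": '<a href="c.html">Cid</a><a href="a.html">Ann</a>',
    "http://example.test/c.html": '<a href="a.html">Cid</a><a href="b.html">Dee</a>',
}

print(follow_links("http://example.test/a.html", position=2, count=2,
                   fetch=FAKE_PAGES.__getitem__))  # prints: Dee
```

The first pass picks the second link on `a.html` (Bob, pointing at `c.html`), and the second pass picks the second link on `c.html` (Dee). For the real assignment you would leave out the `fetch` argument and pass the starting URL with `position=18` and `count=7`.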
User contributions licensed under: CC BY-SA