I have to write a program that will read the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extract the href= values from the anchor tags, scan for a tag that is in a particular position relative to the first name in the list, follow that link, repeat the process a number of times, and report the last name found.
I am supposed to find the link at position 18 (the first name is 1), follow that link and repeat this process 7 times. The answer is the last name that I retrieve.
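The process described above can be sketched as a small pure function: fetch a page, take the link at the given 1-based position, make that link's target the next URL, and repeat. The sketch below is self-contained (Python 3): the "web" is a made-up in-memory dict, and links are pulled with a simple regex that is only good enough for these uniform pages; a real page would need a proper HTML parser.

```python
import re

# Matches <a href="...">text</a>; sufficient for the simple made-up pages here.
LINK_RE = re.compile(r'<a href="([^"]+)">([^<]+)</a>')

def follow(pages, start, position, count):
    """Follow the link at `position` (1-based) `count` times; return the last name."""
    url, name = start, None
    for _ in range(count):
        links = LINK_RE.findall(pages[url])   # list of (href, text) pairs
        url, name = links[position - 1]       # position is 1-based, lists are 0-based
    return name

# A tiny made-up "web": each page lists two friends.
pages = {
    "a.html": '<a href="b.html">Bea</a> <a href="c.html">Cal</a>',
    "c.html": '<a href="d.html">Dee</a> <a href="e.html">Eli</a>',
    "e.html": '<a href="a.html">Ann</a> <a href="f.html">Fay</a>',
}

print(follow(pages, "a.html", 2, 3))   # a.html -> c.html -> e.html -> Fay
```

With position 18 and count 7 against the real pages, the same loop produces the assignment's answer.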
Here is the code I found and it works just fine.
import urllib
from BeautifulSoup import *

url = raw_input("Enter URL: ")
count = int(raw_input("Enter count: "))
position = int(raw_input("Enter position: "))

names = []
while count > 0:
    print "retrieving: {0}".format(url)
    page = urllib.urlopen(url)
    soup = BeautifulSoup(page)
    tag = soup('a')
    name = tag[position-1].string
    names.append(name)
    url = tag[position-1]['href']
    count -= 1

print names[-1]
I would really appreciate it if someone could explain to me, like you would to a 10-year-old, what's going on inside the while loop. I am new to Python and would really appreciate the guidance.
Answer
while count > 0:                     # because of `count -= 1` below,
                                     # the loop runs `count` times
    print "retrieving: {0}".format(url)
                                     # just prints out the next web page
                                     # you are going to get
    page = urllib.urlopen(url)       # urls reference web pages (well,
                                     # many types of web content, but
                                     # we'll stick with web pages)
    soup = BeautifulSoup(page)       # web pages are frequently written
                                     # in html, which can be messy. this
                                     # package "unmessifies" it
    tag = soup('a')                  # in html you can highlight text and
                                     # reference other web pages with <a>
                                     # tags. this gets all of the <a> tags
                                     # in a list
    name = tag[position-1].string    # this gets the <a> tag at position-1
                                     # and then gets its text value
    names.append(name)               # this puts that value in your own
                                     # list
    url = tag[position-1]['href']    # html tags can have attributes. on
                                     # an <a> tag, the href="something"
                                     # attribute references another web
                                     # page. you store it in `url` so that
                                     # it's the page you grab on the next
                                     # iteration of the loop
    count -= 1                       # counts down so the loop eventually stops
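One detail worth noting: the code above is Python 2 (print statements, raw_input, urllib.urlopen, the old BeautifulSoup 3 import). To see the key step, `tag[position-1]`, in isolation under Python 3, here is a self-contained sketch that uses only the standard library's html.parser instead of BeautifulSoup; the HTML snippet and the file names in it are made up for illustration.

```python
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects [href, text] pairs for every <a> tag, in document order."""
    def __init__(self):
        super().__init__()
        self.links = []          # list of [href, text] pairs
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.append([dict(attrs).get("href", ""), ""])
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor and self.links:
            self.links[-1][1] += data   # accumulate the anchor's text

# A made-up page with three links, standing in for one known_by_*.html page.
html = (
    '<p><a href="known_by_Ann.html">Ann</a>'
    '<a href="known_by_Bob.html">Bob</a>'
    '<a href="known_by_Cid.html">Cid</a></p>'
)

parser = AnchorCollector()
parser.feed(html)

position = 2                          # "position 2" means the second link
href, name = parser.links[position - 1]
print(name)                           # Bob
print(href)                           # known_by_Bob.html
```

Indexing with `position - 1` is the same trick as in the original code: positions in the assignment are counted from 1, while Python lists are counted from 0.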