I have to write a program that reads the HTML from this link (http://python-data.dr-chuck.net/known_by_Maira.html), extracts the href= values from the anchor tags, scans for the tag at a particular position relative to the first name in the list, follows that link, repeats the process a number of times, and reports the last name found.
I am supposed to find the link at position 18 (the first name is 1), follow that link, and repeat this process 7 times. The answer is the last name that I retrieve.
Here is the code I found, and it works just fine.
    import urllib

    from BeautifulSoup import *

    url = raw_input("Enter URL: ")
    count = int(raw_input("Enter count: "))
    position = int(raw_input("Enter position: "))

    names = []

    while count > 0:
        print "retrieving: {0}".format(url)
        page = urllib.urlopen(url)
        soup = BeautifulSoup(page)
        tag = soup('a')
        name = tag[position-1].string
        names.append(name)
        url = tag[position-1]['href']
        count -= 1

    print names[-1]
I would really appreciate it if someone could explain, as you would to a 10-year-old, what’s going on inside the while loop. I am new to Python and would be grateful for the guidance.
Answer
    while count > 0:                           # because of `count -= 1` below,
                                               # the loop will run `count` times

        print "retrieving: {0}".format(url)    # just prints out the next web page
                                               # you are going to get

        page = urllib.urlopen(url)             # urls reference web pages (well,
                                               # many types of web content, but
                                               # we'll stick with web pages)

        soup = BeautifulSoup(page)             # web pages are frequently written
                                               # in html, which can be messy. this
                                               # package "unmessifies" it

        tag = soup('a')                        # in html you can highlight text and
                                               # reference other web pages with <a>
                                               # tags. this gets all of the <a> tags
                                               # in a list

        name = tag[position-1].string          # this gets the <a> tag at position-1
                                               # (lists count from 0, so position 18
                                               # is index 17) and then gets its text

        names.append(name)                     # this puts that value in your own
                                               # list

        url = tag[position-1]['href']          # html tags can have attributes. On
                                               # an <a> tag, the href="something"
                                               # attribute references another web
                                               # page. You store it in `url` so that
                                               # it's the page you grab on the next
                                               # iteration of the loop

        count -= 1                             # one fewer pass left; the loop
                                               # stops when count reaches 0
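For anyone running this on Python 3, where `raw_input`, `urllib.urlopen`, and the `print` statement no longer exist, the same loop can be sketched roughly as follows. This is only a sketch under some assumptions: it swaps BeautifulSoup for the standard library's `html.parser` so that it has no third-party dependency, it assumes the pages are UTF-8 and that every `href` is an absolute URL (as the original code also assumes), and the names `AnchorCollector`, `fetch`, `follow_links`, and the `fetch_page` parameter are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class AnchorCollector(HTMLParser):
    """Collects an (href, text) pair for every <a> tag, in page order."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # finished (href, text) pairs
        self._open = False    # True while we are inside an <a> tag
        self._href = None     # href of the <a> currently being read
        self._text = []       # pieces of text seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._open = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._open:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._open = False


def fetch(url):
    """Download a page and return its HTML as text (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(url, count, position, fetch_page=fetch):
    """Follow the link at `position` (1-based) `count` times.

    Returns the list of names collected along the way.
    """
    names = []
    for _ in range(count):                      # same job as `count -= 1`
        parser = AnchorCollector()
        parser.feed(fetch_page(url))            # read the page, gather <a> tags
        href, name = parser.anchors[position - 1]   # position 1 is index 0
        names.append(name)                      # remember the name we landed on
        url = href                              # next pass starts from this link
    return names
```

With this in place, `follow_links("http://python-data.dr-chuck.net/known_by_Maira.html", 7, 18)[-1]` would correspond to the assignment in the question, provided the page is still reachable. Passing the page-fetching function as a parameter is a design choice made here so the loop can be tried against in-memory HTML without touching the network.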