Looping through HTML & following links

I am writing code that is supposed to open a URL, find the 3rd link on the page, and repeat this process 3 times (each time with the new URL).

I wrote a loop (below), but it seems to start over with the original URL each time.

Can someone help me fix my code?

import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#blank list
l = []

#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

#loop 
for _ in range(4):
    html = urllib.request.urlopen(url).read()    #open url
    soup = BeautifulSoup(html, 'html.parser')    #parse through BeautifulSoup
    tags = soup('a')    #extract tags
    
    for tag in tags:
        url = tag.get('href', None)    #extract links from tags
        l.append(url)    #add the links to a list
        url = l[2:3]    #slice the list to extract the 3rd url
        url = ' '.join(str(e) for e in url)    #change the type to string
    print(url)

Current Output: 
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html

Desired output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html

Answer

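You need to create the empty list inside the outer loop. In your version the list l is created once, before the loop, so it keeps growing as each new page is read. The first three entries never change, which means l[2:3] always returns the 3rd link of the first page. Resetting the list on every pass makes the slice pick the 3rd link of the current page. The following code works: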

import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#the list of links is now created inside the loop below

#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

#loop
for _ in range(4):
    l = []    #start with an empty list for every page
    html = urllib.request.urlopen(url).read()    #open url
    soup = BeautifulSoup(html, 'html.parser')    #parse through BeautifulSoup
    tags = soup('a')    #extract tags
    
    for tag in tags:
        url = tag.get('href', None)    #extract links from tags
        l.append(url)    #add the links to a list
        url = l[2:3]    #slice the list to extract the 3rd url
        url = ' '.join(str(e) for e in url)    #change the type to string
    print(url)

Result in terminal:

http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
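
If you want something a bit tidier, here is a rough sketch of the same idea that indexes the 3rd link directly instead of slicing and joining inside the inner loop. To keep it dependency-free it uses the standard library's html.parser in place of BeautifulSoup, so it is a swapped-in technique, not the code from the question. The names LinkCollector, fetch_page and follow_links, and the position, hops and fetch parameters, are made up for this example. It also assumes the pages decode as UTF-8.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def fetch_page(url):
    """Download a page and return its text (assumes UTF-8 content)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def follow_links(start_url, position=3, hops=4, fetch=fetch_page):
    """Follow the link at `position` (1-based) up to `hops` times.

    Returns the list of URLs that were found, in order.
    """
    visited = []
    url = start_url
    for _ in range(hops):
        collector = LinkCollector()    # fresh collector per page, nothing carries over
        collector.feed(fetch(url))
        if len(collector.links) < position:
            break                      # this page has too few links, stop early
        # urljoin leaves absolute links alone and resolves relative ones
        url = urljoin(url, collector.links[position - 1])
        visited.append(url)
    return visited
```

Calling follow_links(url) with your starting URL and printing each item of the returned list should mirror what the loop above does. The fetch parameter is only there so the function can be tried against canned HTML without touching the network.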