I am writing code that is supposed to open a URL, find the 3rd link on the page, and repeat this process 3 times (each time with the new URL).
I wrote a loop (below), but it seems to start over with the original URL each time.
Can someone help me fix my code?
import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#blanc list
l = []

#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

#loop
for _ in range(4):
    html = urllib.request.urlopen(url).read() #open url
    soup = BeautifulSoup(html, 'html.parser') #parse through BeautifulSoup
    tags = soup('a') #extract tags
    for tag in tags:
        url = tag.get('href', None) #extract links from tags
        l.append(url) #add the links to a list
    url = l[2:3] #slice the list to extract the 3rd url
    url = ' '.join(str(e) for e in url) #change the type to string
    print(url)

Current Output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Montgomery.html

Desired output:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
Answer
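You need to create the empty list inside the loop. In your version, l is created once before the loop, so the links from every page keep accumulating in it, and l[2:3] always returns the third link of the first page. The following code works: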
import urllib.request, urllib.parse, urllib.error
from urllib.parse import urljoin
from bs4 import BeautifulSoup

#starting url
url = input('Enter URL: ')
if len(url) < 1:
    url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'

#loop
for _ in range(4):
    l = [] #blank list, reset for every page
    html = urllib.request.urlopen(url).read() #open url
    soup = BeautifulSoup(html, 'html.parser') #parse through BeautifulSoup
    tags = soup('a') #extract tags
    for tag in tags:
        url = tag.get('href', None) #extract links from tags
        l.append(url) #add the links to a list
    url = l[2:3] #slice the list to extract the 3rd url
    url = ' '.join(str(e) for e in url) #change the type to string
    print(url)
Result in terminal:
http://py4e-data.dr-chuck.net/known_by_Montgomery.html
http://py4e-data.dr-chuck.net/known_by_Mhairade.html
http://py4e-data.dr-chuck.net/known_by_Butchi.html
http://py4e-data.dr-chuck.net/known_by_Anayah.html
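
If you want to avoid this kind of bug altogether, one option is to keep the per-page state inside a function so nothing can leak from one iteration to the next. The following is only a sketch under a few assumptions: it swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages, and the names LinkCollector, fetch, follow_links and the fetch_page parameter are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page and return its HTML as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def follow_links(start_url, position=3, repeats=4, fetch_page=fetch):
    """Follow the link at `position` (1-based) up to `repeats` times.

    Returns the list of URLs that were followed. `fetch_page` is a
    hypothetical hook so the download step can be replaced, e.g. in tests.
    """
    visited = []
    url = start_url
    for _ in range(repeats):
        collector = LinkCollector()  # fresh collector, so links never carry over
        collector.feed(fetch_page(url))
        if len(collector.links) < position:
            break  # the page has too few links to continue
        # urljoin leaves absolute links unchanged and resolves relative ones
        url = urljoin(url, collector.links[position - 1])
        visited.append(url)
    return visited
```

Calling follow_links('http://py4e-data.dr-chuck.net/known_by_Fikret.html') and printing each item of the returned list should follow the same chain of third links as the loop above. Because the collector is created inside the loop, there is no shared list left to forget to reset.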