I’m reading the book Web Scraping with Python, which has the following function to retrieve the external links found on a page:
    # Retrieves a list of all external links found on a page
    def getExternalLinks(bs, excludeUrl):
        externalLinks = []
        # Finds all links that start with "http" that do
        # not contain the current URL
        for link in bs.find_all('a', {'href' : re.compile('^(http|www)((?!'+excludeUrl+').)*$')}):
            if link.attrs['href'] is not None:
                if link.attrs['href'] not in externalLinks:
                    externalLinks.append(link.attrs['href'])
        return externalLinks
The problem is that it does not work the way it should. When I run it using the URL http://www.oreilly.com, it returns this:
    bs = makeSoup('https://www.oreilly.com')  # Makes a BeautifulSoup object
    getExternalLinks(bs, 'https://www.oreilly.com')
Output:
['https://www.oreilly.com', 'https://oreilly.com/sign-in.html', 'https://oreilly.com/online-learning/try-now.html', 'https://oreilly.com/online-learning/index.html', 'https://oreilly.com/online-learning/individuals.html', 'https://oreilly.com/online-learning/teams.html', 'https://oreilly.com/online-learning/enterprise.html', 'https://oreilly.com/online-learning/government.html', 'https://oreilly.com/online-learning/academic.html', 'https://oreilly.com/online-learning/pricing.html', 'https://www.oreilly.com/partner/reseller-program.html', 'https://oreilly.com/conferences/', 'https://oreilly.com/ideas/', 'https://oreilly.com/about/approach.html', 'https://www.oreilly.com/conferences/', 'https://conferences.oreilly.com/velocity/vl-ny', 'https://conferences.oreilly.com/artificial-intelligence/ai-eu', 'https://www.safaribooksonline.com/public/free-trial/', 'https://www.safaribooksonline.com/team-setup/', 'https://www.oreilly.com/online-learning/enterprise.html', 'https://www.oreilly.com/about/approach.html', 'https://conferences.oreilly.com/software-architecture/sa-eu', 'https://conferences.oreilly.com/velocity/vl-eu', 'https://conferences.oreilly.com/software-architecture/sa-ny', 'https://conferences.oreilly.com/strata/strata-ca', 'http://shop.oreilly.com/category/customer-service.do', 'https://twitter.com/oreillymedia', 'https://www.facebook.com/OReilly/', 'https://www.linkedin.com/company/oreilly-media', 'https://www.youtube.com/user/OreillyMedia', 'https://www.oreilly.com/emails/newsletters/', 'https://itunes.apple.com/us/app/safari-to-go/id881697395', 'https://play.google.com/store/apps/details?id=com.safariflow.queue']
Question:
Why are the first 16-17 entries considered “external links”? They belong to the same domain as http://www.oreilly.com.
Answer
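The pattern in the question seems to fail for two reasons that can be checked directly. The negative lookahead is only evaluated after `(http|www)` has already consumed the first four characters, so it never sees the excluded URL at the start of a link. In addition, links such as https://oreilly.com/ideas/ lack the "www." prefix, so they do not contain the excluded string at all. A small sketch that reuses the question's pattern (the example.com URL is a made-up test value, not taken from the page):

```python
import re

exclude_url = 'https://www.oreilly.com'

# Same pattern construction as getExternalLinks in the question.
pattern = re.compile('^(http|www)((?!' + exclude_url + ').)*$')

# The lookahead is first tested after "http" has been consumed, so the
# excluded URL itself still matches and is reported as "external".
matches_self = bool(pattern.match('https://www.oreilly.com'))

# Without the "www." prefix the excluded string never occurs in the link.
matches_bare = bool(pattern.match('https://oreilly.com/ideas/'))

# The exclusion only fires when the full URL appears later in the string
# (hypothetical URL, used only to show the lookahead working).
matches_embedded = bool(
    pattern.match('http://example.com/?next=https://www.oreilly.com')
)

print(matches_self, matches_bare, matches_embedded)
```

Comparing hosts instead of matching on the whole URL string avoids both problems.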
    import re
    from urllib.parse import urlsplit
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    ext = set()  # collected external links

    def getExt(url):
        # Split the page URL so its host (netloc) can be compared, e.g. 'oreilly.com'
        o = urlsplit(url)
        html = urlopen(url)
        bs = BeautifulSoup(html, 'html.parser')
        # Only absolute http(s) links can point to another site
        for link in bs.find_all('a', href=re.compile('^((https://)|(http://))')):
            href = link.attrs['href']
            # Skip links whose URL contains the page's own host
            if o.netloc in href:
                continue
            ext.add(href)

    getExt('https://oreilly.com/')
    for i in ext:
        print(i)
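The substring test `o.netloc in href` would also skip an outside link that merely mentions the host somewhere in its path or query string. A possible refinement, sketched below, compares parsed hostnames instead. The helper names `bare_host` and `is_external` are hypothetical, and treating subdomains such as conferences.oreilly.com as internal is an assumption about what "external" should mean here. The sample links are taken from the output in the question, so no network access is needed.

```python
from urllib.parse import urlsplit


def bare_host(url):
    """Return the lowercased hostname of url without a leading 'www.'."""
    host = urlsplit(url).hostname or ''
    return host[4:] if host.startswith('www.') else host


def is_external(href, site_url):
    """Return True if href points to a host outside site_url's domain.

    Assumption: subdomains of the site count as internal.
    """
    site = bare_host(site_url)
    host = bare_host(href)
    if not host:
        return False  # relative links, mailto:, etc. have no hostname
    return host != site and not host.endswith('.' + site)


# A few links from the output shown in the question.
links = [
    'https://www.oreilly.com',
    'https://oreilly.com/sign-in.html',
    'https://conferences.oreilly.com/velocity/vl-ny',
    'http://shop.oreilly.com/category/customer-service.do',
    'https://www.safaribooksonline.com/public/free-trial/',
    'https://twitter.com/oreillymedia',
]

external = [l for l in links if is_external(l, 'https://www.oreilly.com')]
print(external)
```

With these sample links, only the safaribooksonline.com and twitter.com entries are kept. This simple suffix check does not know about public suffixes such as co.uk, so it is a sketch rather than a general solution.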