I am trying to make a simple crawler that extracts links from the "See also" section of this page: https://en.wikipedia.org/wiki/Web_scraping. That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as relative links in a list, and I still need to turn them into absolute links. The intended result would be the same list with absolute URLs.
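For the relative-to-absolute conversion, urllib.parse.urljoin (which I already import below) seems to do exactly this; a minimal sketch, where /wiki/Data_scraping is just an illustrative href:

from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Web_scraping'
relative = '/wiki/Data_scraping'  # example href as extracted by Beautiful Soup

# urljoin resolves the relative path against the page it was found on
absolute = urljoin(base, relative)
print(absolute)  # https://en.wikipedia.org/wiki/Data_scraping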
Then I wanted to use those same 19 links and extract further information from them, for example the first paragraph of each of the 19 pages. So far I have this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')
links = soup.find('div', {'class':'div-col'})
test = links.find_all('a', href=True)

data = []
for link in links.find_all('a'):
    data.append(link.get('href'))
#print(data)

soupNew = BeautifulSoup(''.join(data), 'html.parser')
print(soupNew.find_all('p')[0].text) #test if there is any <p> tag, which returns empty, so I have not looped correctly.

x = soupNew.findAll('p')
if x is not None and len(x) > 0:
    section = x[0]
    print(x)
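I suspect part of the problem is that ''.join(data) just concatenates the href strings into one piece of plain text, not HTML, so the new soup has no tags to find. A tiny check seems to confirm this (the hrefs here are made-up examples):

from bs4 import BeautifulSoup

# joining hrefs yields a plain string, not markup
soup = BeautifulSoup('/wiki/Data_scraping/wiki/Web_crawler', 'html.parser')
print(soup.find_all('p'))  # [] - there are no tags to find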
My main issue is that I simply can't find a way to loop through the 19 links and look for the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, to explain the process of doing the things above. Thanks!
Answer
You should split your code like you split your problems.
Your first problem was to get a list of URLs, so you could write a function called get_urls:
import requests
from bs4 import BeautifulSoup

def get_urls():
    url = 'https://en.wikipedia.org/wiki/Web_scraping'
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    # the "See also" links live in a div with the class div-col
    links = soup.find('div', {'class': 'div-col'})
    data = []
    for link in links.find_all('a'):
        # the hrefs are relative, so prepend the Wikipedia host
        data.append('https://en.wikipedia.org' + link.get('href'))
    return data
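A quick sanity check is to print what the function returns; per the question there should be 19 absolute links:

urls = get_urls()
print(len(urls))  # should be 19, per the question
print(urls[0])    # an absolute URL starting with https://en.wikipedia.org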
You wanted to get the first paragraph of every URL. With a little research, I came up with this one:
def get_first_paragraph(url):
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    # soup.p is the first <p> tag in the parsed document
    return soup.p.text
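One caveat: on some Wikipedia pages the first <p> element is an empty placeholder, so soup.p.text can come back blank. If that happens, a defensive variant (just a sketch of the same idea) would skip empty paragraphs:

def get_first_paragraph(url):
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    # return the first <p> that actually contains text
    for p in soup.find_all('p'):
        text = p.get_text(strip=True)
        if text:
            return text
    return ''  # no non-empty paragraph found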
Now everything has to be wired up:
def iterate_through_urls(urls):
    for url in urls:
        print(get_first_paragraph(url))

def run():
    urls = get_urls()
    iterate_through_urls(urls)
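Nothing calls run() yet, so to execute the script add the usual entry point at the bottom:

if __name__ == '__main__':
    run()  # fetches the 19 URLs and prints the first paragraph of each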