
How do I make a crawler that extracts information from relative paths?

I am trying to make a simple crawler that extracts the links from the "See also" section of this page: https://en.wikipedia.org/wiki/Web_scraping. That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as relative links in a list, and I also need to turn them into absolute links.
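
For example, turning one relative href into an absolute URL could look like this (a minimal sketch; /wiki/Data_scraping is just one illustrative href from that section, not necessarily the first one):

from urllib.parse import urljoin

base = 'https://en.wikipedia.org/wiki/Web_scraping'
href = '/wiki/Data_scraping'   # an example relative href from the "See also" section
print(urljoin(base, href))     # https://en.wikipedia.org/wiki/Data_scraping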

Then I want to use those same 19 links and extract further information from them, for example the first paragraph of each of the 19 pages. So far I have this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get(url).text

soup = BeautifulSoup(data, 'html.parser')

# the "See also" links sit in a <div class="div-col">
links = soup.find('div', {'class': 'div-col'})

data = []
for link in links.find_all('a', href=True):
    data.append(link.get('href'))
#print(data)

# my attempt: join the hrefs into one string and parse that as HTML
soupNew = BeautifulSoup(''.join(data), 'html.parser')
print(soupNew.find_all('p')[0].text)

# test if there is any <p> tag; it comes back empty, so I have not looped correctly
x = soupNew.find_all('p')
if x is not None and len(x) > 0:
    section = x[0]
print(x)

My main issue is that I simply can't find a way to loop through the 19 links and extract the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, explaining how to do the things described above. Thanks!


Answer

You should split your code like you split your problems.

  1. Your first problem is to get the list of links, so you could write a function called get_urls:

     def get_urls():
         url = 'https://en.wikipedia.org/wiki/Web_scraping'
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         # the "See also" links sit in a <div class="div-col">
         links = soup.find('div', {'class': 'div-col'})
         data = []
         for link in links.find_all('a', href=True):
             # urljoin handles relative hrefs like /wiki/Data_scraping
             data.append(urljoin(url, link.get('href')))
         return data
    
  2. You want to get the first paragraph of every URL. A little research gives this one:

     def get_first_paragraph(url):
         data = requests.get(url).text
         soup = BeautifulSoup(data, 'html.parser')
         # soup.p is the first <p> tag in the document
         return soup.p.text
    
  3. Now everything has to be wired up (a complete, runnable version of all three steps follows this list):

     def iterate_through_urls(urls):
         for url in urls:
             print(get_first_paragraph(url))
    
    
     def run():
         urls = get_urls()
         iterate_through_urls(urls)
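
For completeness, here is everything combined into one runnable sketch. Two assumptions worth flagging: it uses urljoin instead of string concatenation, and instead of soup.p it skips blank paragraphs, since on some Wikipedia pages the first <p> is an empty placeholder element:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def get_urls():
    url = 'https://en.wikipedia.org/wiki/Web_scraping'
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # the "See also" links sit in a <div class="div-col">
    links = soup.find('div', {'class': 'div-col'})
    return [urljoin(url, a['href']) for a in links.find_all('a', href=True)]

def get_first_paragraph(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # return the first <p> that actually contains text,
    # skipping empty placeholder paragraphs
    for p in soup.find_all('p'):
        if p.text.strip():
            return p.text
    return ''

def iterate_through_urls(urls):
    for url in urls:
        print(get_first_paragraph(url))

def run():
    urls = get_urls()
    iterate_through_urls(urls)

if __name__ == '__main__':
    run()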
    
User contributions licensed under: CC BY-SA