How do I make a crawler that extracts information from relative paths?

I am trying to make a simple crawler that extracts the links from the “See also” section of this page: https://en.wikipedia.org/wiki/Web_scraping. That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as relative links in a list, and I need to convert them into absolute links. The intended result would look like this:

Then I want to use those same 19 links to extract further information from each of them, for example the first paragraph of each linked page. So far I have this:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text

soup = BeautifulSoup(data, 'html.parser')

links = soup.find('div', {'class':'div-col'})
test = links.find_all('a', href=True)

data = []
for link in links.find_all('a'):
    data.append(link.get('href'))
#print(data)

soupNew = BeautifulSoup(''.join(data), 'html.parser')
print(soupNew.find_all('p')[0].text)

#test if there is any <p> tag, which returns empty, so I have not looped correctly.
x = soupNew.findAll('p')
if x is not None and len(x) > 0:
    section = x[0]
print(x)

My main issue is that I simply can't find a way to loop through the 19 links and look for the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, that explains how to do the things described above. Thanks!

Answer

You should split your code like you split your problems.

Your first problem was to get a list of absolute URLs, so you could write a function called get_urls:

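import requests
from bs4 import BeautifulSoup

def get_urls():
    url = 'https://en.wikipedia.org/wiki/Web_scraping'
    data = requests.get(url).text
    soup = BeautifulSoup(data, 'html.parser')
    # Assumes the first <div class="div-col"> on the page holds the links
    links = soup.find('div', {'class': 'div-col'})
    urls = []
    # href=True skips <a> tags without an href, which would break the concatenation
    for link in links.find_all('a', href=True):
        # The hrefs are root-relative (/wiki/...), so prefix the site root
        urls.append('https://en.wikipedia.org' + link.get('href'))
    return urls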
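
The question already imports urljoin, which could replace the string concatenation above. The following is only a minimal sketch: the helper name to_absolute and the sample hrefs are made up for illustration and are not part of the original answer. It shows that urljoin resolves root-relative, page-relative, and protocol-relative links against the page URL, and leaves absolute links unchanged.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url (hypothetical helper for illustration)."""
    return [urljoin(base_url, href) for href in hrefs]


if __name__ == '__main__':
    base = 'https://en.wikipedia.org/wiki/Web_scraping'
    # Illustrative hrefs, not taken from the live page
    sample = ['/wiki/Data_scraping', 'Data_scraping', '//example.org/x']
    for absolute in to_absolute(base, sample):
        print(absolute)
```

Inside get_urls, the same idea would be urljoin(url, link.get('href')) in place of the hard-coded prefix, so the site root does not have to be repeated.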

You also wanted the first paragraph of every URL. After a little research, I came up with this:

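def get_first_paragraph(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    # soup.p.text fails if the page has no <p> and may return a blank paragraph,
    # so take the first <p> that actually contains text
    for p in soup.find_all('p'):
        if p.get_text(strip=True):
            return p.get_text().strip()
    return ''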

Now everything has to be wired together:

def iterate_through_urls(urls):
    # Fetch and print the first paragraph of each linked page
    for url in urls:
        print(get_first_paragraph(url))


def run():
    urls = get_urls()
    iterate_through_urls(urls)


if __name__ == '__main__':
    run()
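
If you would rather keep the paragraphs than print them, the loop could collect them instead. This is a rough sketch under a few assumptions: collect_paragraphs, its fetch parameter, and the delay value are illustrative names that do not appear in the code above. The fetch function is passed in as an argument, so the loop can be tried without touching the network.

```python
import time


def collect_paragraphs(urls, fetch, delay=0.0):
    """Call fetch(url) for every URL and return a dict of url -> result.

    A URL whose fetch raises an exception is stored as None, so one bad
    page does not stop the whole crawl.
    """
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # e.g. a network error
            print(f"skipping {url}: {exc}")
            results[url] = None
        time.sleep(delay)  # pause between requests to be polite to the server
    return results


# Possible use with the functions from this answer:
# paragraphs = collect_paragraphs(get_urls(), get_first_paragraph, delay=1.0)
```

Passing the fetch function in keeps the looping logic separate from the scraping logic, which matches the idea of splitting the code along the lines of the problem.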

Answer