
How do I make a crawler that extracts information from relative paths?

I am trying to make a simple crawler that extracts the links from the "See also" section of this page: https://en.wikipedia.org/wiki/Web_scraping. That is 19 links in total, which I have managed to extract using Beautiful Soup. However, I get them as a list of relative links, which I also need to turn into absolute links. The intended result is a list of absolute URLs.

Then I want to use those same 19 links to extract further information, for example the first paragraph of each linked page. So far I have this:


My main issue is that I simply can't find a way to loop through the 19 links and look for the information I need. I am trying to learn Beautiful Soup and Python, so I would prefer to stick with those for now, even though there might be better options out there. I just need some help, or preferably a simple example, explaining how to do the things described above. Thanks!


Answer

You should split your code like you split your problems.

  1. Your first problem was getting a list of URLs, so you could write a function called get_urls.

  2. You wanted the first paragraph of every URL. With a little research I found a way to do that.

  3. Now everything has to be wired together.
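
The three steps above could be sketched roughly as follows. This is a minimal sketch and not the original code from this answer: only get_urls is a name taken from the text, while fetch_html, get_first_paragraph, crawl and the User-Agent string are made up for illustration. It assumes the "See also" heading carries the id See_also and that the links sit in the first list after it, which may change if Wikipedia's markup changes. The relative links are made absolute with urljoin from the standard library.

```python
from urllib.parse import urljoin
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/Web_scraping"


def fetch_html(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Some sites reject requests without a User-Agent; this value is a placeholder.
    request = Request(url, headers={"User-Agent": "see-also-crawler-example/0.1"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def get_urls(html, base_url):
    """Step 1: return absolute URLs for the links in the "See also" section."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id="See_also")  # assumed id of the section heading
    if heading is None:
        return []
    link_list = heading.find_next("ul")  # first list after the heading
    if link_list is None:
        return []
    # urljoin resolves a relative href such as "/wiki/Data_scraping"
    # against base_url; hrefs that are already absolute are kept as they are.
    return [urljoin(base_url, a["href"]) for a in link_list.find_all("a", href=True)]


def get_first_paragraph(html):
    """Step 2: return the text of the first non-empty <p> on the page."""
    soup = BeautifulSoup(html, "html.parser")
    for paragraph in soup.find_all("p"):
        text = paragraph.get_text().strip()
        if text:
            return text
    return ""


def crawl(start_url):
    """Step 3: wire both helpers together, one request per extracted link."""
    results = {}
    for url in get_urls(fetch_html(start_url), start_url):
        results[url] = get_first_paragraph(fetch_html(url))
    return results
```

Calling crawl(BASE_URL) should give you a dict that maps each absolute URL to the first paragraph of that page. Because get_urls and get_first_paragraph take HTML text rather than a URL, you can try each of them on a small HTML snippet before running the whole crawl.
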
User contributions licensed under: CC BY-SA