
Issue with web scraping from website for capturing pagination links

I am trying to scrape data from all the category URLs listed on the home page (done), and then from the website's sub-category pages and their pagination links as well. URL is here

I have written a Python script for this with a modular structure, because I need the output of each step (the URLs it collects) written to a separate file before moving on to the next step. Right now I am having trouble extracting all the pagination URLs, from which the data will be fetched afterwards. Also, instead of getting data from all the listed sub-category URLs, I am getting data from the first sub-category URL only.

For example, my script below only returns data from:

General Practice (main category page): http://www.medicalexpo.com/cat/general-practice-K.html, and then Stethoscope (sub-category page): http://www.medicalexpo.com/medical-manufacturer/stethoscope-2.html

I want data from all the sub-category links listed on this link.

Any help getting the desired output, i.e. the product URLs from all the listed sub-category pages, would be appreciated.

Below is the code:



Answer

You actually don’t need Selenium for this at all. The code below will fetch the categories, the sub-categories, and the link, name and description of every item on the site.
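To illustrate the "all sub-categories, not just the first" part without Selenium, here is a minimal sketch using only the standard library. It assumes sub-category links can be recognised by their URL path (`/medical-manufacturer/`, taken from the example URLs in the question); `LinkCollector` and `links_under` are hypothetical names, not something the site or any library provides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they were found on.
                self.links.append(urljoin(self.base_url, href))


def links_under(html, base_url, path_prefix):
    """Return unique links whose path starts with path_prefix, in page order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    unique = []
    for link in collector.links:
        if urlparse(link).path.startswith(path_prefix) and link not in unique:
            unique.append(link)
    return unique
```

You would then loop over every URL that `links_under(category_html, category_url, "/medical-manufacturer/")` returns, instead of taking only the first one.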

The only tricky part is the while-loop that handles the pagination. The principle is that if there’s a “next”-button present on the page, we’ll need to load more content. In this case the site actually gives us the “next”-link in the next-tag, so it’s easy to iterate through until there are no more next-links to retrieve.
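The pagination principle can be sketched roughly like this. It is only an illustration: it assumes the "next" link is exposed as a tag carrying `rel="next"` with an `href`, which you should check against the actual page source, and `NextLinkParser`, `fetch_html` and `iter_pages` are made-up names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class NextLinkParser(HTMLParser):
    """Records the href of the first tag with rel="next" (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").split()
        if self.next_href is None and "next" in rel and attrs.get("href"):
            self.next_href = attrs["href"]


def fetch_html(url):
    """Download a page and return its text (hypothetical helper)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def iter_pages(start_url, fetch=fetch_html):
    """Yield (url, html) for each page, following next-links until none remain."""
    url, seen = start_url, set()
    while url and url not in seen:  # 'seen' guards against a next-link cycle
        seen.add(url)
        html = fetch(url)
        yield url, html
        parser = NextLinkParser()
        parser.feed(html)
        # Resolve a relative href against the current page; None ends the loop.
        url = urljoin(url, parser.next_href) if parser.next_href else None
```

Calling `iter_pages(sub_category_url)` then gives you every page of that sub-category, and you can extract the product links from each `html` as it comes.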

Keep in mind, though, that this might take a while to run. You should probably also insert a sleep (e.g. 1 second) between requests in the while-loop to treat the server nicely.

Doing so would reduce your risk of getting banned or something similar.
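One way to add that pause without touching the loop body is to wrap whatever function does the download. This is just a sketch: `throttled` is a hypothetical helper, and the 1-second default is the example value from above, not a limit published by the site.

```python
import time


def throttled(fetch, delay_seconds=1.0, sleep=time.sleep):
    """Wrap fetch so that consecutive calls are separated by a pause."""
    first_call = True

    def wrapper(url):
        nonlocal first_call
        if not first_call:
            sleep(delay_seconds)  # pause before every request except the first
        first_call = False
        return fetch(url)

    return wrapper
```

Any function that takes a URL and returns the page body can be wrapped this way, e.g. `polite_get = throttled(my_get)`, and then used inside the while-loop in place of the original.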

User contributions licensed under: CC BY-SA