Issue with web scraping from website for capturing pagination links

Question

I am trying to scrape data from all listed category URL&#8217;s on Home page (Done) and further sub category pages from the website and its Pagination links as well. URL is here I have created Python script for the same to extract data in Modular structure as I need Output from all URL&#8217;s from one step t…

Accepted Answer

You actually don&#8217;t need Selenium for this at all. The code below will fetch categories, sub-categories and item-links, names and description for everything on the site.The only tricky part is the while-loop that handles the pagination. The principle is that if there&#8217;s a &#8220;next&#8221;-button present on the site, we&#8217;ll need to load more content. In this case the sites actually gives us the &#8220;next&#8221;-link in the next-tag, so its easy to iterate through until there are no more next-links to retrieve.Keep in mind tho, when you run this, that it might take a while. Keep in mind too, that you probably should insert a sleep &#8211; e.g. at 1 second &#8211; between each request in the while loop to treat the server nicely.Doing so would reduce you risk of getting banned/something similar.import requestsfrom bs4 import BeautifulSoupfrom time import sleepitems_list = [] # list of dictionaries with this content: category, sub_category, item_description, item_name, item_link r = requests.get("http://www.medicalexpo.com/")soup = BeautifulSoup(r.text, "lxml")cat_items = soup.find_all('li', class_="category-group-item")cat_items = [[cat_item.get_text().strip(),cat_item.a.get('href')] for cat_item in cat_items]# cat_items is now a list with elements like this:# ['General practice','http://www.medicalexpo.com/cat/general-practice-K.html']# to access the next level, we loop:for category, category_link in cat_items[:1]:    print("[*] Extracting data for category: {}".format(category))    r = requests.get("http://www.medicalexpo.com/cat/general-practice-K.html")    soup = BeautifulSoup(r.text, "lxml")    # data of all sub_categories are located in an element with the id 'category-group'    cat_group = soup.find('div', attrs={'id': 'category-group'})    # the data lie in 'li'-tags    li_elements = cat_group.find_all('li')    sub_links = [[li.a.get('href'), li.get_text().strip()] for li in li_elements]    # sub_links is now a list og elements like this:    # ['http://www.medicalexpo.com/medical-manufacturer/stethoscope-2.html', 'Stethoscopes']    # to access the last level we need to dig further in with a loop    for sub_category_link, sub_category in sub_links:        print("  [-] Extracting data for sub_category: {}".format(sub_category))        local_count = 0        load_page = True        item_url = sub_category_link        while load_page:            print("     [-] Extracting data for item_url: {}".format(item_url))            r = requests.get(item_url)            soup = BeautifulSoup(r.text, "lxml")            item_links = soup.find_all('div', class_="inset-caption price-container")[2:]            for item in item_links:                item_name = item.a.get_text().strip().split('n')[0]                item_link = item.a.get('href')                try:                    item_description = item.a.get_text().strip().split('n')[1]                except:                    item_description = None                item_dict = {                    "category": category,                    "subcategory": sub_category,                    "item_name": item_name,                    "item_link": item_link,                    "item_description": item_description                }                items_list.append(item_dict)                local_count +=1            # all itempages has a pagination element            # if there are more pages to load, it will have a "next"-class            # if we are on the last page, the will not be a next class and "next_link" will return None            pagination = soup.find(class_="pagination-wrapper")            try:                next_link = pagination.find(class_="next").get('href', None)            except:                next_link = None            # consider inserting a sleep(1) right about here...            # if the next_link exists it means that there are more pages to load            # we'll then set the item_url = next_link and the While-loop will continue            if next_link is not None:                item_url = next_link            else:                load_page = False        print("      [-] a total of {} item_links extracted for this sub_category".format(local_count))# this will yield a list of dicts like this one:# {'category': 'General practice',#  'item_description': 'Flac duo',#  'item_link': 'http://www.medicalexpo.com/prod/boso-bosch-sohn/product-67891-821119.html',#  'item_name': 'single-head stethoscope',#  'subcategory': 'Stethoscopes'}# If you need to export to something like excel, uses pandas. Create a DataFrame and simple load it with the list# pandas can the export the stuff to excel easily...

Advertisement

Answer