Web scraping: Index out of Bound (Possible scaling error)

Hi, I wrote a web scraping program that gets the ASN numbers correctly, but after all the data is scraped it raises an "Array Out of Bounds" error (in Python terms, a list index out of range).

I am using PyCharm and the latest Python version. Below is my code. There is already a similar question on Stack Overflow (Web Scraping List Index Out Of Range) with the exact same error, but I am not able to put the pieces together and make it work for my list.

The error seems to be at current_country = link.split('/')[2]. Any help is appreciated. Thank you.

import urllib.request
import bs4
import re
import json

url = ''
SITE = ''

def url_to_soup(url):
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries')):
        pages.append(link.get('href'))
    return pages

def scrape_pages(links):
    mappings = {}

    print("Scraping Pages for ASN Data...")

    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                name = columns[1].string
                routes_v4 = columns[3].string
                routes_v6 = columns[5].string
                mappings[current_asn] = {'Country': current_country,
                                     'Name': name,
                                     'Routes v4': routes_v4,
                                     'Routes v6': routes_v6}
    return mappings

main_page = url_to_soup(url)

country_links = find_pages(main_page)

asn_mappings = scrape_pages(country_links)



The last href that contains the string "/countries" is exactly "/countries", with nothing after it:

<li><a href="/countries">Global ASNs</a></li>

Splitting this link on '/' produces the list ["", "countries"], which has no third element, so indexing it with [2] raises the IndexError. To fix the problem, check the list length before retrieving the third element and skip links that are too short:

        current_country = link.split('/')
        if len(current_country) < 3:
            continue
        current_country = current_country[2]
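
The same length check can be pulled out into a small helper, which makes the failing case easy to see. This is only a sketch: the helper name country_from_link and the example paths such as '/countries/US' are made up for illustration and are not taken from the site being scraped.

```python
def country_from_link(link):
    """Return the third path segment of a link such as '/countries/US', or None."""
    parts = link.split('/')
    # '/countries'.split('/') gives ['', 'countries'], so index 2 does not exist.
    # '/countries/'.split('/') gives ['', 'countries', ''], so index 2 is empty.
    if len(parts) < 3 or not parts[2]:
        return None
    return parts[2]


# Hypothetical hrefs; the last one is the "Global ASNs" link that caused the error.
links = ['/countries/US', '/countries/DE', '/countries']
countries = [country_from_link(link) for link in links]
print(countries)  # ['US', 'DE', None]
```

Inside scrape_pages, a None result would then be a signal to skip that link instead of indexing past the end of the list.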

Another solution is to exclude the last href by changing the regexp to:

    for link in page.find_all(href=re.compile('/countries/')):
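
To see why the trailing slash helps, the two patterns can be compared with the re module alone. This sketch assumes the hrefs look like the hypothetical ones below and that the pattern is applied with search(), which is how BeautifulSoup is generally understood to test a compiled regex against an attribute value.

```python
import re

loose = re.compile('/countries')    # pattern from the question
strict = re.compile('/countries/')  # pattern with the trailing slash

# Hypothetical hrefs; only the last one lacks a country segment.
hrefs = ['/countries/US', '/countries/DE', '/countries']

loose_matches = [h for h in hrefs if loose.search(h)]
strict_matches = [h for h in hrefs if strict.search(h)]

print(loose_matches)   # ['/countries/US', '/countries/DE', '/countries']
print(strict_matches)  # ['/countries/US', '/countries/DE']
```

With the stricter pattern the bare "/countries" link never reaches scrape_pages, so split('/')[2] always has a third element to return.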
User contributions licensed under: CC BY-SA