
Python Web Scraping – How to Skip Over Missing Entries?

I am working on a project that involves analyzing the text of political emails from this website: https://politicalemails.org/. I am attempting to scrape all the emails using requests, BeautifulSoup, and pandas. Here is a working chunk of code:

#Import libraries
import numpy as np
import requests
from bs4 import BeautifulSoup
import pandas as pd

#Check if scraping is allowed
url = 'https://politicalemails.org/messages'
page = requests.get(url)
page  #In a notebook, this echoes the response, e.g. <Response [200]>


#Prepare empty dataframe
df = pd.DataFrame(
    {
        'sender':[''], 
        'subject':[''],
        'date':[''],
        'body':['']
    }
)


#Loop through emails and scrape
url_base = 'https://politicalemails.org/messages/'
#email_pages=50

for i in range(2,5):
#for i in range(email_pages):

    url_full = url_base+str(i)
    page = requests.get(url_full)

    soup = BeautifulSoup(page.text,'lxml')
    email = soup.find_all('td',class_='content-box-meta__value')
    message = soup.find_all('div',class_='message-text')

    sender = email[0].text.strip()
    subject = email[1].text.strip()
    date = email[2].text.strip()

    body = message[0].text.strip()
    

    df = df.append({
            'sender':sender, 
            'subject':subject,
            'date':date,
            'body':body
    },ignore_index=True)

df.head()

The above pulls the data I want. However, I want to loop through a much larger chunk of the emails in this archive. Checking any of the following links:

print(url_base+str(0))
print(url_base+str(1))
print(url_base+str(100))

results in a ‘404 Not Found’ error. How can I build “skip” logic that detects when there is no information to scrape and moves on to the next iteration? If I use the commented-out chunk of code with email_pages = 50, I get an error that reads:

IndexError: list index out of range

How should I approach editing my for loop to account for this behavior?


Answer

I’d advise using Python’s match statement (structural pattern matching, available in Python 3.10+) for situations like these, inside your for loop:

match page.status_code:
  case 404:
    continue

If your Python version is older than 3.10 and does not support the match statement, you can do exactly the same thing with a plain if clause:

if page.status_code == 404:
  continue

continue moves straight to the next iteration of the loop, skipping the rest of the body, since the request retrieved no resource to parse.
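Putting the check into your original loop, a sketch might look like the following. (parse_email and scrape are hypothetical helper names; the selectors are the ones from your question.) Rows are collected in a list and passed to pd.DataFrame once at the end, because DataFrame.append was removed in pandas 2.0. The length check on the parsed elements also guards against pages that load fine but lack the expected markup, which would otherwise raise the IndexError you saw:

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

url_base = 'https://politicalemails.org/messages/'

def parse_email(html, parser='lxml'):
    """Return a dict of the four fields, or None if the expected markup is absent."""
    soup = BeautifulSoup(html, parser)
    email = soup.find_all('td', class_='content-box-meta__value')
    message = soup.find_all('div', class_='message-text')
    if len(email) < 3 or not message:  # guards against IndexError
        return None
    return {
        'sender': email[0].text.strip(),
        'subject': email[1].text.strip(),
        'date': email[2].text.strip(),
        'body': message[0].text.strip(),
    }

def scrape(email_pages=50):
    rows = []
    for i in range(email_pages):
        page = requests.get(url_base + str(i))
        if page.status_code != 200:  # covers 404 and any other failure
            continue
        row = parse_email(page.text)
        if row is None:  # page loaded but contains no email
            continue
        rows.append(row)
    return pd.DataFrame(rows, columns=['sender', 'subject', 'date', 'body'])
```

With this structure, df = scrape(50) returns only the pages that actually existed and parsed cleanly, and missing IDs are silently skipped.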

User contributions licensed under: CC BY-SA