I am working on a project that involves analyzing the text of political emails from this website: https://politicalemails.org/. I am attempting to scrape all the emails using BeautifulSoup and pandas. I have a working chunk right here:
#Import libraries
import numpy as np
import requests
from bs4 import BeautifulSoup
import pandas as pd
#Check if scraping is allowed
url = 'https://politicalemails.org/messages'
page = requests.get(url)
page
#Prepare empty dataframe
df = pd.DataFrame(
    {
        'sender':[''], 
        'subject':[''],
        'date':[''],
        'body':['']
    }
)
#Loop through emails and scrape
url_base = 'https://politicalemails.org/messages/'
#email_pages=50
for i in range(2,5):
#for i in range(email_pages):
    url_full = url_base+str(i)
    page = requests.get(url_full)
    soup = BeautifulSoup(page.text,'lxml')
    email = soup.find_all('td',class_='content-box-meta__value')
    message = soup.find_all('div',class_='message-text')
    sender = email[0].text.strip()
    subject = email[1].text.strip()
    date = email[2].text.strip()
    body = message[0].text.strip()
    
    df = df.append({
            'sender':sender, 
            'subject':subject,
            'date':date,
            'body':body
    },ignore_index=True)
df.head()
The above pulls the data I want. However, I want to loop through larger chunks of the emails in this archive. Visiting any of the following links:
print(url_base+str(0))
print(url_base+str(1))
print(url_base+str(100))
results in a ‘404 Not Found’ error. How can I build “skip” logic that detects when there is nothing to scrape from a page and then moves on to the next iteration? If I use the commented-out code with email_pages=50, I get an error that reads:
IndexError: list index out of range
How should I approach editing my for loop to account for this behavior?
Answer
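I’d advise checking the response’s status code before parsing anything. On Python 3.10 or later you can do this with a match statement (Python’s closest equivalent to a switch-case), placed inside the loop right after the request: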
match page.status_code:
  case 404:
    continue
If your Python version does not support match statements (they were added in Python 3.10), you can do just the same with a plain if statement:
if page.status_code == 404: continue
continue tells the loop to move straight to the next iteration, skipping the rest of the loop body. That is what you want here, since a 404 response contains no email to parse.
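Putting it together, here is a minimal sketch of how the loop from the question might look with the skip logic in place. A few things here are my own assumptions rather than anything from the question: the helper names parse_message and scrape_messages are made up for illustration, I use the built-in 'html.parser' instead of 'lxml' to avoid an extra dependency, and I skip any non-200 response rather than only 404. It also guards against pages that load fine but lack the expected elements, which is the other way you could hit the IndexError.

```python
import requests
from bs4 import BeautifulSoup

URL_BASE = "https://politicalemails.org/messages/"


def parse_message(html):
    """Return the email fields as a dict, or None if the page lacks them."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find_all("td", class_="content-box-meta__value")
    message = soup.find_all("div", class_="message-text")
    # Guard against the IndexError: need three meta cells and one body.
    if len(meta) < 3 or not message:
        return None
    return {
        "sender": meta[0].text.strip(),
        "subject": meta[1].text.strip(),
        "date": meta[2].text.strip(),
        "body": message[0].text.strip(),
    }


def scrape_messages(ids, get=requests.get):
    """Fetch each message id, skipping missing or unparseable pages.

    `get` is a hypothetical hook (defaults to requests.get) so the
    function can be exercised without touching the network.
    """
    rows = []
    for i in ids:
        page = get(URL_BASE + str(i))
        if page.status_code != 200:
            continue  # e.g. 404 Not Found: nothing to scrape, move on
        row = parse_message(page.text)
        if row is None:
            continue  # page loaded but does not look like an email
        rows.append(row)
    return rows
```

You would call it as rows = scrape_messages(range(50)) and then build the dataframe once at the end with pd.DataFrame(rows). Collecting rows in a list also sidesteps df.append, which was removed in pandas 2.0.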