I am working on a project that analyzes the text of political emails from this website: https://politicalemails.org/. I am trying to scrape all the emails using requests, BeautifulSoup, and pandas. Here is a working chunk:

#Import libraries
import numpy as np
import requests
from bs4 import BeautifulSoup
import pandas as pd

#Check if scraping is allowed
url = 'https://politicalemails.org/messages'
page = requests.get(url)
page


#Prepare empty dataframe
df = pd.DataFrame(
    {
        'sender':[''], 
        'subject':[''],
        'date':[''],
        'body':['']
    }
)


#Loop through emails and scrape
url_base = 'https://politicalemails.org/messages/'
#email_pages=50

for i in range(2,5):
#for i in range(email_pages):

    url_full = url_base+str(i)
    page = requests.get(url_full)

    soup = BeautifulSoup(page.text,'lxml')
    email = soup.find_all('td',class_='content-box-meta__value')
    message = soup.find_all('div',class_='message-text')

    sender = email[0].text.strip()
    subject = email[1].text.strip()
    date = email[2].text.strip()

    body = message[0].text.strip()
    

    df = df.append({
            'sender':sender, 
            'subject':subject,
            'date':date,
            'body':body
    },ignore_index=True)

df.head()

The above pulls the data I want. However, I want to loop through larger chunks of the emails in this archive. Opening any of the following links:

print(url_base+str(0))
print(url_base+str(1))
print(url_base+str(100))

results in a ‘404 Not Found’ error. How can I build “skip” logic that detects when a page has nothing to scrape and moves on to the next iteration? If I use the commented-out loop with email_pages = 50, I get this error:

IndexError: list index out of range

How should I approach editing my for loop to account for this behavior?

Answer

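I’d advise checking the response’s status code inside the loop, right after page = requests.get(url_full). On Python 3.10 or later you can use a match statement (Python’s version of a switch-case) for situations like this: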

match page.status_code:
  case 404:
    continue

If your Python version does not support match statements (anything before 3.10), you can do the same with a plain if statement.

if page.status_code == 404:
  continue

continue tells the loop to move on to the next iteration, skipping the rest of the loop body, since the request retrieved nothing to parse.
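
Putting this together, here is a minimal sketch of how the whole loop might look. The function names parse_message and scrape_messages and the fetch parameter are my own, not part of any library, and I am assuming the class names from the question still match the site’s HTML. Besides the status check, it also skips pages that load but lack the expected elements, which is the case that raises the IndexError. I used the built-in html.parser here so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup


def parse_message(html):
    """Return a dict of email fields, or None if the page lacks them."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find_all("td", class_="content-box-meta__value")
    body = soup.find_all("div", class_="message-text")
    # Guard against pages that load but lack the expected elements;
    # indexing into these lists blindly is what raises IndexError.
    if len(meta) < 3 or not body:
        return None
    return {
        "sender": meta[0].text.strip(),
        "subject": meta[1].text.strip(),
        "date": meta[2].text.strip(),
        "body": body[0].text.strip(),
    }


def scrape_messages(ids, fetch, url_base="https://politicalemails.org/messages/"):
    """Collect one row per message id, skipping missing or unparseable pages.

    `fetch` (a hypothetical parameter) is any callable that takes a URL and
    returns an object with `.status_code` and `.text`, e.g. `requests.get`.
    """
    rows = []
    for i in ids:
        page = fetch(url_base + str(i))
        if page.status_code != 200:
            continue  # e.g. 404 Not Found: nothing to scrape, move on
        row = parse_message(page.text)
        if row is None:
            continue  # page exists but has no email content
        rows.append(row)
    return rows
```

With requests imported, rows = scrape_messages(range(50), requests.get) followed by df = pd.DataFrame(rows) should give the same four columns as in the question. Building the DataFrame once from a list of dicts also avoids df.append, which was removed in pandas 2.0.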

Python Web Scraping – How to Skip Over Missing Entries?
