I am working on a project that involves analyzing the text of political emails from this website: https://politicalemails.org/. I am attempting to scrape all the emails using BeautifulSoup and pandas. I have a working chunk right here:
#Import libraries import numpy as np import requests from bs4 import BeautifulSoup import pandas as pd #Check if scraping is allowed url = 'https://politicalemails.org/messages' page = requests.get(url) page #Prepare empty dataframe df = pd.DataFrame( { 'sender':[''], 'subject':[''], 'date':[''], 'body':[''] } ) #Loop through emails and scrape url_base = 'https://politicalemails.org/messages/' #email_pages=50 for i in range(2,5): #for i in range(email_pages): url_full = url_base+str(i) page = requests.get(url_full) soup = BeautifulSoup(page.text,'lxml') email = soup.find_all('td',class_='content-box-meta__value') message = soup.find_all('div',class_='message-text') sender = email[0].text.strip() subject = email[1].text.strip() date = email[2].text.strip() body = message[0].text.strip() df = df.append({ 'sender':sender, 'subject':subject, 'date':date, 'body':body },ignore_index=True) df.head()
The above results in pulling the data I want. However, I want to loop through larger chunks of the emails in this archive. Just checking out either one of the following links:
print(url_base+str(0)) print(url_base+str(1)) print(url_base+str(100))
results in a ‘404 Not Found’ error. How can I build a “skip” logic that sees if there is no information to scrape from the website and then moves on to the next iteration? If I used the commented out chunk of code with the email_pages = 50
, I will get an error that reads:
IndexError: list index out of range
How should I approach editing my for loop to account for this behavior?
Advertisement
Answer
I’d advise using a switch case for situations like these.
match page.status_code: case 404: continue
If your Python version does not support switch-case statements, You could do just the same with an if-else
clause.
if page.status_code == 404: continue
continue
instructs it to move to the next iteration, allowing you to skip the rest of the code since there are no resources retrieved in the request.