I’m trying to scrape word definitions, but I can’t get Python to follow the redirect to the correct page. For example, I want the definition of the word ‘agenesia’. When you open https://www.lexico.com/definition/agenesia in a browser, the page that actually loads is https://www.lexico.com/definition/agenesis. In Python, however, the request isn’t redirected and returns a 200 status code:
URL = 'https://www.lexico.com/definition/agenesia'
page = requests.head(URL, allow_redirects=True)
This is how I’m currently retrieving the page content. I’ve also tried requests.get, but that doesn’t work either.
EDIT: To be clear, I know I could change the word to ‘agenesis’ in the URL to get the correct page, but I’m scraping a list of words and would rather follow the redirect automatically than look up each one by hand in a browser first.
EDIT 2: I realised it might be easier to check solutions against the rest of my code. So far this works with agenesis but not agenesia:
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
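When the expected spans are missing from the page that comes back, soup.find returns None and the chained .get_text() call raises AttributeError. Below is a minimal sketch of a more defensive version of this parsing step. It assumes the same "ind" and "pos" span classes used above; the helper name extract_definition is hypothetical and not part of the original code.

```python
from bs4 import BeautifulSoup


def extract_definition(html):
    """Return (definition, part_of_speech) from a page's HTML.

    Either value is None when the corresponding span is not found,
    instead of raising AttributeError on a missing element.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Class names are taken from the snippet above; they are assumptions
    # about the page markup, not a documented interface.
    ind = soup.find("span", {"class": "ind"})
    pos = soup.find("span", {"class": "pos"})
    definition = ind.get_text(strip=True) if ind is not None else None
    part_of_speech = pos.get_text(strip=True) if pos is not None else None
    return definition, part_of_speech
```

Calling extract_definition(page.content) then makes it possible to tell "no definition on this page" apart from a crash, which is handy when looping over many words.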
Answer
The other answers posted so far don’t make the request redirect. The cause is that your request doesn’t send the right request header. Try the code below, which adds an accept header:
import requests
from bs4 import BeautifulSoup

headers = {
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
}

page = requests.get('https://www.lexico.com/definition/agenesia', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')

print(page.url)
print(soup.find("span", {"class": "ind"}).get_text(), '\n')
print(soup.find("span", {"class": "pos"}).get_text())
This prints:
https://www.lexico.com/definition/agenesis?s=t
Failure of development, or incomplete development, of a part of the body.

noun
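Since the question is about a whole list of words, here is a rough sketch of how the same request could be wrapped in a loop. It reuses the accept header from the code above and reads response.history, which requests fills with the intermediate responses whenever a redirect was followed. The names fetch_word and fetch_all, the returned dictionary layout, and the 10-second timeout are assumptions made for illustration.

```python
import requests

BASE_URL = "https://www.lexico.com/definition/"
HEADERS = {
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8,"
              "application/signed-exchange;v=b3;q=0.9",
}


def fetch_word(session, word):
    """Fetch one word's page and report where the request ended up."""
    response = session.get(BASE_URL + word, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return {
        "requested": word,
        "final_url": response.url,              # URL after any redirects
        "redirected": bool(response.history),   # non-empty when a redirect was followed
        "html": response.text,
    }


def fetch_all(words):
    """Fetch every word over a single shared session (hypothetical helper)."""
    with requests.Session() as session:
        return [fetch_word(session, word) for word in words]
```

For example, fetch_all(['agenesia', 'agenesis']) returns one dictionary per word. Comparing "requested" with "final_url" shows which words were redirected, and "html" can be passed on to BeautifulSoup as before.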