
Python 404’ing on urllib.request

The basics of the code are below. I know for a fact that the way I'm retrieving these pages works for other URLs, as I just wrote a script scraping a different page in the same way. However, with this specific URL it keeps throwing "urllib.error.HTTPError: HTTP Error 404: Not Found" in my face. When I replace the URL with a different one (https://www.premierleague.com/clubs), it works completely fine. I'm very new to Python, so perhaps there's a really basic step or piece of knowledge I'm missing, but the resources I've found online relating to this didn't seem relevant. Any advice would be great, thanks.

Below is the barebones of the script:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import csv

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# This call raises urllib.error.HTTPError: HTTP Error 404: Not Found
uClient = uReq(myurl)


Answer

The problem is most likely that the site you are trying to access is actively blocking requests that look like automated crawlers; you can change the user agent to get around it. See this question for more information (the solution prescribed in that post seems to work for your URL too).

If you want to stick with urllib, this post shows how to alter the user agent.
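For reference, here is a minimal sketch of that approach using urllib.request.Request with a custom User-Agent header; the "Mozilla/5.0" string is just an example of a browser-like value, not anything specific to that post:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

myurl = "https://www.transfermarkt.co.uk/premier-league/startseite/wettbewerb/GB1"

# Send a browser-like User-Agent so the request is not treated as an
# anonymous script (example value only)
req = Request(myurl, headers={"User-Agent": "Mozilla/5.0"})

uClient = urlopen(req)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

From there the rest of your scraping code should work unchanged, since page_soup is the same BeautifulSoup object you were building before.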
