Trying to retrieve the page source from a website, I get a completely different (and shorter) text than when viewing the same page source through a web browser.
This fellow has a related issue, but obtained the home page source instead of the requested one – I am getting something completely alien.
The code is:
from urllib import request

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    response = request.urlopen(url)
    return str(response.read())

n = 1006233
text = get_page_source(n)
This is the page I am targeting in this example: https://www.whoscored.com/Matches/1006233/live
The URL in question has a rich page source, but I end up getting only the following when running the code above:
text =
b'<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px; height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&xinfo=0-12919260-0 0NNY RT(1462118673272 111) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U2&incident_id=276000100045095595-100029307305590944&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 276000100045095595-100029307305590944</iframe></body></html>'
What went wrong here? Can a server detect a robot even when it has not sent repetitive requests? If so, how, and is there a way around it?
Answer
There are a couple of issues here. The root cause is that the website you are trying to scrape knows you're not a real person and is blocking you. Lots of websites do this simply by checking request headers to see whether the request comes from a browser or from a script (a robot). However, this site appears to use Incapsula, which is designed to provide more sophisticated protection. You can try setting browser-like headers on your request to get past the page's security, but I doubt this will work.
import requests

def get_page_source(n):
    url = 'https://www.whoscored.com/Matches/' + str(n) + '/live'
    # Pretend to be a regular desktop browser by sending a browser User-Agent header.
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    response = requests.get(url, headers=headers)
    return response.text

n = 1006233
text = get_page_source(n)
print(text)
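If you do try this, you can at least check whether you got the real page or the Incapsula block page before trusting the content. Here is a minimal sketch (the looks_blocked helper is just an illustrative name, and it assumes the block page keeps containing the Incapsula markers shown in your output):

import requests

def looks_blocked(response):
    # The Incapsula interstitial is short and mentions "_Incapsula_Resource" and an
    # "Incapsula incident ID", so a simple substring check usually distinguishes it
    # from the real match page.
    return ('_Incapsula_Resource' in response.text
            or 'Incapsula incident ID' in response.text)

response = requests.get('https://www.whoscored.com/Matches/1006233/live',
                        headers={'User-Agent': 'Mozilla/5.0'})
if response.status_code != 200 or looks_blocked(response):
    print('Still blocked by the bot protection')
else:
    print('Got the real page source')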
Looks like the site also uses captchas, which are designed to prevent web scraping. If a site is trying this hard to prevent scraping, it's likely because the data it provides is proprietary. I would suggest finding another site that provides this data, or trying to use an official API.
Check out this answer (https://stackoverflow.com/a/17769971/701449) from a while back. It looks like whoscored.com uses the OPTA API to provide its data, so you may be able to skip the middleman and go straight to the source. Good luck!