I’m working on an automated web scraper for a restaurant website, but I’m having an issue. The website uses Cloudflare’s anti-bot security, which I would like to bypass — not the Under-Attack Mode, but a captcha test that triggers only when it detects a non-American IP or a bot. I’m trying to bypass it because Cloudflare’s security doesn’t trigger when I clear cookies, disable JavaScript, or use an American proxy.
Knowing this, I tried using Python’s requests library, as such:
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}

response = requests.get("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers).text
print(response)
But this ends up triggering Cloudflare, no matter the proxy I use.
HOWEVER, when using urllib.request with the same headers, as such:
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'}

request = urllib.request.Request("https://grimaldis.myguestaccount.com/guest/accountlogin", headers=headers)
r = urllib.request.urlopen(request).read()
print(r.decode('utf-8'))
When run with the same American IP, this time it does not trigger Cloudflare’s security, even though it uses the same headers and IP used with the requests library.
So I’m trying to figure out what exactly is triggering Cloudflare in the requests library that isn’t in the urllib library.
While the typical answer would be “Just use urllib then”, I’d like to figure out what exactly is different with requests, and how I could fix it, first off to understand how requests works and Cloudflare detects bots, but also so that I may apply any fix I can find to other httplibs (notably asynchronous ones)
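A practical first step for a comparison like this is to capture the raw bytes each library actually sends (the dumps in EDIT N°1 below look like `http.client` debug output). The following sketch turns that output on for both libraries; note that for `urllib.request` the debug level must be passed through the handlers, since it overrides the class-level setting per connection:

```python
import http.client
import logging
import urllib.request

# requests/urllib3 build on http.client: a non-zero class-level
# debuglevel makes every connection print its raw request line and
# headers to stdout, so header order becomes directly comparable.
http.client.HTTPConnection.debuglevel = 1
logging.basicConfig()
logging.getLogger("urllib3").setLevel(logging.DEBUG)

# urllib.request sets the debug level per connection itself, so it
# must be enabled through the handlers instead.
opener = urllib.request.build_opener(
    urllib.request.HTTPHandler(debuglevel=1),
    urllib.request.HTTPSHandler(debuglevel=1),
)
```

With this in place, `opener.open(...)` and any `requests.get(...)` call will both print their outgoing request lines and headers, which is enough to spot ordering differences without a proxy.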
EDIT N°2: Progress so far:
Thanks to @TuanGeek, we can now bypass the Cloudflare block using requests, as long as we connect directly to the host IP rather than the domain name (for some reason, DNS resolution with requests triggers Cloudflare, but with urllib it doesn’t):
import requests
from collections import OrderedDict
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})

s = requests.Session()
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", verify=False).text
Note: trying to access via HTTP (rather than HTTPS with verify set to False) will trigger Cloudflare’s block.
Now this is great, but unfortunately my final goal of making this work asynchronously with the httplib HTTPX still isn’t met: with the following code, the Cloudflare block is still triggered, even though we’re connecting directly through the host IP, with proper headers, and with verify set to False:
import trio
import httpx
import socket
from collections import OrderedDict

answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

headers = OrderedDict({
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})

async def asks_worker():
    async with httpx.AsyncClient(headers=headers, verify=False) as s:
        r = await s.get(f'https://{address}/guest/accountlogin')
        print(r.text)

async def run_task():
    async with trio.open_nursery() as nursery:
        nursery.start_soon(asks_worker)

trio.run(run_task)
EDIT N°1: For additional details, here are the raw HTTP requests from urllib and requests.
REQUESTS:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 403 Forbidden\r\n'
header: Date: Thu, 02 Jul 2020 20:20:06 GMT
header: Content-Type: text/html; charset=UTF-8
header: Transfer-Encoding: chunked
header: Connection: close
header: CF-Chl-Bypass: 1
header: Set-Cookie: __cfduid=df8902e0b19c21b364f3bf33e0b1ce1981593721256; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
header: Expires: Thu, 01 Jan 1970 00:00:01 GMT
header: X-Frame-Options: SAMEORIGIN
header: cf-request-id: 03b2c8d09300000ca181928200000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=df8962e1b27c25b364f3bf66e8b1ce1981593723206; expires=Sat, 01-Aug-20 20:20:06 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Vary: Accept-Encoding
header: Server: cloudflare
header: CF-RAY: 5acb25c75c981ca1-EWR
URLLIB:
send: b'GET /guest/nologin/account-balance HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: grimaldis.myguestaccount.com\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0\r\nConnection: close\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Date: Thu, 02 Jul 2020 20:20:01 GMT
header: Content-Type: text/html;charset=utf-8
header: Transfer-Encoding: chunked
header: Connection: close
header: Set-Cookie: __cfduid=db9de9687b6c22e6c12b33250a0ded3251292457801; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Expires: Thu, 2 Jul 2020 20:20:01 GMT
header: Cache-Control: no-cache, private, no-store
header: X-Powered-By: Undertow/1
header: Pragma: no-cache
header: X-Frame-Options: SAMEORIGIN
header: Content-Security-Policy: script-src 'self' 'unsafe-inline' 'unsafe-eval' https://www.google-analytics.com https://www.google-analytics.com/analytics.js https://use.typekit.net connect.facebook.net/ https://googleads.g.doubleclick.net/ app.pendo.io cdn.pendo.io pendo-static-6351154740266000.storage.googleapis.com pendo-io-static.storage.googleapis.com https://www.google.com/recaptcha/ https://www.gstatic.com/recaptcha/ https://www.google.com/recaptcha/api.js apis.google.com https://www.googletagmanager.com api.instagram.com https://app-rsrc.getbee.io/plugin/BeePlugin.js https://loader.getbee.io api.instagram.com https://bat.bing.com/bat.js https://www.googleadservices.com/pagead/conversion.js https://connect.facebook.net/en_US/fbevents.js https://connect.facebook.net/ https://fonts.googleapis.com/ https://ssl.gstatic.com/ https://tagmanager.google.com/;style-src 'unsafe-inline' *;img-src * data:;connect-src 'self' app.pendo.io api.feedback.us.pendo.io; frame-ancestors 'self' app.pendo.io pxsweb.com *.pxsweb.com;frame-src 'self' *.myguestaccount.com https://app.getbee.io/ *;
header: X-Lift-Version: Unknown Lift Version
header: CF-Cache-Status: DYNAMIC
header: cf-request-id: 01b2c5b1fa00002654a25485710000001
header: Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
header: Set-Cookie: __cfduid=db9de811004e591f9a12b66980a5dde331592650101; expires=Sat, 01-Aug-20 20:20:01 GMT; path=/; domain=.myguestaccount.com; HttpOnly; SameSite=Lax; Secure
header: Server: cloudflare
header: CF-RAY: 5acb58a62c5b5144-EWR
Answer
This really piqued my interest. Here is the requests solution that I was able to get working.
Solution
I finally narrowed down the problem. When you use requests, it goes through a urllib3 connection pool, and there seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution:
import requests
from collections import OrderedDict
from requests import Session
import socket

# grab the address using socket.getaddrinfo
answers = socket.getaddrinfo('grimaldis.myguestaccount.com', 443)
(family, type, proto, canonname, (address, port)) = answers[0]

s = Session()
headers = OrderedDict({
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': "grimaldis.myguestaccount.com",
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0'
})
s.headers = headers
response = s.get(f"https://{address}/guest/accountlogin", headers=headers, verify=False).text
print(response)
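One caveat worth noting: `socket.getaddrinfo` can return several entries, and on some systems an IPv6 entry comes first; its sockaddr has four fields, which would break the `(address, port)` unpacking. A small helper restricting resolution to IPv4/TCP avoids this (the function name is my own, for illustration):

```python
import socket

def resolve_ipv4(host: str, port: int = 443) -> str:
    """Return the first IPv4 address for host.

    Restricting family/type matters: IPv6 sockaddrs have four fields,
    and the first answer from getaddrinfo is not guaranteed to be IPv4.
    """
    answers = socket.getaddrinfo(host, port,
                                 family=socket.AF_INET,
                                 type=socket.SOCK_STREAM)
    family, socktype, proto, canonname, (address, _port) = answers[0]
    return address
```

The solution above would then use `address = resolve_ipv4('grimaldis.myguestaccount.com')`.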
Technical Background
So I ran both methods through Burp Suite to compare the requests. Below are the raw dumps of the requests.
using requests
GET /guest/accountlogin HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept-Encoding: gzip, deflate
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Connection: close
Host: grimaldis.myguestaccount.com
Accept-Language: en-GB,en;q=0.5
Upgrade-Insecure-Requests: 1
dnt: 1
using urllib
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: en-GB,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: close
Upgrade-Insecure-Requests: 1
Dnt: 1
The difference is the ordering of the headers. The difference in the capitalization of dnt is not actually the problem.
So I was able to make a successful request with the following raw request:
GET /guest/accountlogin HTTP/1.1
Host: grimaldis.myguestaccount.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0
So the Host header has to be sent above User-Agent. If you want to continue to use requests, consider using an OrderedDict to ensure the ordering of the headers.
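As a sanity check, requests lets you build a request without sending it, so the final header order can be inspected offline. (On Python 3.7+ a plain dict preserves insertion order too, but OrderedDict makes the intent explicit.) A minimal sketch:

```python
from collections import OrderedDict
import requests

s = requests.Session()
# Assigning to s.headers replaces requests' default headers entirely,
# so only these two are sent, in exactly this order: Host first.
s.headers = OrderedDict({
    'Host': 'grimaldis.myguestaccount.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0',
})

# Prepare (but do not send) a request and inspect the header order.
req = requests.Request('GET', 'https://grimaldis.myguestaccount.com/guest/accountlogin')
prepared = s.prepare_request(req)
print(list(prepared.headers))
```

This prints the headers in session order, confirming that Host precedes User-Agent before anything touches the network.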