
Python’s requests triggers Cloudflare’s security while urllib does not

I’m working on an automated web scraper for a restaurant website, but I’m having an issue. The website uses Cloudflare’s anti-bot security, which I would like to bypass. This is not the Under-Attack-Mode but a captcha test that only triggers when it detects a non-American IP or a bot. I believe it can be bypassed because Cloudflare’s security doesn’t trigger when I clear cookies, disable JavaScript, or use an American proxy.

Knowing this, I tried using Python’s requests library as such:

[code block not preserved]

But this ends up triggering Cloudflare, no matter which proxy I use.

However, when using urllib.request with the same headers, as such:

[code block not preserved]

When run with the same American IP, this does not trigger Cloudflare’s security, even though it uses the same headers and IP as the requests version.
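For reference, a minimal sketch of what such a urllib.request call could look like. The URL and header values are placeholders (assumptions, not the ones from the original post). One detail that matters when comparing dumps: `urllib.request.Request` normalizes header names with `str.capitalize()`, so `DNT` is stored as `Dnt`.

```python
import urllib.request

# Placeholder target and headers; substitute the real site and browser headers.
URL = "https://example.com/"
HEADERS = {
    "Host": "example.com",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0",
    "Accept-Language": "en-US,en;q=0.5",
    "DNT": "1",
}


def build_request(url=URL, headers=HEADERS):
    """Build (but do not send) a urllib request carrying the given headers."""
    return urllib.request.Request(url, headers=headers)


def fetch(url=URL, headers=HEADERS, timeout=10):
    """Send the request and return the decoded body (needs network access)."""
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Header names as urllib stores them, e.g. 'User-agent' and 'Dnt'.
    print(build_request().header_items())
```

Printing `header_items()` before sending is a cheap way to see what urllib has done to the header names without touching the network.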

So I’m trying to figure out what exactly is triggering Cloudflare in the requests library that isn’t in the urllib library.

While the typical answer would be “Just use urllib then”, I’d like to figure out what exactly is different with requests and how I could fix it. First, I want to understand how requests works and how Cloudflare detects bots. Second, I want to be able to apply any fix I find to other HTTP libraries (notably asynchronous ones).

EDIT N°2: Progress so far:

Thanks to @TuanGeek we can now bypass the Cloudflare block using requests, as long as we connect directly to the host IP rather than the domain name (for some reason, DNS resolution through requests triggers Cloudflare, but through urllib it doesn’t):

[code block not preserved]

Note: trying to access the site via HTTP (rather than HTTPS with the verify parameter set to False) will trigger Cloudflare’s block.
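The direct-to-IP approach can be sketched with the standard library alone: resolve the address, build a URL around the IP literal, and keep the domain in the Host header. `example.com`, the documentation address `203.0.113.7`, and the helper names are all placeholders introduced for illustration. Because a certificate will not match a bare IP, the actual request needs certificate verification disabled, which is only reasonable for experiments like this one.

```python
import ipaddress
import socket


def resolve_first_address(hostname, port=443):
    """Return the first address the system resolver reports (needs DNS)."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    return infos[0][4][0]


def build_ip_url(address, path="/", scheme="https"):
    """Build a URL that targets an IP literal instead of a domain name."""
    ip = ipaddress.ip_address(address)
    # IPv6 literals must be wrapped in brackets inside a URL.
    host = f"[{ip}]" if ip.version == 6 else str(ip)
    return f"{scheme}://{host}{path}"


def direct_request_args(hostname, address, path="/"):
    """Return (url, headers): the URL uses the IP, the Host header keeps the domain."""
    return build_ip_url(address, path), {"Host": hostname}


if __name__ == "__main__":
    # 203.0.113.7 is a documentation-only address used here as a stand-in.
    print(direct_request_args("example.com", "203.0.113.7", "/login"))
```

With requests, the returned pair would be passed along the lines of `session.get(url, headers=headers, verify=False)`.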

Now this is great, but unfortunately my final goal of making this work asynchronously with the HTTP library HTTPX still isn’t met. With the following code, the Cloudflare block is still triggered even though we’re connecting directly through the host IP, with proper headers, and with verify set to False:

[code block not preserved]

EDIT N°1: For additional details, here’s the raw HTTP request from urllib and requests

REQUESTS:

[raw request dump not preserved]

URLLIB:

[raw request dump not preserved]

Answer

This really piqued my interest. Here is the requests solution that I was able to get working.

Solution

I finally narrowed down the problem. When you use requests, it uses a urllib3 connection pool. There seems to be some inconsistency between a regular urllib3 connection and a connection pool. A working solution:

[code block not preserved]

Technical Background

So I ran both methods through Burp Suite to compare the requests. Below are the raw dumps of the requests.

using requests

[raw request dump not preserved]

using urllib

[raw request dump not preserved]

The difference is the ordering of the headers. The difference in the dnt capitalization is not actually the problem.
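When comparing two raw dumps, it can help to reduce each one to its header names in order. The sketch below does that for two made-up requests (illustrative only, not the actual Burp Suite output) and reports the first position where the order differs, ignoring capitalization.

```python
def header_names(raw_request):
    """Return the header names of a raw HTTP request, in order of appearance."""
    head = raw_request.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = head.split("\n")[1:]  # skip the request line
    return [line.split(":", 1)[0].strip() for line in lines if ":" in line]


def first_order_difference(names_a, names_b):
    """Return (index, name_a, name_b) at the first mismatch, ignoring case."""
    for i, (a, b) in enumerate(zip(names_a, names_b)):
        if a.lower() != b.lower():
            return i, a, b
    return None


# Illustrative dumps only; header values are placeholders.
DUMP_A = (
    "GET / HTTP/1.1\r\n"
    "User-Agent: placeholder-agent\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Host: example.com\r\n"
    "DNT: 1\r\n\r\n"
)
DUMP_B = (
    "GET / HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: placeholder-agent\r\n"
    "Accept-Encoding: gzip, deflate\r\n"
    "Dnt: 1\r\n\r\n"
)

if __name__ == "__main__":
    print(first_order_difference(header_names(DUMP_A), header_names(DUMP_B)))
```

Comparing names case-insensitively keeps a `DNT` versus `Dnt` difference from masking the ordering difference.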

So I was able to make a successful request with the following raw request:

[raw request dump not preserved]
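A raw request with a fixed header order can also be assembled and sent by hand, which takes every HTTP library out of the picture. This is a minimal sketch under stated assumptions: the host, path, and header values are placeholders rather than the ones from the original dump, and `build_raw_request` and `send_raw` are hypothetical helper names.

```python
import socket
import ssl


def build_raw_request(method, path, headers):
    """Serialize an HTTP/1.1 request, emitting headers in the given order.

    `headers` is a list of (name, value) tuples so the order is explicit.
    """
    lines = [f"{method} {path} HTTP/1.1"]
    lines += [f"{name}: {value}" for name, value in headers]
    # A blank line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def send_raw(host, raw, port=443, timeout=10):
    """Send pre-built request bytes over TLS and return the raw response (needs network)."""
    context = ssl.create_default_context()
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            tls.sendall(raw)
            while True:
                data = tls.recv(4096)
                if not data:
                    break
                chunks.append(data)
    return b"".join(chunks)


# Host is listed first, ahead of User-Agent.
ORDERED_HEADERS = [
    ("Host", "example.com"),
    ("User-Agent", "placeholder-agent"),
    ("Accept-Encoding", "identity"),
    ("Connection", "close"),
]

if __name__ == "__main__":
    print(build_raw_request("GET", "/", ORDERED_HEADERS).decode("ascii"))
```

`Connection: close` is included so the server ends the response by closing the socket, which is what lets the read loop in `send_raw` terminate.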

So the Host header has to be sent above User-Agent. If you want to continue to use requests, consider using an OrderedDict to ensure the ordering of the headers.
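A minimal sketch of the OrderedDict suggestion, assuming placeholder header values and `example.com` as the target. `requests` is imported inside the function that needs it so the header helper can be inspected on its own. The session's headers are replaced wholesale rather than updated, because updating would leave the default User-Agent entry of requests in its original position, ahead of Host.

```python
from collections import OrderedDict


def ordered_headers(host):
    """Return headers with Host first, then User-Agent (values are placeholders)."""
    return OrderedDict([
        ("Host", host),
        ("User-Agent", "Mozilla/5.0 (X11; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0"),
        ("Accept-Encoding", "gzip, deflate"),
    ])


def fetch(url, host):
    """Fetch `url` with requests, sending the headers in the order defined above."""
    import requests  # third-party; imported here so the helper above has no dependency

    session = requests.Session()
    # Replace the defaults entirely; updating them would keep the default
    # User-Agent entry ahead of Host.
    session.headers = ordered_headers(host)
    return session.get(url, timeout=10)


if __name__ == "__main__":
    print(list(ordered_headers("example.com")))
```

On Python 3.7 and later a plain dict preserves insertion order as well, so the OrderedDict mainly makes the intent explicit.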