Download PDF from PeerJ

I am trying to use the Python requests library to download a PDF from PeerJ, for example https://peerj.com/articles/1.pdf.

My code is simply:

import requests
r = requests.get('https://peerj.com/articles/1.pdf')

However, the Response object returned displays as <Response [432]>, which indicates an HTTP 432 error. As far as I know, that error code is not assigned.

When I examine r.text or r.content, there is some HTML which says that it’s an error 432 and gives a link to the same PDF, https://peerj.com/articles/1.pdf.

I can view the PDF when I open it in my browser (Chrome).

How do I get the actual PDF (as a bytes object, like I should get from r.content)?

Answer

I opened the URL you mentioned in Firefox with the developer tools open, copied the HTTP request headers from there, and passed them to the headers parameter of requests.get:

a = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8', 'Accept-Encoding': 'gzip, deflate, br', 'Accept-Language': 'en-US,en;q=0.5', 'Connection': 'keep-alive', 'Host': 'peerj.com', 'Referer': 'https://peerj.com/articles/1.pdf', 'Sec-Fetch-Dest': 'document', 'Sec-Fetch-Mode': 'navigate', 'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-User': '?1', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}

r = requests.get('https://peerj.com/articles/1.pdf', headers=a)
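
If you want to be sure that what comes back really is the PDF and not another HTML error page, you can wrap the request in a small helper. The following is only a sketch: fetch_pdf, PDF_MAGIC and the get parameter are names I made up for illustration, and the check relies on the fact that PDF files start with the bytes %PDF-. Which of the headers the server actually requires is not something I have established.

```python
import requests

PDF_MAGIC = b"%PDF-"  # PDF files begin with these bytes


def fetch_pdf(url, headers, get=requests.get, timeout=30):
    """Return the body at `url` as bytes, raising if it is not a PDF.

    `get` is a hypothetical injection point so the function can be
    exercised without network access; by default it is requests.get.
    """
    response = get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx replies (such as the 432 from the question) into an exception.
    response.raise_for_status()
    # An HTML page can also arrive with a success status, so check the body too.
    if not response.content.startswith(PDF_MAGIC):
        raise ValueError(f"expected a PDF from {url}, got {response.content[:20]!r}")
    return response.content
```

Calling fetch_pdf('https://peerj.com/articles/1.pdf', a) with the dictionary above should then either return the PDF as a bytes object, which you can write to a file opened in 'wb' mode, or raise an exception that tells you the request was still rejected.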

User contributions licensed under: CC BY-SA