I am trying to use Python requests to download a PDF from PeerJ, for example https://peerj.com/articles/1.pdf.
My code is simply:
r = requests.get('https://peerj.com/articles/1.pdf')
However, the Response object returned displays as <Response [432]>, which indicates an HTTP 432 error. As far as I know, that status code is not assigned.
When I examine r.text or r.content, there is some HTML which says that it's an error 432 and gives a link to the same PDF, https://peerj.com/articles/1.pdf.
I can view the PDF when I open it in my browser (Chrome).
How do I get the actual PDF (as a bytes object, like I should get from r.content)?
Answer
I opened the URL you mentioned in Firefox with the developer tools open, copied the HTTP request headers from there, and passed them to the headers parameter of requests.get:
import requests

# Request headers copied from Firefox's developer tools
a = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
     # 'br' is only decoded by requests if a Brotli package is installed
     'Accept-Encoding': 'gzip, deflate, br',
     'Accept-Language': 'en-US,en;q=0.5',
     'Connection': 'keep-alive',
     'Host': 'peerj.com',
     'Referer': 'https://peerj.com/articles/1.pdf',
     'Sec-Fetch-Dest': 'document',
     'Sec-Fetch-Mode': 'navigate',
     'Sec-Fetch-Site': 'same-origin',
     'Sec-Fetch-User': '?1',
     'Upgrade-Insecure-Requests': '1',
     'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
r = requests.get('https://peerj.com/articles/1.pdf', headers=a)
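Since the original problem was an HTML error page coming back instead of the PDF, it may be worth checking the response before using r.content. Below is a minimal sketch of that check, not something taken from PeerJ or the requests docs: the fetch_pdf helper and the PDF_MAGIC constant are hypothetical names of my own, and the helper assumes it is given a requests.Session (or any object with a compatible get method).

```python
# A well-formed PDF starts with this header.
PDF_MAGIC = b"%PDF-"


def fetch_pdf(session, url, headers, timeout=30.0):
    """Download `url` and return the body as bytes, or raise if it is not a PDF.

    `session` is assumed to be a requests.Session, or any object whose
    .get(url, headers=..., timeout=...) returns a response with
    .raise_for_status() and .content.
    """
    response = session.get(url, headers=headers, timeout=timeout)
    # With requests, this raises HTTPError for 4xx/5xx statuses, including 432.
    response.raise_for_status()
    data = response.content
    if not data.startswith(PDF_MAGIC):
        # For example, an HTML error page served with a 200 status.
        raise ValueError(f"{url} did not return a PDF (got {data[:20]!r})")
    return data


# Usage (performs a real network request, so it is left commented out):
#   import requests
#   with requests.Session() as s:
#       pdf_bytes = fetch_pdf(s, 'https://peerj.com/articles/1.pdf', a)
#   with open('1.pdf', 'wb') as f:
#       f.write(pdf_bytes)
```

Here a is the headers dictionary defined above. Passing the session in, rather than calling requests.get directly, also means cookies set by the site are reused if you download several articles.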