I am attempting to scrape some data from a website using a POST request with the Python requests
library. Unfortunately, I am unable to post a link to the page, as you must be signed in to the website to use it.
The request I am trying to replicate has the file extension .ehtml and this is part of the Request payload I am looking to recreate:
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="session_id"

W0pNKn8AAQEAACD-XkYAAAAJ
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="p_session_id"

W0pMOH8AAQEAABZSUVkAAAAD
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="attach_key"


------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="chosen"

0
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="debug"


------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="language"

en
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="game_system_id"

NULL
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="collection_detail_id"

NULL
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="competition_id"

NULL
With help from some of the questions on Stack Overflow, I have managed to recreate it this far:
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="session_id"


--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="p_session_id"


--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="attach_key"


--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="chosen"

0
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="debug"


--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="language"

en
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="game_system_id"

NULL
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="collection_detail_id"

NULL
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="competition_id"

NULL
That was done using this code:
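import requests
from requests import Request

# login_URL2, payload and req_url are defined elsewhere (not shown here).

# A (None, value) tuple passed through files= is encoded as a plain
# multipart/form-data field with no filename.
Q = {
    "session_id": (None, ""),
    "p_session_id": (None, ""),
    "attach_key": (None, ""),
    "chosen": (None, "0"),
    "debug": (None, ""),
    "language": (None, "en"),
    "game_system_id": (None, "NULL"),
    "collection_detail_id": (None, "NULL"),
    "competition_id": (None, "NULL"),
}

with requests.Session() as s:
    p = s.post(login_URL2, data=payload)
    #print(p.text)
    #d = s.post(req_url, files=Q)
    # Build the request without sending it, so the encoded body can be inspected.
    d2 = Request("POST", req_url, files=Q)
    d3 = d2.prepare()
    print(d3.body.decode('utf-8'))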
I believe the last thing I am missing is the WebKitFormBoundary part, but I cannot find anywhere how to insert it. This is my first time scraping a page with an .ehtml extension, so if I have missed anything else obvious, all help is much appreciated.
Answer
The exact name of the boundary does not matter as long as it is declared in the header:
Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p
With this header the boundaries would be
--gc0p4Jq0M2Yt08jU534c0p
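To check this for the request in the question, the sketch below prepares a request without sending it and compares the boundary declared in the Content-Type header with the delimiter lines in the body. The URL and the two fields are placeholders chosen for illustration, not the real endpoint.

```python
from requests import Request

# Placeholder URL and a reduced field set, for illustration only.
fields = {"chosen": (None, "0"), "language": (None, "en")}
prepared = Request("POST", "https://example.com/form.ehtml", files=fields).prepare()

# requests picks a random boundary and declares it in the header itself.
content_type = prepared.headers["Content-Type"]
boundary = content_type.split("boundary=", 1)[1]
body = prepared.body.decode("utf-8")

print(content_type)
print(body)

# Each delimiter line in the body is the declared boundary prefixed with "--".
first_line = body.split("\r\n", 1)[0]
```

Because the header and the body are generated together, they stay consistent. Setting Content-Type by hand while passing files= would replace the header that carries the generated boundary, so it is usually best left alone.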
The server will look at the Content-Type
header and use the boundary declared there to split the body into its parts.
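As a rough illustration of that server-side step, the sketch below uses Python's standard email parser as a stand-in for a server. The helper names build_body and parse_form_fields are made up for this example, and a real server would use its own multipart parser. The same fields are encoded once with a WebKit-style boundary and once with a requests-style boundary, and both parse to the same result.

```python
from email.parser import BytesParser


def build_body(boundary, fields):
    """Encode simple text fields as a multipart/form-data body (hypothetical helper)."""
    lines = []
    for name, value in fields.items():
        lines += [
            "--" + boundary,
            'Content-Disposition: form-data; name="%s"' % name,
            "",
            value,
        ]
    lines.append("--" + boundary + "--")  # closing delimiter
    return ("\r\n".join(lines) + "\r\n").encode("utf-8")


def parse_form_fields(content_type, body):
    """Split a body into fields using only the boundary named in Content-Type."""
    raw = b"Content-Type: " + content_type.encode("ascii") + b"\r\n\r\n" + body
    message = BytesParser().parsebytes(raw)
    if not message.is_multipart():
        raise ValueError("body did not match the declared boundary")
    parsed = {}
    for part in message.get_payload():
        name = part.get_param("name", header="content-disposition")
        parsed[name] = part.get_payload(decode=True).decode("utf-8")
    return parsed


fields = {"chosen": "0", "language": "en"}
results = []
for boundary in ("----WebKitFormBoundary8rntuVzldIBHkILv",
                 "30b11983bde849109a3dc93e139e16d4"):
    body = build_body(boundary, fields)
    header = "multipart/form-data; boundary=" + boundary
    results.append(parse_form_fields(header, body))
    print(results[-1])
```

The parsed fields do not depend on which boundary string was used, only on the header and body agreeing with each other.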