I am attempting to scrape some data from a website using a POST request with the Python requests
library. Unfortunately I am unable to post a link to the page, as you must be signed in to the website to use it.
The request I am trying to replicate goes to a URL with the file extension .ehtml, and this is part of the request payload I am looking to recreate:
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="session_id"
W0pNKn8AAQEAACD-XkYAAAAJ
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="p_session_id"
W0pMOH8AAQEAABZSUVkAAAAD
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="attach_key"
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="chosen"
0
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="debug"
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="language"
en
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="game_system_id"
NULL
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="collection_detail_id"
NULL
------WebKitFormBoundary8rntuVzldIBHkILv
Content-Disposition: form-data; name="competition_id"
NULL
With help from some of the questions on Stack Overflow, I have managed to recreate it this far:
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="session_id"
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="p_session_id"
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="attach_key"
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="chosen"
0
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="debug"
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="language"
en
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="game_system_id"
NULL
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="collection_detail_id"
NULL
--30b11983bde849109a3dc93e139e16d4
Content-Disposition: form-data; name="competition_id"
NULL
That was done using this code:
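import requests
from requests import Request

# Each value is a (filename, content) tuple; a filename of None makes
# requests emit a plain form field rather than a file upload.
Q = {
    "session_id": (None, ""),
    "p_session_id": (None, ""),
    "attach_key": (None, ""),
    "chosen": (None, "0"),
    "debug": (None, ""),
    "language": (None, "en"),
    "game_system_id": (None, "NULL"),
    "collection_detail_id": (None, "NULL"),
    "competition_id": (None, "NULL"),
}
with requests.Session() as s:
    # login_URL2, payload and req_url are defined elsewhere in the script
    p = s.post(login_URL2, data=payload)
    #print(p.text)
    #d = s.post(req_url, files=Q)
    # Build the request without sending it so the encoded body can be inspected
    d2 = Request("POST", req_url, files=Q)
    d3 = d2.prepare()
    print(d3.body.decode('utf-8'))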
I believe the last thing I am missing is the WebKitFormBoundary part, and I have been unable to find out how to insert it. This is my first time scraping an .ehtml page, so if I have missed anything else obvious, all help is much appreciated.
Answer
The exact name of the boundary does not matter as long as it is declared in the header:
Content-Type: multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p
With this header the boundaries would be
--gc0p4Jq0M2Yt08jU534c0p
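As a rough illustration of that rule, the sketch below derives the delimiter lines from a Content-Type value using only the standard library. The helper name boundary_delimiters is made up for this example. It also suggests why the browser payload shows six leading dashes: the boundary Chrome declares most likely already starts with four dashes, and the two-dash prefix is added on top.

```python
from email.message import Message


def boundary_delimiters(content_type):
    """Return (part delimiter, closing delimiter) implied by a Content-Type value.

    Hypothetical helper for illustration; not part of requests.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    boundary = msg.get_boundary()  # None if the header has no boundary parameter
    if boundary is None:
        raise ValueError("no boundary parameter in Content-Type")
    # Each part is introduced by "--" + boundary; the body ends with an extra "--".
    return "--" + boundary, "--" + boundary + "--"


part, closing = boundary_delimiters(
    "multipart/mixed; boundary=gc0p4Jq0M2Yt08jU534c0p"
)
print(part)
print(closing)
```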
The server will look at the Content-Type header, read the boundary declared there,
and use it to split the body into its parts.
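To check this with requests, the following sketch prepares a multipart request without sending it and compares the boundary in the generated Content-Type header with the one used in the body. The URL is a placeholder, since the real one is not given in the question, and only two of the form fields are included to keep it short.

```python
from requests import Request

# Placeholder endpoint (assumption); substitute the real .ehtml URL.
req_url = "https://example.com/page.ehtml"

# (None, value) tuples produce plain form fields, as in the question.
fields = {
    "chosen": (None, "0"),
    "language": (None, "en"),
}

# prepare() encodes the body and sets Content-Type; nothing is sent over the network.
prepared = Request("POST", req_url, files=fields).prepare()
content_type = prepared.headers["Content-Type"]
boundary = content_type.split("boundary=", 1)[1]

print(content_type)
# The body opens with the same boundary that the header declares.
print(prepared.body.startswith(("--" + boundary).encode("ascii")))
```

If this holds, the random boundary that requests generates should be just as acceptable to the server as the WebKit one. One thing to watch for: setting a Content-Type header by hand would replace the generated one, and the boundary declaration would be lost with it.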