I am using beautifulsoup and requests to scrape the html contents of this webpage. Based on the selection made in the page, a list of stations is populated. Clicking on any one station renders an html page with td values.
For example:
1. State Name - West Bengal
2. District Name - Bardhman
List of stations: Chitranjan, Damodar Rl Bridge, ...
My objective is to get data for each station from the list.
I am making a POST request, but the response does not contain any td tag values (maybe they are dynamically loaded).
Code:
from bs4 import BeautifulSoup
import requests

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
}
cookies = {
    'JSESSIONID': 'A95A81E6F668F00E677AD460CD3DBB99'
}
data = {
    'lstStation': '014-DDASL'
}

response = requests.post(
    'http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/',
    headers=headers, data=data, cookies=cookies
)
soup = BeautifulSoup(response.content, 'html.parser')
#print(soup.text)

all_td = soup.select('td')
for td in all_td:
    print(td.text)
Any help would be appreciated. Thanks!
Answer
You are right, it is highly likely the content is dynamically loaded using javascript, something requests is agnostic about. Moreover, many websites do not like being scraped and employ defenses to mitigate scrapers. The best course of action is to look around for an API that the site provides to satisfy your requirements.
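If you find such an endpoint (check the XHR/Fetch entries in the browser's network panel while you click a station), calling it directly is usually the cleanest route. The URL and parameter name below are purely hypothetical placeholders, not the site's real API:

import requests

# Hypothetical endpoint: replace with whatever XHR/JSON URL you actually see
# in the network panel; the path and the "station" parameter are placeholders.
API_URL = "http://india-water.gov.in/ffs/api/station-data"

resp = requests.get(API_URL, params={"station": "014-DDASL"}, timeout=30)
resp.raise_for_status()
print(resp.json())  # structured data, no HTML parsing needed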
Otherwise, you have mainly two options.
Simple – Just need javascript
In the simplest scenario, where the site doesn't employ any sophisticated anti-webscraping methods, you can simply use a headless browser that interprets javascript, among other things. selenium is a popular tool of choice.
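As a minimal headless-Chrome sketch with selenium: it assumes the station dropdown is a select element named lstStation (taken from the form field in your POST data) and that the results end up in td cells; verify both against the actual page.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # run without opening a browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/")

    # 'lstStation' and the value come from your POST data; confirm the real
    # element name/id in the page source.
    Select(driver.find_element(By.NAME, "lstStation")).select_by_value("014-DDASL")

    # Wait until the table cells rendered by javascript actually appear.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, "td"))
    )
    for td in driver.find_elements(By.TAG_NAME, "td"):
        print(td.text)
finally:
    driver.quit()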
Less simple – Evading detection
In case they do try to detect and prevent bots from scraping their site, you'll need to investigate how they do it and evade their methods. There is no one-stop solution for this, and it requires time and patience. The easiest case to evade is when they just whitelist known User-Agent strings from the request header, or maybe even just rate-throttle requests. Then adding the right header fields and pacing your requests will suffice.
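A minimal sketch of that, reusing the User-Agent from your question and adding a simple delay between requests:

import time
import requests

session = requests.Session()
# Present a normal desktop browser User-Agent instead of requests' default.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

stations = ["014-DDASL"]  # extend with the other station codes you need
for code in stations:
    resp = session.post(
        "http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/",
        data={"lstStation": code},
    )
    # ... parse resp.content here ...
    time.sleep(5)  # simple rate throttling so you don't hammer the server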
Much more common are strong bot detections that poll your "browser" for its resolution, try to play a sound through it, or try to execute a function that known headless browsers, such as selenium, are known to expose. Headless browsers fail to evade these checks out of the box, and you'll have to work around them.
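If you do end up fighting that kind of detection with selenium, a common (and only partial) starting point is to hide the most obvious automation signals; the flags below are illustrative, not a complete solution:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Hide the most common automation giveaways; this is a best-effort workaround,
# not a guarantee against every detection script.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--window-size=1920,1080")  # a realistic resolution

driver = webdriver.Chrome(options=options)
# Overwrite navigator.webdriver, which many detection scripts poll.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)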
You can comb through the network requests your browser makes (in the developer panel, F12 by default in Firefox) or invest a little more time in a tool better fitted for the job, such as ZAP Proxy. The latter can MiTM your requests and sniff your own network traffic, which you can use to "diff" the traffic of a legitimate request (actual browser) versus your script.
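For the diffing part, you can point requests at the local proxy so your script's traffic shows up alongside the browser's. Assuming the proxy listens on 127.0.0.1:8080 (ZAP's default):

import requests

# Route the script's traffic through a local intercepting proxy so you can
# compare it against what the real browser sends.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
resp = requests.post(
    "http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/",
    data={"lstStation": "014-DDASL"},
    proxies=proxies,
    verify=False,  # the proxy re-signs TLS with its own certificate
)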
Good luck!