
Python POST request for web scraping

I am using BeautifulSoup and requests to scrape the HTML contents of this webpage.

Based on the selections made on the page, a list of stations is populated. Clicking on any one station renders an HTML page with td values.

For example:

1. State Name - West Bengal
2. District Name - Bardhaman

List of stations: Chitranjan, Damodar Rl Bridge, ...

My objective is to get data for each station from the list.

I am making a POST request, but the response does not contain any td tag values (the content may be dynamically loaded).

Code:

import requests
from bs4 import BeautifulSoup

# Headers mimicking a regular browser request
headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
}

# Session cookie copied from the browser
cookies = {
    'JSESSIONID': 'A95A81E6F668F00E677AD460CD3DBB99'
}

# Form data identifying the selected station
data = {
    'lstStation': '014-DDASL'
}

response = requests.post(
    'http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/',
    headers=headers, data=data, cookies=cookies
)

soup = BeautifulSoup(response.content, 'html.parser')

# print(soup.text)
all_td = soup.select('td')

for td in all_td:
    print(td.text)

Any help would be appreciated. Thanks!


Answer

You are right, it is highly likely that the content is dynamically loaded using JavaScript, something requests knows nothing about. Moreover, many websites do not like being scraped and employ defenses to mitigate scrapers. The best course of action is to look for an API that the site provides to satisfy your requirements.
Otherwise, you have two main options.

Simple – just needs JavaScript

In the simplest scenario, where the site doesn't employ any sophisticated anti-webscraping methods, you could simply use a headless browser that interprets JavaScript, among other things. Selenium is a popular tool of choice.
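A minimal sketch with Selenium and headless Chrome might look like the following. It assumes a working chromedriver installation and that the td values appear once the page's JavaScript runs; the URL is the one from your question, and the dropdown interaction is left as a placeholder since the element IDs are not shown:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get('http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/')

    # To reach a specific station you would first drive the page's
    # state/district/station dropdowns here; the element IDs have to be
    # read from the page source.

    # Wait until the JavaScript has rendered at least one td element
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'td'))
    )

    soup = BeautifulSoup(driver.page_source, 'html.parser')
    for td in soup.select('td'):
        print(td.text)
finally:
    driver.quit()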

Less simple – Evading detection

In case they do try to detect and prevent bots from scraping their site, you'll need to investigate how they do it and evade their methods. There isn't a one-stop-shop solution for this, and it takes time and patience. The easiest case to evade is when they merely whitelist known User-Agent strings from the request header, perhaps combined with simple rate throttling. Then spoofing the header fields and pacing your requests will suffice, as sketched below.
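As a sketch, spoofing a browser User-Agent and pacing your requests with plain requests could look like this (the header value is the one from your question; the one-second delay is an arbitrary example):

import time
import requests

headers = {
    # Present the script as a regular desktop browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
}

stations = ['014-DDASL']  # extend with the other station codes you need

for station in stations:
    response = requests.post(
        'http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/',
        headers=headers,
        data={'lstStation': station},
    )
    print(response.status_code)
    time.sleep(1)  # simple rate throttling between requests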
Much more common are stronger bot detections that poll your "browser" for its resolution, try to play a sound through it, or try to execute a function that headless browsers such as Selenium are known to expose. Headless browsers fail to evade this out of the box, and you'll have to work around it.
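One of the most common probes is the navigator.webdriver property, which is true in an automated browser. A sketch of one well-known counter, assuming a Chromium-based Selenium driver, is to overwrite the property before any page script runs; keep in mind this defeats only this single probe:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')

driver = webdriver.Chrome(options=options)

# Inject a script that runs before each page's own JavaScript and
# hides the tell-tale navigator.webdriver flag.
driver.execute_cdp_cmd(
    'Page.addScriptToEvaluateOnNewDocument',
    {'source': "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"}
)

driver.get('http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/')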

You can comb through the network requests your browser makes (in the developer panel, F12 by default in Firefox) or invest a little more time in learning a tool better fitted for the job, such as ZAP Proxy. The latter can MiTM your requests and sniff your own network traffic, which lets you "diff" the traffic of a legitimate request (actual browser) vs. your script.
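Once you have found the actual request the browser fires when a station is clicked, you can often replay it directly with requests. A hypothetical sketch; the endpoint path, headers, and form fields below are placeholders to be replaced with whatever you capture in the network panel or ZAP Proxy:

import requests

# Placeholder endpoint and parameters: substitute the URL, method,
# headers and form fields observed in the captured traffic.
response = requests.post(
    'http://india-water.gov.in/ffs/SOME-OBSERVED-ENDPOINT/',
    headers={'User-Agent': 'Mozilla/5.0 ...'},  # placeholder User-Agent
    data={'lstStation': '014-DDASL'},
)
print(response.text)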

Good luck!
