I am using beautifulsoup and requests to scrape the html contents of this webpage. Based on the selection made in the page, a list of stations is populated. Clicking on any one station renders an html page with td values.
For example:
1. State Name - West Bengal
2. District Name - Bardhman
List of stations: Chitranjan, Damodar Rl Bridge, ...
My objective is to get data for each station from the list.
I am making a POST request, but the response does not contain any td tag values (maybe they are dynamically loaded).
Code:
from bs4 import BeautifulSoup
import requests

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36'
}
cookies = {
    'JSESSIONID': 'A95A81E6F668F00E677AD460CD3DBB99'
}
data = {
    'lstStation': '014-DDASL'
}

response = requests.post(
    'http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/',
    headers=headers, data=data, cookies=cookies
)
soup = BeautifulSoup(response.content, 'html.parser')
#print(soup.text)

all_td = soup.select('td')
for td in all_td:
    print(td.text)
Any help would be appreciated. Thanks!
Answer
You are right, it is highly likely the content is dynamically loaded using javascript, something requests is agnostic about. Moreover, many websites do not like being scraped and employ defenses to mitigate scrapers. The best course of action is to look around for an API that the site provides to satisfy your requirements.
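If you find such an endpoint (check the XHR/Fetch entries in the browser's network panel while you click a station), calling it directly is usually the cleanest route. The URL and parameter name below are purely hypothetical placeholders, not the site's real API:

import requests

# Hypothetical endpoint: replace with whatever XHR/JSON URL you actually see
# in the network panel; the path and the "station" parameter are placeholders.
API_URL = "http://india-water.gov.in/ffs/api/station-data"

resp = requests.get(API_URL, params={"station": "014-DDASL"}, timeout=30)
resp.raise_for_status()
print(resp.json())  # structured data, no HTML parsing needed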
Otherwise, you have mainly two options.
Simple – Just need javascript
In the simplest scenario, where the site doesn't employ any sophisticated anti-webscraping methods, you can simply use a headless browser that interprets javascript, among other things. selenium is a popular tool of choice.
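As a minimal headless-Chrome sketch with selenium: it assumes the station dropdown is a select element named lstStation (taken from the form field in your POST data) and that the results end up in td cells; verify both against the actual page.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # run without opening a browser window
driver = webdriver.Chrome(options=options)

try:
    driver.get("http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/")

    # 'lstStation' and the value come from your POST data; confirm the real
    # element name/id in the page source.
    Select(driver.find_element(By.NAME, "lstStation")).select_by_value("014-DDASL")

    # Wait until the table cells rendered by javascript actually appear.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.TAG_NAME, "td"))
    )
    for td in driver.find_elements(By.TAG_NAME, "td"):
        print(td.text)
finally:
    driver.quit()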
Less simple – Evading detection
In case they do try to detect and prevent bots from scraping their site, you'll need to investigate how they do it and evade their methods. There is no one-stop solution for this, and it requires time and patience. The easiest case to evade is when they just whitelist known User-Agent strings from the request header, or maybe even just rate-throttle requests. Then adding the right header fields and pacing your requests will suffice.
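A minimal sketch of that, reusing the User-Agent from your question and adding a simple delay between requests:

import time
import requests

session = requests.Session()
# Present a normal desktop browser User-Agent instead of requests' default.
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
})

stations = ["014-DDASL"]  # extend with the other station codes you need
for code in stations:
    resp = session.post(
        "http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/",
        data={"lstStation": code},
    )
    # ... parse resp.content here ...
    time.sleep(5)  # simple rate throttling so you don't hammer the server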
Much more common are strong bot detections that poll your "browser" for its resolution, try to play a sound through it, or try to execute a function that known headless browsers, such as selenium, are known to expose. Headless browsers fail to evade these checks out of the box, and you'll have to work around them.
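If you do end up fighting that kind of detection with selenium, a common (and only partial) starting point is to hide the most obvious automation signals; the flags below are illustrative, not a complete solution:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Hide the most common automation giveaways; this is a best-effort workaround,
# not a guarantee against every detection script.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument("--window-size=1920,1080")  # a realistic resolution

driver = webdriver.Chrome(options=options)
# Overwrite navigator.webdriver, which many detection scripts poll.
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)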
You can comb through the network requests your browser makes (in the developer panel, F12 by default in Firefox) or invest a little more time in a tool better fitted for the job, such as ZAP Proxy. The latter can MiTM your requests and sniff your own network traffic, which you can use to "diff" the traffic of a legitimate request (actual browser) versus your script.
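For the diffing part, you can point requests at the local proxy so your script's traffic shows up alongside the browser's. Assuming the proxy listens on 127.0.0.1:8080 (ZAP's default):

import requests

# Route the script's traffic through a local intercepting proxy so you can
# compare it against what the real browser sends.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}
resp = requests.post(
    "http://india-water.gov.in/ffs/data-flow-list-based/flood-forecasted-site/",
    data={"lstStation": "014-DDASL"},
    proxies=proxies,
    verify=False,  # the proxy re-signs TLS with its own certificate
)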
Good luck!