I’m trying to fetch tabular content from a webpage using the requests module. After navigating to that webpage, when I manually type 0466425389 right next to Company number and hit the search button, the table is produced accordingly. However, when I mimic the same using requests, I get the following response.
<?xml version='1.0' encoding='UTF-8'?> <partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>
I’ve tried with:
import requests

link = 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'

# Form fields copied from the browser's search request
payload = {
    'javax.faces.partial.ajax': 'true',
    'javax.faces.source': 'page_searchForm:actions:0:button',
    'javax.faces.partial.execute': 'page_searchForm',
    'javax.faces.partial.render': 'page_searchForm page_listForm pageMessagesId',
    'page_searchForm:actions:0:button': 'page_searchForm:actions:0:button',
    'page_searchForm': 'page_searchForm',
    'page_searchForm:j_id3:generated_number_2_component': '0466425389',
    'page_searchForm:j_id3:generated_name_4_component': '',
    'page_searchForm:j_id3:generated_address_zipCode_6_component': '',
    'page_searchForm:j_id3_activeIndex': '0',
    'page_searchForm:j_id2_stateholder': 'panel_param_visible;',
    'page_searchForm:j_idt133_stateholder': 'panel_param_visible;',
    'javax.faces.ViewState': 'e1s1'
}

headers = {
    'Faces-Request': 'partial/ajax',
    'X-Requested-With': 'XMLHttpRequest',
    'Origin': 'https://cri.nbb.be',
    'Accept': 'application/xml, text/xml, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Host': 'cri.nbb.be',
    'Referer': 'https://cri.nbb.be/bc9/web/catalog?execution=e1s1'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    s.get(link)
    s.headers.update(headers)
    res = s.post(link, data=payload)
    print(res.text)
How can I fetch tabular content from that site using requests?
Answer
Judging by the “action” attribute of the search form, the page appears to generate a new JSESSIONID each time it is opened, and the server seems to require it. I had some success by posting to a URL that includes it.
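To illustrate the idea, here is a minimal sketch of turning a form's “action” attribute into an absolute URL that still carries the session id. The HTML snippet, the `ABC123` session id, and the helper name `form_action_url` are made up for the example; the real page's markup will likely differ, so treat this as an assumption rather than a description of the site.

```python
import re
from urllib.parse import urljoin

link = "https://cri.nbb.be/bc9/web/catalog?execution=e1s1"

# Hypothetical form markup; the real attributes and session id will differ.
sample_html = (
    '<form id="page_searchForm" method="post" '
    'action="/bc9/web/catalog;jsessionid=ABC123?execution=e1s1">'
)


def form_action_url(html, base_url, form_id="page_searchForm"):
    """Return the absolute URL from the action attribute of the given form, or None."""
    # Find the opening <form ...> tag that has the requested id
    tag = re.search(r'<form\b[^>]*\bid="%s"[^>]*>' % re.escape(form_id), html)
    if tag is None:
        return None
    # Pull the action attribute out of that tag only
    action = re.search(r'\baction="([^"]*)"', tag.group(0))
    if action is None:
        return None
    # urljoin keeps the ";jsessionid=..." path parameter and the query string
    return urljoin(base_url, action.group(1))


print(form_action_url(sample_html, link))
```

The point of the sketch is that `urljoin` preserves the `;jsessionid=...` part of the path, so the POST goes to the session-specific URL rather than to the bare `link`.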
You don’t need to explicitly set the headers other than the User-Agent.
I added some code to (a) pull out the “action” attribute of the form using BeautifulSoup (you could do this with a regex if you prefer) and (b) get the URL from the redirect XML that you showed at the top of your question.
import re
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

...

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"

    # GET to get search form
    req1 = s.get(link)

    # Get the form action
    soup = BeautifulSoup(req1.text, "lxml")
    form = soup.select_one("#page_searchForm")
    form_action = urljoin(link, form["action"])

    # POST the form
    req2 = s.post(form_action, data=payload)

    # Extract the target from the redirection xml response
    target = re.search('url="(.*?)"', req2.text).group(1)

    # Final GET to get the search result
    req3 = s.get(urljoin(link, target))

    # Parse and print (some of) the result
    soup = BeautifulSoup(req3.text, "lxml").body
    for detail in soup.select(".company-details tr"):
        columns = detail.select("td")
        if columns:
            print(f"{columns[0].text.strip()}: {columns[1].text.strip()}")
Result:
Company number: 0466.425.389
Name: A en B PARTNERS
Address: Quai de Willebroeck 37 : BE 1000 Bruxelles
Municipality code NIS: 21004 Bruxelles
Legal form: Cooperative company with limited liability
Legal situation: Normal situation
Activity code (NACE-BEL) The activity code of the company is the statistical activity code in use on the date of consultation, given by the CBSO based on the main activity codes available at the Crossroads Bank for Enterprises and supplementary informations collected from the companies: 69201 - Accountants and fiscal advisors
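As an alternative to the regex used for the redirect step, the partial-response could also be read with the standard library's XML parser, which takes care of entity escaping (for example `&amp;`) in the URL. This is only a sketch: the helper name `redirect_target` is hypothetical, and it assumes the response always has the shape quoted in the question.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

link = "https://cri.nbb.be/bc9/web/catalog?execution=e1s1"

# The partial-response body quoted in the question
sample_xml = (
    "<?xml version='1.0' encoding='UTF-8'?> "
    '<partial-response><redirect url="/bc9/web/catalog"></redirect></partial-response>'
)


def redirect_target(xml_text, base_url):
    """Return the absolute redirect URL from a partial-response, or None if absent."""
    # strip() guards against leading whitespace before the XML declaration
    root = ET.fromstring(xml_text.strip())
    # <redirect> is expected to be a direct child of <partial-response>
    redirect = root.find("redirect")
    if redirect is None:
        return None
    url = redirect.get("url")
    return urljoin(base_url, url) if url else None


print(redirect_target(sample_xml, link))
```

In the answer's code, this would replace the `re.search(...)` line, with `req2.text` passed in place of `sample_xml`. Returning `None` instead of raising makes it easier to notice when the server sends something other than a redirect.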