I am trying to scrape the main table with the tag:
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from the following website using the BeautifulSoup library, but the code returns an empty [] even though printing soup returns the HTML string and the request status is 200. I found out that when I use the browser's "inspect element" tool I can see the table tag, but in "view page source" the table tag, which is part of the "app-root" tag, is not shown (you only see an empty <app-root></app-root>). Besides, there is no JSON file among the webpage's components to extract the data from. Please help me figure out how I can scrape the table data.
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup

yurl = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req = urllib.request.urlopen(yurl)
print(req.status)

# get response
response = req.read()
html = response.decode("utf-8")

# make html readable
soup = BeautifulSoup(html, features="html")
table_body = soup.find_all("table")
print(table_body)
Answer
The table is in the source HTML, but it is effectively hidden and only rendered by JavaScript: the data sits inside one of the <script> tags. That script can be located with bs4 and the embedded object pulled out with a regex. The table data can then be parsed with json.loads, loaded into a pandas DataFrame, and written to a .csv file. Since I don't know any Persian, you'd have to see if the result is of any use; just by looking at some values, I think it is. Oh, and this can be done without selenium.
Here’s how:
import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"

# Grab all <script type="text/javascript"> tags from the page.
scripts = BeautifulSoup(
    requests.get(url, verify=False).content,
    "lxml",
).find_all("script", {"type": "text/javascript"})

# The table data is embedded as "var datasource = {...}" in one of the
# scripts; pull the JSON object out with a regex and parse it.
table_data = json.loads(
    re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)

# Dump the table cells to a .csv via pandas.
pd.DataFrame(
    table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
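If the script order on the page ever changes, the hard-coded scripts[-5] index will break. A slightly more defensive sketch (assuming the data keeps being embedded under the name datasource) is to scan every <script> tag for that assignment instead:

import json
import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"

soup = BeautifulSoup(requests.get(url, verify=False).content, "lxml")

# Look through every script tag for the "var datasource = {...}" assignment.
datasource = None
for script in soup.find_all("script"):
    match = re.search(r"var datasource = ({.*})", script.string or "")
    if match:
        datasource = json.loads(match.group(1))
        break

if datasource is not None:
    cells = datasource["sheets"][0]["tables"][0]["cells"]
    pd.DataFrame(cells).to_csv("huge_table.csv", index=False)

This produces the same .csv, it just doesn't depend on the position of the script tag.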
This outputs a huge file that looks like this: