I am trying to scrape the main table with the tag:
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from the following website using the BeautifulSoup library, but the code returns an empty [] even though printing soup returns the HTML string and the request status is 200. I found that when I use the browser's "inspect element" tool I can see the table tag, but in "view page source" the table tag, which is part of the "app-root" tag, is not shown (you see an empty <app-root></app-root>). Besides, there is no JSON file among the webpage's components to extract the data from. How can I scrape the table data?
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html.parser")
table_body=soup.find_all("table")
print(table_body)
Answer
The table is in the source HTML, but it's somewhat hidden and then rendered by JavaScript. It sits in one of the <script> tags. It can be located with bs4 and then extracted with regex. Finally, the table data can be parsed with json.loads, loaded into a pandas DataFrame, and dumped to a .csv file, but since I don't know any Persian, you'd have to see if it's of any use. Just by looking at some values, I think it is. Oh, and this can be done without selenium.
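The locate-and-parse step can be sketched on a tiny stand-in page; searching every <script> tag for the var datasource assignment avoids depending on the script's position in the page. The HTML below is a made-up stand-in, not the real codal.ir markup:

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in HTML: the real page embeds the table JSON inside
# one of many <script> tags as a "var datasource = {...}" assignment.
html = """
<html><body>
<script type="text/javascript">var other = 1;</script>
<script type="text/javascript">var datasource = {"sheets": [{"tables": [{"cells": [{"value": "42"}]}]}]};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

datasource = None
for script in soup.find_all("script"):
    # Match by content rather than by the script's index in the page.
    match = re.search(r"var datasource = ({.*})", script.string or "")
    if match:
        datasource = json.loads(match.group(1))
        break

print(datasource["sheets"][0]["tables"][0]["cells"][0]["value"])  # 42
```

The same search loop works on the live page's soup; only the regex and the JSON keys (sheets, tables, cells) come from the actual datasource object.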
Here’s how:
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"

# Collect all JavaScript <script> tags from the raw (un-rendered) HTML.
scripts = BeautifulSoup(
    requests.get(url, verify=False).content,
    "lxml",
).find_all("script", {"type": "text/javascript"})

# The table JSON lives in a "var datasource = {...}" assignment;
# it happens to sit in the fifth <script> tag from the end.
table_data = json.loads(
    re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)

# Flatten the cell objects into a DataFrame and dump them to CSV.
pd.DataFrame(
    table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge file that looks like this: