I am trying to scrape COVID-related data from a website. The data is enclosed in an iframe tag. I tried to scrape the page with BeautifulSoup, but couldn't extract the #document part of the iframe. Here's my approach:
```python
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    coo = s.get("https://www.theguardian.com/", headers={'User-Agent': 'Mozilla/5.0'})
    cookies = dict(coo.cookies)
    url = "https://www.theguardian.com/world/2020/oct/25/covid-world-map-countries-most-coronavirus-cases-deaths"
    webpage = s.get(url, headers={'User-Agent': 'Mozilla/5.0'}, cookies=cookies)
    soup = BeautifulSoup(webpage.content, "html.parser")
    frame = soup.find("iframe", class_="interactive-atom-fence")
    print(frame)
```
My results:

Inspect data from the website:

Can somebody explain why the #document part is missing from my results?
Answer
The #document node you see in DevTools is the iframe's own, separately loaded page, so it never appears in the HTML that requests returns for the outer article. However, The Guardian offers the entire dataset as a .csv file, if you take a look at what's going on in the Developer Tools.

Here's how to grab the data for COVID-19 Global Deaths:
```python
import shutil

import requests

url = "https://interactive.guim.co.uk/2020/coronavirus-jh-timeline-data/time_series_covid19_deaths_global.csv"
data = requests.get(url, stream=True)

if data.status_code == 200:
    with open("covid19_data.csv", 'wb') as f:
        data.raw.decode_content = True
        shutil.copyfileobj(data.raw, f)
```
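If you do want to go through the iframe itself, the usual route is to extract its src attribute and request that URL directly, since that is the page the browser renders as #document. A minimal sketch, using a small inline HTML snippet (the class name matches the question's; the src URL here is made up for illustration, not the article's real embed):

```python
from bs4 import BeautifulSoup

# The outer page only contains the <iframe> tag itself; the #document node
# seen in DevTools is the separate page referenced by the iframe's src.
html = """
<html><body>
  <iframe class="interactive-atom-fence"
          src="https://interactive.guim.co.uk/example/map.html"></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
frame = soup.find("iframe", class_="interactive-atom-fence")
src = frame["src"]
print(src)  # fetch this URL next to get the iframe's own HTML
```

With the real article you would then do `s.get(src, ...)` in the same session to retrieve the iframe's document.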
And if you swap the last part of the URL with `time_series_covid19_confirmed_global.csv`, that's what you're going to get back as a .csv file.
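Once downloaded, the file can be inspected with the standard-library csv module. A sketch using a tiny made-up sample in the Johns Hopkins time-series layout (Province/State, Country/Region, Lat, Long, then one column per date -- the column names here are an assumption, so verify them against the real file's header):

```python
import csv
import io

# Tiny sample imitating the assumed Johns Hopkins time-series layout;
# the real file has many more rows and date columns.
sample = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20
,Italy,41.87,12.56,0,0
,France,46.23,2.21,0,2
"""

# DictReader maps each row to the header names, so date columns can be
# looked up by their label.
rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    print(row["Country/Region"], row["1/23/20"])
```

For the downloaded file you would replace `io.StringIO(sample)` with `open("covid19_data.csv", newline="")`.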