I’m having trouble scraping information from government travel advice websites for a research project I’m doing in Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is “https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security“
The code I’m using is:
import requests
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('p')
soup.find_all('p')[0].get_text()
At the moment this extracts every paragraph in the page's HTML. Having inspected the website, the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr">
<p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks
Answer
If you are only interested in the data inside the element with the classes govuk-govspeak and direction-ltr, you can try these steps:
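Beautiful Soup supports the most commonly used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself. For a class use . and for an id use #. Writing two class selectors back to back with no space between them matches only elements that carry both classes.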
data = soup.select('.govuk-govspeak.direction-ltr')
# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]
# extract p tags
p_tags = data[0].select('p')
print(p_tags)
[<p>The FCO advise against all travel to within 10 ]
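To get plain text rather than a list of tags, the same selectors can be wrapped in a small helper. This is only a minimal sketch: SAMPLE_HTML is a made-up stand-in for page.content so that it runs without a network call, and extract_advice is a hypothetical name, not something from Beautiful Soup. It assumes the page keeps the govuk-govspeak direction-ltr container described in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
  <p>Site-wide banner text</p>
  <div class="govuk-govspeak direction-ltr">
    <h3 id="political-situation">Political situation</h3>
    <p>First paragraph of advice.</p>
    <p>Second paragraph of advice.</p>
  </div>
</body></html>
"""


def extract_advice(html):
    """Return (headings, paragraphs) as text from the govspeak container."""
    soup = BeautifulSoup(html, "html.parser")
    # Chained class selector: the element must carry both classes.
    container = soup.select_one(".govuk-govspeak.direction-ltr")
    if container is None:
        # The container is missing, e.g. the page layout has changed.
        return [], []
    headings = [h.get_text(strip=True) for h in container.select("h3")]
    paragraphs = [p.get_text(strip=True) for p in container.select("p")]
    return headings, paragraphs


headings, paragraphs = extract_advice(SAMPLE_HTML)
print(headings)
print(paragraphs)

# An id selector uses '#' instead of '.'.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.select_one("#political-situation").get_text())
```

With the real page you would pass page.content to extract_advice instead of SAMPLE_HTML. Paragraphs outside the container, such as the banner in the sample, are left out because the search starts from the container rather than from the whole document.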