I’m having trouble scraping information from government travel advice websites for a research project I’m doing in Python.
I’ve picked the Turkey page but the logic could extend to any country.
The site is "https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security"
The code I’m using is:
    import requests
    from bs4 import BeautifulSoup

    page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
    page
    soup = BeautifulSoup(page.content, 'html.parser')
    soup.find_all('p')
    soup.find_all('p')[0].get_text()
At the moment this extracts every <p> tag on the page, not just the part I need. Having inspected the website, the information I am interested in is located in:
<div class="govuk-govspeak direction-ltr"> <p>
Does anyone know how to amend the code above to only extract that part of the html?
Thanks
Answer
If you are only interested in data located inside the govuk-govspeak direction-ltr class, you can try these steps:
Beautiful Soup supports the most commonly used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself. For a class use . and for an id use #. Because the div carries two classes, chain them with no space between them:
    data = soup.select('.govuk-govspeak.direction-ltr')

    # extract h3 tags
    h3_tags = data[0].select('h3')
    print(h3_tags)
    # [<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]

    # extract p tags
    p_tags = data[0].select('p')
    print(p_tags)
    # [<p>The FCO advise against all travel to within 10 ...]
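To try the selector approach without hitting the network, here is a minimal sketch. The SAMPLE_HTML string is a made-up, simplified stand-in for the page's markup (the real page is much larger), and extract_paragraphs is a hypothetical helper name, not part of Beautiful Soup. It combines both steps into one selector: the two chained classes pick out the div, and the trailing " p" limits the match to paragraphs inside it.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page's markup (not the real page).
SAMPLE_HTML = """
<html><body>
  <p>Navigation text outside the content area</p>
  <div class="govuk-govspeak direction-ltr">
    <h3 id="political-situation">Political situation</h3>
    <p>First paragraph of advice.</p>
    <p>Second paragraph of advice.</p>
  </div>
</body></html>
"""


def extract_paragraphs(html):
    """Return the text of every <p> inside the govspeak container."""
    soup = BeautifulSoup(html, "html.parser")
    # ".a.b" matches an element carrying both classes; the trailing " p"
    # restricts the match to <p> descendants of that element.
    paragraphs = soup.select(".govuk-govspeak.direction-ltr p")
    return [p.get_text(strip=True) for p in paragraphs]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)
# ['First paragraph of advice.', 'Second paragraph of advice.']
```

For the live page, the same helper should work on the response body from the question, i.e. extract_paragraphs(page.content), assuming the site still uses those class names.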