Web scraping a particular element of HTML

I’m having trouble scraping information from government travel advice websites for a research project I’m doing in Python.

I’ve picked the Turkey page but the logic could extend to any country.

The site is https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security

The code I’m using is:

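import requests
from bs4 import BeautifulSoup

# fetch the page; the URL must stay on one line inside the string literal
page = requests.get("https://www.gov.uk/foreign-travel-advice/turkey/safety-and-security")
page
soup = BeautifulSoup(page.content, 'html.parser')
# every <p> tag on the page, then the text of the first one
soup.find_all('p')
soup.find_all('p')[0].get_text()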

At the moment this extracts every paragraph on the page. Having inspected the website, the information I am interested in is located in:

<div class="govuk-govspeak direction-ltr">
  <p>

Does anyone know how to amend the code above to extract only that part of the HTML?

Thanks

Answer

If you are only interested in the data inside the element with the classes govuk-govspeak and direction-ltr, you can try these steps:

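Beautiful Soup supports the most commonly used CSS selectors. Just pass a string into the .select() method of a Tag object or the BeautifulSoup object itself. Prefix a class name with . and an id with #. Writing two class names back to back with no space, as in .govuk-govspeak.direction-ltr, matches elements that carry both classes.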
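As a minimal sketch of that selector syntax, the snippet below runs .select() on a small inline HTML string instead of the live page. The markup, the id intro, and the variable names are made up for illustration; only the two class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; only the class names match the question.
html = """
<div id="intro" class="govuk-govspeak direction-ltr"><p>First</p><p>Second</p></div>
<div class="direction-ltr"><p>Other</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".direction-ltr")                 # any element with this class
by_both = soup.select(".govuk-govspeak.direction-ltr")   # elements carrying both classes
by_id = soup.select("#intro")                            # the element with id="intro"
nested = soup.select(".govuk-govspeak.direction-ltr p")  # <p> tags inside that element

print(len(by_class), len(by_both), len(by_id), len(nested))
```

A space inside the selector means "descendant of", which is why the last selector returns only the paragraphs inside the matching div.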

data = soup.select('.govuk-govspeak.direction-ltr')

# extract h3 tags
h3_tags = data[0].select('h3')
print(h3_tags)
[<h3 id="local-travel---syrian-border">Local travel - Syrian border</h3>, <h3 id="local-travel--eastern-provinces">Local travel – eastern provinces</h3>, <h3 id="political-situation">Political situation</h3>,...]

# extract p tags
p_tags = data[0].select('p')
print(p_tags)
[<p>The FCO advise against all travel to within 10 ...]
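To get plain text instead of tags, as the get_text() call in the question does, each selected tag can be passed through get_text(). This is a sketch that assumes the page still uses the class names from the question. The inline HTML stands in for page.content so the example runs without network access, and its text is placeholder content.

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for page.content; the text is invented for illustration.
html = """
<p>Navigation text outside the target div</p>
<div class="govuk-govspeak direction-ltr">
  <h3 id="political-situation">Political situation</h3>
  <p>  First paragraph.  </p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match, or None if nothing matches.
container = soup.select_one(".govuk-govspeak.direction-ltr")
paragraphs = []
if container is not None:
    # strip=True trims leading and trailing whitespace from each paragraph's text.
    paragraphs = [p.get_text(strip=True) for p in container.select("p")]

print(paragraphs)
```

With the live page, html would be replaced by page.content from the requests.get call in the question. The None check guards against the site changing its class names.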