Web scraping from the span element

beautifulsoup python

Devindi Siwurathna

asked 04 Oct, 2021

I am working on a scraping project and I am looking to scrape data from the following HTML.

<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>

I want to extract only "Christian, Islam" as the output (without the "Faith:" label).

This is my attempt:

faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()

How can I get this to work?

Answer

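Your attempt uses find('span'), which returns only the first match, and that is the "Faith:" label. There are several ways to fix this. I would suggest a CSS selector that finds every <span> inside the <div> that does not have class="h5":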

soup.select('div.spec-subcat.attributes-religion span:not(.h5)')

Example

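from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html_text, 'lxml')

# select() returns a list of matching tags; join their text with commas
print(', '.join(x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')))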

Output

Christian, Islam
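
As one of the other possible ways, here is a minimal sketch that stays closer to the code in the question: it keeps soup.find() for the <div>, then uses find_all('span') and skips any span whose class list contains h5. The variable names faiths and result are only illustrative, and 'html.parser' is used here as an assumption so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''

# 'html.parser' ships with Python, so no extra parser package is needed here
soup = BeautifulSoup(html_text, 'html.parser')

faithdiv = soup.find('div', class_='spec-subcat attributes-religion')

# find_all('span') returns every span in the div; the label span is
# filtered out by checking its class list for 'h5'
faiths = [
    span.get_text(strip=True)
    for span in faithdiv.find_all('span')
    if 'h5' not in span.get('class', [])
]

result = ', '.join(faiths)
print(result)  # Christian, Islam
```

This version avoids CSS selectors entirely, which may be easier to follow if you are already working with find() and find_all().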
