
Web scraping from the span element

I am working on a scraping project and want to scrape values from the following HTML:

<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>

I want to extract only Christian, Islam as the output (without the ‘Faith:’ label).

This is my attempt:

faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()

How can I do this?

Answer

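Your attempt returns only 'Faith:' because find('span') stops at the first matching <span>, which is the label. There are several ways to fix this. I would suggest selecting every <span> inside the <div> that does not have class="h5":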

soup.select('div.spec-subcat.attributes-religion span:not(.h5)')

Example

from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')

# Join the text of every <span> that does not carry the "h5" class
print(', '.join(x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')))

Output

Christian, Islam
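
Since there are several ways to do this, here is a sketch of one alternative that stays closer to the find-based attempt in the question. It assumes the label is the only <span> carrying the h5 class, and it swaps in Python's built-in 'html.parser' so that lxml does not need to be installed. The faiths name is only illustrative.

```python
from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''

# 'html.parser' ships with Python, so no extra parser install is needed
soup = BeautifulSoup(html_text, 'html.parser')

faithdiv = soup.find('div', class_='spec-subcat attributes-religion')

# find_all returns every <span>; skip the one whose class list contains "h5"
faiths = [
    span.get_text(strip=True)
    for span in faithdiv.find_all('span')
    if 'h5' not in span.get('class', [])
]

faith = ', '.join(faiths)
print(faith)
```

If the page may not contain this <div> at all, find returns None, so checking faithdiv before calling find_all avoids an AttributeError.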
User contributions licensed under: CC BY-SA