I am working on a scraping project and want to scrape the following HTML:
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only "Christian, Islam" as the output (without the 'Faith:' label).
This is my attempt:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I get this done?
Answer
There are several ways to fix this. I would suggest the following: select all <span> elements inside the <div> that do not have class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
# BeautifulSoup has to be imported (the unused requests import is dropped)
from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')

# Join the text of every <span> in the div except the one with class "h5"
print(', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')]))
Output
Christian, Islam
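If you would rather stay close to your original find() approach, here is one possible alternative. Your attempt returns only "Faith:" because find('span') stops at the first matching <span>. The sketch below collects every <span> with find_all() and skips the one that carries the h5 class. It assumes the built-in 'html.parser' (so lxml is not required), and the names faiths and result are only illustrative.

```python
from bs4 import BeautifulSoup

html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''

# Assumption: the stdlib parser is good enough for this small snippet
soup = BeautifulSoup(html_text, 'html.parser')
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')

# find() returns only the first <span> (the "Faith:" label), so collect
# every <span> with find_all() and skip the one carrying the "h5" class.
faiths = [
    span.get_text(strip=True)
    for span in faithdiv.find_all('span')
    if 'h5' not in span.get('class', [])
]

result = ', '.join(faiths)
print(result)
```

The CSS selector version is shorter, but this form may be easier to extend if you later need extra filtering logic per <span>.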