The page I'm scraping contains the HTML below. How do I remove the comment (<!-- -->) along with its contents using bs4?
<div class="foo">
cat dog sheep goat
<!--
<p>NewPP limit report
Preprocessor node count: 478/300000
Post‐expand include size: 4852/2097152 bytes
Template argument size: 870/2097152 bytes
Expensive parser function count: 2/100
ExtLoops count: 6/100
</p>
-->
</div>
Answer
You can use extract() (this solution is based on this answer):
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
from bs4 import BeautifulSoup, Comment

data = """<div class="foo">
cat dog sheep goat
<!--
<p>test</p>
-->
</div>"""

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(data, "html.parser")

div = soup.find('div', class_='foo')
# Comment is a NavigableString subclass, so filter the div's strings by type.
for element in div.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())
As a result you get your div without comments:
<div class="foo">
cat dog sheep goat
</div>
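If you want to drop comments from the whole document rather than from a single div, the same idea can be wrapped in a small function. This is only a sketch: `strip_comments` is a hypothetical helper name, not part of bs4, and it assumes the built-in "html.parser" is good enough for your pages.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return the markup with every HTML comment removed (hypothetical helper)."""
    # Assumption: the stdlib-backed "html.parser" is used; swap in another parser if needed.
    soup = BeautifulSoup(html, "html.parser")
    # find_all() returns a list, so the tree is not modified while it is being searched.
    comments = soup.find_all(string=lambda s: isinstance(s, Comment))
    for comment in comments:
        comment.extract()  # detaches the comment and returns it; the result is ignored here
    return str(soup)


html = '<div class="foo">cat <!-- <p>hidden</p> --> dog</div>'
cleaned = strip_comments(html)
print(cleaned)
```

Because the markup inside a comment is stored as a single Comment string and never parsed into tags, removing the comment also removes the `<p>` it contains.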