The page I’m scraping contains the HTML below. How do I remove the comment tag <!-- --> along with its content using bs4?
<div class="foo"> cat dog sheep goat <!-- <p>NewPP limit report Preprocessor node count: 478/300000 Post‐expand include size: 4852/2097152 bytes Template argument size: 870/2097152 bytes Expensive parser function count: 2/100 ExtLoops count: 6/100 </p> --> </div>
Answer
You can use extract() (solution is based on this answer):
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
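from bs4 import BeautifulSoup, Comment

data = """<div class="foo"> cat dog sheep goat <!-- <p>test</p> --> </div>"""

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(data, "html.parser")
div = soup.find('div', class_='foo')

# Comment is a NavigableString subclass, so a string filter can select it
for element in div.find_all(string=lambda text: isinstance(text, Comment)):
    element.extract()

print(soup.prettify())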
As a result you get your div without comments:
<div class="foo"> cat dog sheep goat </div>
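If you need to drop comments from the whole document rather than a single div, the same idea can be wrapped in a small function. This is only a sketch: strip_comments is a hypothetical helper name, not part of bs4, and it assumes a bs4 version recent enough to accept the string= argument (older releases call it text=).

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return html with every comment node removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # Search the whole tree for string nodes that are Comment instances
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        # extract() detaches the node from the tree and returns it
        comment.extract()
    return str(soup)


cleaned = strip_comments('<div class="foo"> cat <!-- <p>test</p> --> dog </div>')
print(cleaned)
```

The surrounding whitespace is left untouched, so removing a comment from the middle of a text run can leave two adjacent spaces behind.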