Take the contents of a tag without taking the contents of its child when web scraping with Python

I am scraping data from a newspaper website using BeautifulSoup. I am trying to collect the news articles and store them in lists, but there are ad slots between the article paragraphs. I want to keep the paragraphs and leave out the ad content.

I thought of using a condition that takes the content only if it is not inside <div class="ads">, but I couldn't find a way to write one.

Here is an example similar to the webpage I am working with. It is simplified, but the problem is the same.

<article>
<p style="text-align:justify"> <strong> Location </strong> News Content 1 </p>

<p style="text-align:justify"> News Content 2 
<div class="ads">
Some random Ad 1
</div>
<br />
News Content 3 <br />
</p>

<p style="text-align:justify"> News Content 4 </p>


</article>

Here is the code snippet I am using to scrape the data from the webpage:

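import bs4

# page is assumed to be the HTTP response fetched earlier (e.g. with requests)
soup = bs4.BeautifulSoup(page.content, 'html.parser')

news = soup.find('div', {'class': 'col-md-8 left-container details'})

News_article = news.find_all('div', {'class': 'news-article'})

for fd in News_article:
  find1 = fd.find_all("p")

  # Concatenate the text of every paragraph (this also picks up the ad text)
  news_body = ""
  for i in find1:
    news_body += i.getText()

  print(news_body)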

What I want

Location
News Content 1
News Content 2
News Content 3
News Content 4

What I am getting

Location
News Content 1
News Content 2
Some random Ad 1
News Content 3
News Content 4

I want to take the content of the paragraphs (the <p> tags) without the content of a div nested inside them. This may be an easy problem, but I have been stuck on it for several days.

Answer

You can remove the unwanted tags with .extract() before reading the paragraph text:

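soup = bs4.BeautifulSoup(page.content, 'html.parser')
news = soup.find('div', {'class': 'col-md-8 left-container details'})
News_article = news.find_all('div', {'class': 'news-article'})

# Detach every ad block from the tree; the tags already found above
# belong to the same tree, so they no longer contain the ads afterwards
ads = soup.find_all('div', class_='ads')
for x in ads:
    x.extract()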
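
To see the effect end to end, here is a minimal sketch that runs the same idea against the simplified HTML from the question rather than the real page. The html string and the lines list are only illustrative names, and the sketch assumes the 'html.parser' backend as in the question (other parsers may restructure a div nested inside a p differently).

```python
import bs4

# Simplified markup copied from the question (stand-in for the real page)
html = """
<article>
<p style="text-align:justify"> <strong> Location </strong> News Content 1 </p>
<p style="text-align:justify"> News Content 2
<div class="ads">
Some random Ad 1
</div>
<br />
News Content 3 <br />
</p>
<p style="text-align:justify"> News Content 4 </p>
</article>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Detach every ad block from the tree before reading any text
for ad in soup.find_all("div", class_="ads"):
    ad.extract()

# Collect the remaining text fragments of each paragraph, whitespace-stripped
lines = []
for p in soup.find_all("p"):
    lines.extend(p.stripped_strings)

news_body = "\n".join(lines)
print(news_body)
```

After the ads are detached, the loop over the paragraphs only sees the article text, so the ad line no longer appears in news_body.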


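If the parsed tree must stay unchanged, the condition mentioned in the question (take text only when it is not inside the ads div) can also be written directly. This is a sketch under the same assumptions as the question's sample HTML and the 'html.parser' backend; outside_ads is a hypothetical helper name, not part of BeautifulSoup.

```python
import bs4

# Simplified markup copied from the question (stand-in for the real page)
html = """
<article>
<p style="text-align:justify"> <strong> Location </strong> News Content 1 </p>
<p style="text-align:justify"> News Content 2
<div class="ads">
Some random Ad 1
</div>
<br />
News Content 3 <br />
</p>
<p style="text-align:justify"> News Content 4 </p>
</article>
"""

soup = bs4.BeautifulSoup(html, "html.parser")


def outside_ads(string):
    """Hypothetical helper: True when the text node has no <div class="ads"> ancestor."""
    return string.find_parent("div", class_="ads") is None


lines = []
for p in soup.find_all("p"):
    # Walk every text node under the paragraph, nested tags included
    for string in p.find_all(string=True):
        text = string.strip()
        if text and outside_ads(string):
            lines.append(text)

print("\n".join(lines))
```

This variant leaves the ad tags in the soup, which may be preferable when the same tree is reused for other extraction steps.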
Answer