I am scraping data from a newspaper website using BeautifulSoup. I want to collect the news articles and store them in lists, but there are ad slots in between the article paragraphs. I want to keep the paragraphs and leave out the ad content.
I thought of using a condition that takes the content only if it is not inside <div class="ads">, but I couldn't find a way to do that.
Here is a simplified example of the webpage I am working with; the problem is the same.
<article>
  <p style="text-align:justify"> <strong> Location </strong> News Content 1 </p>
  <p style="text-align:justify"> News Content 2
    <div class="ads"> Some random Ad 1 </div>
    <br /> News Content 3 <br />
  </p>
  <p style="text-align:justify"> News Content 4 </p>
</article>
Here is the code snippet I am using to scrape the data from the webpage
soup = bs4.BeautifulSoup(page.content, 'html.parser')
news = soup.find('div', {'class': 'col-md-8 left-container details'})
News_article = news.find_all('div', {'class': 'news-article'})
for fd in News_article:
    find1 = fd.findAll("p")
    news_body = ""
    for i in find1:
        news_body += i.getText()
    print(news_body)
What I want
Location News Content 1 News Content 2 News Content 3 News Content 4
What I am getting
Location News Content 1 News Content 2 Some random Ad 1 News Content 3 News Content 4
I want to take the content of the paragraphs (the “p” tags) without taking the content of a div inside them. Maybe this is a very easy problem, but I have been trying for quite a few days.
Answer
You can remove the unwanted tags using .extract() before collecting the paragraph text:
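soup = bs4.BeautifulSoup(page.content, 'html.parser')
news = soup.find('div', {'class': 'col-md-8 left-container details'})
News_article = news.find_all('div', {'class': 'news-article'})

# Remove every ad block from the tree before reading any text
ads = soup.find_all('div', class_='ads')
for x in ads:
    x.extract()

# The original loop over News_article now returns only the news text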
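As a minimal, self-contained sketch of the approach above, the snippet below runs on the simplified HTML from the question rather than on the live page. The `html` string and the whitespace normalization are additions for illustration and are not part of the original answer. It assumes the `html.parser` backend, which keeps the ad `<div>` nested inside the `<p>`; other parsers may restructure that invalid nesting.

```python
import bs4

# Simplified markup from the question (an assumption standing in for page.content)
html = """
<article>
<p style="text-align:justify"> <strong> Location </strong> News Content 1 </p>
<p style="text-align:justify"> News Content 2 <div class="ads"> Some random Ad 1 </div> <br /> News Content 3 <br /> </p>
<p style="text-align:justify"> News Content 4 </p>
</article>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# Detach every ad block from the tree so its text no longer belongs to any <p>
for ad in soup.find_all("div", class_="ads"):
    ad.extract()

# Collect the paragraph text and collapse runs of whitespace into single spaces
paragraphs = soup.find("article").find_all("p")
news_body = " ".join(" ".join(p.get_text().split()) for p in paragraphs)

print(news_body)
```

`extract()` returns the removed tag, which is handy if the ad content is needed elsewhere. If it is not, `decompose()` removes and destroys the tag in the same way.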