Take the contents of a tag without taking the contents of its children when web scraping with Python

I am scraping data from a newspaper website using BeautifulSoup. I am trying to take the news articles and store them in lists, but there are ad slots in between the article paragraphs. I want to take the paragraphs and leave out the ad content.

I thought of using a condition that takes the content only if it is not inside a <div class="ads">, but I couldn't find a way to do that.

Here is an example similar to the webpage I am working with. It is simplified, but the problem is the same.

JavaScript

Here is the code snippet I am using to scrape the data from the webpage:

JavaScript

What I want

JavaScript

What I am getting

JavaScript

I want to take the content of the paragraphs ("p" tags) without taking the content of a div inside them. Maybe this is a very easy problem, but I have been trying for several days.

Answer

You can remove the unwanted tags using .extract():

JavaScript
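As a minimal sketch of the .extract() approach: the markup below is a hypothetical stand-in for the newspaper page (the "article" class name and the sample text are assumptions; only the "ads" class comes from the question). It removes every <div class="ads"> from the parsed tree first, then collects the paragraph text. The built-in "html.parser" is used here because it keeps a div nested inside a p as written; other parsers may restructure that nesting.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: one ad slot sits
# between paragraphs and another is nested inside a paragraph.
html = """
<div class="article">
  <p>First paragraph.</p>
  <div class="ads">Ad between paragraphs</div>
  <p>Second paragraph.<div class="ads">Ad inside a paragraph</div></p>
  <p>Third paragraph.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Detach every ad container from the tree. select() returns a list,
# so removing tags while looping over it is safe.
for ad in soup.select("div.ads"):
    ad.extract()

# With the ads gone, the paragraph text no longer includes ad content.
paragraphs = [p.get_text(strip=True) for p in soup.select("div.article p")]
print(paragraphs)
```

.extract() returns the removed tag, which is useful if the ad content needs to be inspected later. If it is never needed again, .decompose() removes the tag and discards it instead.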
User contributions licensed under: CC BY-SA