I’m trying to extract text and labels (topics) from a webpage with the following code:
import requests
from bs4 import BeautifulSoup

Texts = []
Topics = []
url = 'https://www.unep.org/news-and-stories/story/yes-climate-change-driving-wildfires'
response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
if response.ok:
    soup = BeautifulSoup(response.text, 'lxml')
    # Collect the <p> tags inside each content div
    txt = soup.findAll('div', {'class': 'para_content_text'})
    for div in txt:
        p = div.findAll('p')
        Texts.append(p)
    print(Texts)
    # Collect the <a> tags inside the topics div
    top = soup.find('div', {'class': 'article_tags_topics'})
    a = top.findAll('a')
    Topics.append(a)
    print(Topics)
The code runs without errors, but here is an extract of the output:
</p>, <p><strong>UNEP:</strong> And this is bad news?</p>, <p><strong>NH:</strong> This is bad news. This is bad for our health, for our wallet and for the fabric of society.</p>, <p><strong>UNEP:</strong> The world is heading towards a global average temperature that’s 3<strong>°</strong>C to 4<strong>°</strong>C higher than it was before the industrial revolution. For many people, that might not seem like a lot. What do you say to them?</p>, <p><strong>NH:</strong> Just think about your own body. When your temperature goes up from 36.7°C (98°F) to 37.7°C (100°F), you’ll probably consider taking the day off. If it goes 1.5°C above normal, you’re staying home for sure. If you add 3°C, people who are older and have preexisting conditions – they may die. The tolerances are just as tight for the planet.</p>]] [[<a href="/explore-topics/forests">Forests</a>, <a href="/explore-topics/climate-change">Climate change</a>]]
Since I’m looking for a “clean” text result, I tried adding the following line inside my loop to keep only the text:
p = p.text
but I got:
AttributeError: ResultSet object has no attribute ‘text’. You’re probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
I’ve also noticed that the Topics result includes unwanted URLs. I would like to obtain only the topic names, Forests and Climate change (without a comma between them).
Any idea what I can add to my code to obtain clean text and topics?
Answer
This happens because p is a ResultSet object (a list of tags), not a single tag, so it has no text attribute. You can see this by running the following:
print(type(Texts[0]))
Output:
<class 'bs4.element.ResultSet'>
To get the actual text, you can address each item in each ResultSet directly:
for result in Texts:
    for item in result:
        print(item.text)
Output:
As wildfires sweep across the western United States, taking lives, destroying homes and blanketing the country in smoke, Niklas Hagelberg has a sobering message: this could be America’s new normal. ......
Or even use a list comprehension:
full_text = '\n'.join([item.text for result in Texts for item in result])
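The same idea should work for the topics: take the text of each link rather than the link tag itself. Below is a minimal sketch, assuming the topics div has the structure shown in the question's output. The inline `html` string is a hypothetical stand-in for the real page so the example runs without a network request, and `html.parser` is used only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML mirroring the structure seen in the question's output
html = """
<div class="article_tags_topics">
  <a href="/explore-topics/forests">Forests</a>
  <a href="/explore-topics/climate-change">Climate change</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")  # 'lxml' also works if it is installed

top = soup.find("div", {"class": "article_tags_topics"})

# find() returns None when the div is missing, so guard before calling find_all()
topics = [a.get_text(strip=True) for a in top.find_all("a")] if top else []

print(topics)            # ['Forests', 'Climate change']
print(" ".join(topics))  # Forests Climate change
```

Keeping `topics` as a list of strings and joining only when printing leaves the choice of separator (space, newline, or anything else) to the caller.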