I have a txt file filled with multiple URLs; each URL is an article with text and its corresponding SDG (example of one article 1).
The text parts of an article are inside 'div.text.-normal.content' tags and then in 'p', and the SDGs are inside 'div.tax-section.text.-normal.small' and then in 'span'.
To extract them I use the following code:
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if response.ok:
            try:
                soup = BeautifulSoup(response.text, "html.parser")
                # Article body and SDG tags
                text = soup.select_one('div.text-normal').get_text(strip=True)
                topic = soup.select_one('div.tax-section').get_text(strip=True)
                data.append(
                    {
                        'text': text,
                        'topic': topic,
                    }
                )
                pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)
            except AttributeError:
                print(" ")
        time.sleep(3)
But I get no result. I've previously used this code to extract the same type of information from another website with clearer class names. I've also tried entering "div.text.-normal.content" and "div.tax-section.text.-normal.small", but with the same result.
I think the classes I'm calling in this example are wrong. I would like to know what I've missed in these class names.
Answer
To select the text you can go with:
soup.select_one('div.text.-normal.content').get_text(strip=True)
I think there is something wrong with the class names: just chain them with a . for every whitespace in the class attribute.
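Here is a minimal, self-contained sketch of that rule. The HTML snippet is only an assumed mock-up of the structure you described, not the real page:

from bs4 import BeautifulSoup

# Mock-up of the structure from the question: the class attribute
# "text -normal content" contains whitespace, so every whitespace
# becomes a "." in the CSS selector.
html = '''
<div class="text -normal content">
  <p>First paragraph of the article.</p>
  <p>Second paragraph.</p>
</div>
<div class="tax-section text -normal small">
  <span>SDG 3</span>
  <span>SDG 13</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# 'div.text-normal' matches nothing, so select_one() returns None and the
# following get_text() raises the AttributeError you were silently catching.
print(soup.select_one('div.text-normal'))  # None

# Chaining every class with a dot matches the element.
print(soup.select_one('div.text.-normal.content').get_text(strip=True))
print([s.get_text(strip=True)
       for s in soup.select('div.tax-section.text.-normal.small span')])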
or:
soup.select_one('div.c-single-content').get_text(strip=True)
To get the topics as mentioned you can go with:
'^^'.join([topic.get_text(strip=True) for topic in soup.select_one('div.tax-section.text.-normal.small').select('a')])
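For completeness, here is a sketch of how the corrected selectors might slot into your existing loop. The file names and the '^^' separator are taken from the code in this thread; the only changes are the chained selectors, a None check instead of a bare except (so you can see which URLs don't match), and writing the Excel file once at the end instead of on every iteration:

import time

import pandas as pd
import requests
from bs4 import BeautifulSoup

data = []
with open('urls_news.txt', 'r') as inf:
    for row in inf:
        url = row.strip()
        response = requests.get(url, headers={'User-agent': 'Mozilla/5.0'})
        if not response.ok:
            continue
        soup = BeautifulSoup(response.text, 'html.parser')

        text_node = soup.select_one('div.text.-normal.content')
        topic_node = soup.select_one('div.tax-section.text.-normal.small')
        if text_node is None or topic_node is None:
            print(f'selectors did not match: {url}')
            continue

        data.append({
            'text': text_node.get_text(strip=True),
            'topic': '^^'.join(a.get_text(strip=True) for a in topic_node.select('a')),
        })
        time.sleep(3)

pd.DataFrame(data).to_excel('text_2.xlsx', index=False, header=True)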