I’m trying to collect news articles from Yahoo Finance using Selenium. I got it to work for a single article, but when I loop over the articles the click never happens. The reason I have the second ‘except’/‘continue’ is that there are ads in between the articles which I don’t want to click. The XPath of an article contains either div[1] or div[2], and the ‘li[]’ index is different for every article (it increases by 1 per article). Does anyone have an idea what I’m doing wrong? Or does someone have a better way to do this?
Here is my current code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time


def getNewsArticles(number):
    newslist = []
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")
    driver.get('https://finance.yahoo.com/topic/stock-market-news')
    time.sleep(2)
    # accept the cookie/consent dialog
    driver.find_element_by_xpath('//*[@id="consent-page"]/div/div/div/div[2]/div[2]/form/button').click()
    for x in range(1, number + 1):
        if x != 1:
            driver.get('https://finance.yahoo.com/topic/stock-market-news')
            time.sleep(4)
        try:
            driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}]/div/div/div[1]/h3/a'.format(x)).click()
            time.sleep(2)
        except:
            try:
                driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}]/div/div/div[2]/h3/a'.format(x)).click()
                time.sleep(2)
            except:
                # skip list items that are ads, not articles
                continue
        text = driver.find_element_by_class_name('caas-body').text
        newslist.append(text)
    return newslist


def main():
    getNewsArticles(5)


if __name__ == "__main__":
    main()
Answer
I suggest you do this with BeautifulSoup instead.
You just need to scrape the page. Selenium is useful when you need a real browser to click, enter values, navigate to other pages, etc.
With BeautifulSoup it will be easier and faster.
You can do something like this:
import requests
from bs4 import BeautifulSoup

news = []
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

url = 'https://finance.yahoo.com/topic/stock-market-news'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# every article headline on the page sits in an <h3> tag
title = soup.find_all('h3')
for i in title:
    news.append(i.get_text())
This is just an example. You can scrape whatever you want, such as the links to the articles, etc.
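For instance, if you also want the article text (the 'caas-body' element your Selenium code reads), a rough sketch could follow the link behind each headline. This assumes each <h3> wraps an <a> with the article's href and that the article body still uses the 'caas-body' class from your original code; the selectors may need adjusting if Yahoo changes the markup:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
base_url = 'https://finance.yahoo.com/topic/stock-market-news'

response = requests.get(base_url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

articles = []
for h3 in soup.find_all('h3'):
    a = h3.find('a')
    if a is None or not a.get('href'):
        continue  # skip headlines that are not linked articles (e.g. ads)
    link = urljoin(base_url, a['href'])  # turn a relative href into an absolute URL

    # fetch the article page and pull the body text;
    # 'caas-body' is the class used in the original Selenium code
    article_page = requests.get(link, headers=headers)
    article_soup = BeautifulSoup(article_page.content, 'html.parser')
    body = article_soup.find(class_='caas-body')

    articles.append({
        'title': a.get_text(strip=True),
        'link': link,
        'text': body.get_text(' ', strip=True) if body else ''
    })

This way there is no browser, no clicking, and no sleeping while pages render, which is why the requests + BeautifulSoup approach tends to be faster for plain article collection.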