I’m trying to collect news articles from Yahoo Finance using Selenium. I got it to work for a single article, but when I loop over the different articles the click never happens. The reason I have the second ‘except’/‘continue’ is that there are ads in between the articles which I don’t want to click. The XPath of an article ends in either something/div[1]/h3/a or something/div[2]/h3/a, and the ‘li[]’ index differs for every article (+1 for each one). Does anyone have an idea what I’m doing wrong? Or does someone have a better way to do this?
Here is my current code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

def getNewsArticles(number):
    newslist = []
    driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe")
    driver.get('https://finance.yahoo.com/topic/stock-market-news')
    time.sleep(2)
    # dismiss the cookie consent page
    driver.find_element_by_xpath('//*[@id="consent-page"]/div/div/div/div[2]/div[2]/form/button').click()
    for x in range(1, number + 1):
        if x != 1:
            driver.get('https://finance.yahoo.com/topic/stock-market-news')
            time.sleep(4)
        try:
            driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}}]/div/div/div[1]/h3/a'.format(x)).click()
            time.sleep(2)
        except:
            try:
                driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}}]/div/div/div[2]/h3/a'.format(x)).click()
                time.sleep(2)
            except:
                # skip the ads that sit between the articles
                continue
        text = driver.find_element_by_class_name('caas-body').text()
        newslist.append(text)
    return newslist

def main():
    getNewsArticles(5)

if __name__ == "__main__":
    main()
Answer
I suggest you do this with BeautifulSoup.
You just need to scrape the page. Selenium is useful when you need a real browser to click, enter values, navigate to other pages, etc.
With BeautifulSoup it will be easier and faster.
You can do something like this:
import requests
from bs4 import BeautifulSoup

news = []
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
url = 'https://finance.yahoo.com/topic/stock-market-news'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

# each article teaser on the listing page has its headline in an <h3> tag
title = soup.find_all('h3')
for i in title:
    news.append(i.get_text())
This is just an example. You can scrape whatever you want, like the links to the articles, etc.
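For instance, if you also need the article text itself (the 'caas-body' content from your Selenium version), you could follow the links inside those h3 tags with requests as well. This is only a rough sketch and rests on assumptions: that each teaser's h3 contains an anchor with a (possibly relative) href, and that the article body still sits in a div with class 'caas-body'; Yahoo's markup can change at any time.

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
url = 'https://finance.yahoo.com/topic/stock-market-news'

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

articles = []
for h3 in soup.find_all('h3'):
    link = h3.find('a')
    if link is None or not link.get('href'):
        continue  # heading without a link, probably not an article teaser
    article_url = urljoin(url, link['href'])  # hrefs on the listing page are usually relative
    article_soup = BeautifulSoup(requests.get(article_url, headers=headers).content, 'html.parser')
    body = article_soup.find('div', class_='caas-body')  # same class you targeted with Selenium
    if body is not None:
        articles.append({'title': link.get_text(strip=True),
                         'url': article_url,
                         'text': body.get_text(separator=' ', strip=True)})

Since nothing is clicked here, the timing issues and the ad slots that break the Selenium loop are much less of a problem.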