
Problem with looping over XPaths in Selenium

I’m trying to collect news articles from Yahoo Finance using Selenium. I got it to work for one article, but when I loop over the articles the click never happens. The reason I have the second ‘except’/‘continue’ is that there are ads in between the articles which I don’t want to click. The XPath of an article ends in either ‘div[1]’ or ‘div[2]’, and the ‘li[]’ index differs for every article (+1 per article). Does someone have any idea what I’m doing wrong? Or does someone have a better way to do this?

Here is my current code:

import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def getNewsArticles(number):
    newslist = []
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe")
    driver.get('https://finance.yahoo.com/topic/stock-market-news')
    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="consent-page"]/div/div/div/div[2]/div[2]/form/button').click()

 
    for x in range(1,number+1):
        if x != 1:
            driver.get('https://finance.yahoo.com/topic/stock-market-news')
            time.sleep(4)
        try:
            driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}}]/div/div/div[1]/h3/a'.format(x)).click()
            time.sleep(2)
        except:
            try: 
                driver.find_element_by_xpath('//*[@id="Fin-Stream"]/ul/li[{}}]/div/div/div[2]/h3/a'.format(x)).click()
                time.sleep(2)
            except:
                continue

        text = driver.find_element_by_class_name('caas-body').text
        newslist.append(text)
    return newslist


def main():
    getNewsArticles(5)
     


if __name__ == "__main__":
    main()
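Incidentally, there is one concrete bug worth checking in the code above: the XPath format strings contain a doubled closing brace (`li[{}}]`). After the `{}` placeholder, the stray `}` makes `str.format` raise `ValueError` before Selenium is ever called, both `try` blocks fail, the bare `except` swallows the error, and every iteration falls through to `continue`. A quick check in plain Python:

```python
# The XPath template copied from the question, doubled brace intact.
template = '//*[@id="Fin-Stream"]/ul/li[{}}]/div/div/div[1]/h3/a'

try:
    template.format(1)
except ValueError as e:
    # A lone '}' in the literal part of a format string is invalid.
    print('format failed:', e)

# Dropping the extra brace produces the XPath that was intended.
fixed = '//*[@id="Fin-Stream"]/ul/li[{}]/div/div/div[1]/h3/a'
print(fixed.format(1))
```

With the extra brace removed, the `.format(x)` call builds `li[1]`, `li[2]`, … as intended, and the `except` branches are left to handle only the genuinely missing elements (the ads).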


Answer

I suggest you do it using BeautifulSoup.

You just need to scrape the webpage. Selenium is useful when you need a real browser to click, input values, navigate to other pages, etc.

With BeautifulSoup it will be easier and faster.

You can do something like this:

import requests
from bs4 import BeautifulSoup

news = []

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}

url = 'https://finance.yahoo.com/topic/stock-market-news'

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

titles = soup.find_all('h3')

for t in titles:
    news.append(t.get_text())

This is just an example. You can scrape whatever you want, such as the links to the articles.
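For instance, the same approach can pull the article links alongside the headlines. A minimal sketch, run here against a small inline HTML fragment that only imitates the `li`/`h3`/`a` nesting from the question (the real Yahoo Finance markup may differ in detail):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Tiny stand-in for the news-stream markup, for illustration only.
html = """
<ul>
  <li><div><h3><a href="/news/article-one.html">First headline</a></h3></div></li>
  <li><div><h3><a href="/news/article-two.html">Second headline</a></h3></div></li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

articles = []
for a in soup.select('h3 a'):
    articles.append({
        'title': a.get_text(strip=True),
        # Relative hrefs need to be joined onto the site root.
        'url': urljoin('https://finance.yahoo.com', a.get('href')),
    })

for article in articles:
    print(article['title'], '->', article['url'])
```

The `soup.select('h3 a')` call uses a CSS selector instead of indexed paths, so ads or extra `li` elements in the stream simply don’t match rather than breaking the loop.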

User contributions licensed under: CC BY-SA