First, sorry for my poor English.
I have a Python script that scrapes a web page to collect its comments.
It currently scrapes every message on the page, but I only want to scrape the last post.
How can I do this?
I would also like to find any web links posted in the last message, as complete URLs.
Is that possible?
Here are the page link and the script:
https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999
#!/usr/bin/env python3
# https://www.jeuxvideo.com/forums/42-47-66784467-1-0-1-0-aide-scraping-python-forum-dealabs.htm
# scraping_dealabs.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999"

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get(url)

# Accept cookies
button = WebDriverWait(driver, 2).until(
    EC.element_to_be_clickable((By.XPATH, "/html/body/main/div[4]/div[1]/div/div[1]/div[2]/button[2]/span"))
)
button.click()

# Find the comments and print their text
comments = driver.find_elements_by_class_name("commentList-item")
for comment in comments:
    _id = comment.get_attribute("id")
    author = comment.find_element_by_class_name('userInfo-username').text
    content = comment.find_element_by_class_name('userHtml-content').text
    timestamp = comment.find_element_by_class_name('text--color-greyShade').text
    comment_url = f"{url}#{_id}"
    print("Posté par", author)
    print(content)
    print("Publication:", timestamp)
    print("Lien du commentaire:")
    print(comment_url)
    print('-' * 30)

driver.close()
Thanks for your time and reply!
Answer
First, I recommend using a more robust locator. An absolute XPath breaks as soon as the page layout changes, so instead of /html/body/main/div[4]/div[1]/div/div[1]/div[2]/button[2]/span
try this CSS selector: .btn--mode-primary.overflow--wrap-on
To get the last comment, you can use this XPath: (//div[@class='commentList-item'])[last()]
To get only the last comment's details, your code can be modified like this:
#!/usr/bin/env python3
# https://www.jeuxvideo.com/forums/42-47-66784467-1-0-1-0-aide-scraping-python-forum-dealabs.htm
# scraping_dealabs.py
import time  # needed for time.sleep() below

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999"

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get(url)
actions = ActionChains(driver)

# Accept cookies
WebDriverWait(driver, 2).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn--mode-primary.overflow--wrap-on"))
).click()

# Move to the last comment, then pause briefly
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-item'])[last()]")
actions.move_to_element(last_comment).perform()
time.sleep(0.5)

# Locate the last comment again and read its details
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-item'])[last()]")
_id = last_comment.get_attribute("id")
author = last_comment.find_element_by_xpath(".//span[contains(@class,'userInfo-username')]").text
content = last_comment.find_element_by_xpath(".//*[contains(@class,'userHtml-content')]").text
timestamp = last_comment.find_element_by_xpath(".//*[contains(@class,'text--color-greyShade')]").text
comment_url = f"{url}#{_id}"

print("Posté par", author)
print(content)
print("Publication:", timestamp)
print("Lien du commentaire:")
print(comment_url)
print('-' * 30)

driver.close()
UPD
To get the last element on the page, as you described in the comments, you have to change the locator from
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-item'])[last()]")
to
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-comment'])[last()]")
The complete code then becomes:
#!/usr/bin/env python3
# https://www.jeuxvideo.com/forums/42-47-66784467-1-0-1-0-aide-scraping-python-forum-dealabs.htm
# scraping_dealabs.py
import time  # needed for time.sleep() below

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = "https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999"

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get(url)
actions = ActionChains(driver)

# Accept cookies
WebDriverWait(driver, 2).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, ".btn--mode-primary.overflow--wrap-on"))
).click()

# Move to the last comment, then pause briefly
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-comment'])[last()]")
actions.move_to_element(last_comment).perform()
time.sleep(0.5)

# Locate the last comment again and read its details
last_comment = driver.find_element_by_xpath("(//div[@class='commentList-comment'])[last()]")
_id = last_comment.get_attribute("id")
author = last_comment.find_element_by_xpath(".//span[contains(@class,'userInfo-username')]").text
content = last_comment.find_element_by_xpath(".//*[contains(@class,'userHtml-content')]").text
timestamp = last_comment.find_element_by_xpath(".//*[contains(@class,'text--color-greyShade')]").text
comment_url = f"{url}#{_id}"

print("Posté par", author)
print(content)
print("Publication:", timestamp)
print("Lien du commentaire:")
print(comment_url)
print('-' * 30)

driver.close()
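The question also asked for the full web links posted in the last message. One possible approach, sketched here with the standard library only, is to take the HTML of the comment body and read the href attribute of each link rather than its visible text, since the visible text of a link is not necessarily the complete URL. The names LinkCollector and extract_links and the sample HTML are hypothetical and only for illustration; the real markup of the page may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # urljoin leaves absolute URLs untouched and completes relative ones
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_fragment, base_url):
    """Return the absolute URLs of all links found in an HTML fragment."""
    collector = LinkCollector(base_url)
    collector.feed(html_fragment)
    collector.close()
    return collector.links


# Hypothetical sample standing in for the HTML of the last comment's body
sample_html = (
    '<p>Look: <a href="https://example.com/some/long/path?x=1">example.com/some...</a> '
    'and <a href="/discussions/other-thread">another thread</a></p>'
)
base = "https://www.dealabs.com/discussions/suivi-erreurs-de-prix-1063390?page=9999"

for link in extract_links(sample_html, base):
    print(link)
```

In the script above, the HTML fragment could be obtained from the element located with the userHtml-content XPath by calling get_attribute("innerHTML") on it instead of reading .text, and the url variable could serve as the base URL.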