I’m trying to retrieve the HTML of the webpage, click the next button, and repeat that action until the last page is reached. I want to collect all of the articles’ headlines (the h2 elements), but I have only managed to retrieve a portion of them. Here is my code:
```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup as bs

options = Options()
driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe")
driver.get('https://www.cnbcindonesia.com/tag/pasar-modal')

while True:
    try:
        time.sleep(4)
        driver.execute_script(
            "arguments[0].click();",
            WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.CSS_SELECTOR, ".icon.icon-angle-right"))
            ),
        )
    except NoSuchElementException:
        break

doc = driver.page_source
soup = bs(doc, 'html.parser')
for word in soup.find_all('h2'):
    print(word.get_text())
```
Here is the result:
Penerbitan Obligasi Korporasi di Kuartal I Capai Rp 30 T
Waskita Karya Terbitkan Obligasi Rp 3,45 Triliun
Baru IPO, Direksi dan Komisaris Kioson Mengundurkan Diri
Rekor Baru IHSG Berpotensi Pecah Lagi Hari Ini
Kelas BPJS Kesehatan Dihapus Juli, Iuran Barunya Jadi Segini?
Pemerintahan Israel Akan Dibubarkan, Apa yang Terjadi?
Massa Geruduk Rumah Yusuf Mansur Terkait Investasi Batu Bara
Harga Batu Bara Terbang 6% Lebih!
Mau Cuan? Coba Cermati Saham Pilihan Berikut Ini
As you can see, it only retrieves a few titles.
Thank you!
Answer
Iterate over https://www.cnbcindonesia.com/tag/pasar-modal/$var?kanal=&tipe=
The website you want to scrape is paginated, so you need to iterate over the pages. You cannot just request the main page (https://www.cnbcindonesia.com/tag/pasar-modal) and get all the data, because the rest of it lives on the subsequent pages.
Replace $var with the page number and request that page's URL instead.
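A minimal sketch of that approach, using requests and BeautifulSoup instead of Selenium. The `?kanal=&tipe=` query string comes from the URL pattern above; the `last_page` value is a placeholder you would read off the site's pagination footer, and the helper names (`page_url`, `extract_headlines`, `scrape_all`) are my own:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.cnbcindonesia.com/tag/pasar-modal"


def page_url(page):
    # Build the paginated URL; the "?kanal=&tipe=" query string follows
    # the pattern shown above and may change if the site is redesigned.
    return f"{BASE}/{page}?kanal=&tipe="


def extract_headlines(html):
    # Parse one page's HTML and return the text of every <h2> headline,
    # mirroring the soup.find_all('h2') loop in the question.
    soup = BeautifulSoup(html, "html.parser")
    return [h2.get_text(strip=True) for h2 in soup.find_all("h2")]


def scrape_all(last_page):
    # Iterate over the page numbers and collect the headlines from each page.
    headlines = []
    for page in range(1, last_page + 1):
        resp = requests.get(page_url(page), timeout=10)
        resp.raise_for_status()
        headlines.extend(extract_headlines(resp.text))
    return headlines
```

You would then call something like `scrape_all(last_page=5)` and print the returned list. This avoids the `NoSuchElementException` issue entirely: there is no next button to click, since each page is fetched directly by URL.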