Selenium – Retrieving html from first page until last page

I'm trying to retrieve the HTML of a web page, click the Next button, and repeat until the last page is reached. I want to collect all of the article headlines (h2), but I only managed to retrieve some of them. Here is my code:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

options = Options()

driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe")
driver.get('https://www.cnbcindonesia.com/tag/pasar-modal')

while True:
    try:
        time.sleep(4)
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".icon.icon-angle-right"))))
    except NoSuchElementException:
        break

doc = driver.page_source

from bs4 import BeautifulSoup as bs

html = doc
soup = bs(html, 'html.parser')

for word in soup.find_all('h2'):
    find_all_title = word.get_text()
    print(find_all_title)

Here is the result:

Penerbitan Obligasi Korporasi di Kuartal I Capai Rp 30 T
Waskita Karya Terbitkan Obligasi Rp 3,45 Triliun
Baru IPO, Direksi dan Komisaris Kioson Mengundurkan Diri
Rekor Baru IHSG Berpotensi Pecah Lagi Hari Ini
Kelas BPJS Kesehatan Dihapus Juli, Iuran Barunya Jadi Segini?
Pemerintahan Israel Akan Dibubarkan, Apa yang Terjadi?
Massa Geruduk Rumah Yusuf Mansur Terkait Investasi Batu Bara
Harga Batu Bara Terbang 6% Lebih!
Mau Cuan? Coba Cermati Saham Pilihan Berikut Ini

As you can see, it only retrieves a few of the titles.

Thank you!

Answer

Iterate over https://www.cnbcindonesia.com/tag/pasar-modal/$var?kanal=&tipe=

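The website you want to scrape is paginated, so you need to iterate over its pages. Loading only the main page (https://www.cnbcindonesia.com/tag/pasar-modal) cannot return everything, because the remaining articles are on later pages. Note also that your code reads driver.page_source only once, after the loop has finished, so only the headlines of whichever page is displayed at that moment get parsed.

Replace $var with the page number and use the resulting URL as the page to scrape, then repeat for each page number.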
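
The iteration could be sketched roughly as follows. This is a minimal illustration, not a tested scraper for this site. It assumes that page numbers start at 1, that the headlines are present in the server-rendered HTML (so no browser is needed), and that the site accepts a plain HTTP request with a browser-like User-Agent. The names page_url, extract_headlines, fetch and scrape_all are hypothetical helpers, and the stop rule (stop when a page adds no headline that was not already seen) is an assumption chosen because sidebar headlines may repeat on every page. It uses only the standard library; BeautifulSoup's find_all('h2'), as in the question, would do the parsing step equally well.

```python
from html.parser import HTMLParser
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# URL pattern from the answer; {page} takes the place of $var.
BASE_URL = "https://www.cnbcindonesia.com/tag/pasar-modal/{page}?kanal=&tipe="


class HeadlineParser(HTMLParser):
    """Collect the text content of every <h2> element."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._buffer = None  # text chunks gathered while inside an <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headlines.append(text)
            self._buffer = None


def page_url(page):
    """Build the URL for one page number (hypothetical helper)."""
    return BASE_URL.format(page=page)


def extract_headlines(html):
    """Return the text of all <h2> elements in an HTML string."""
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines


def fetch(url):
    """Download one page. The User-Agent header is an assumption."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape_all(max_pages=50, fetch_page=fetch):
    """Walk pages 1..max_pages and collect each distinct headline once."""
    seen = set()
    ordered = []
    for page in range(1, max_pages + 1):
        try:
            html = fetch_page(page_url(page))
        except HTTPError:
            break  # assumed: an HTTP error means we ran past the last page
        new = [h for h in extract_headlines(html) if h not in seen]
        if not new:
            break  # assumed stop rule: nothing new on this page
        seen.update(new)
        ordered.extend(new)
    return ordered


if __name__ == "__main__":
    for title in scrape_all():
        print(title)
```

The max_pages cap is only a safety limit so the loop cannot run forever if the stop rule never triggers. If the headlines turn out to be rendered by JavaScript, the same loop can be kept with fetch replaced by a function that calls driver.get(url) and returns driver.page_source.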

User contributions licensed under: CC BY-SA