I’m trying to retrieve the HTML of the webpage, click the next button, and repeat that action until the last page is reached. I want to get all of the articles’ headlines (h2), but I have only managed to retrieve a portion of them. Here is my code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as Wait
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException

options = Options()

driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe")
driver.get('https://www.cnbcindonesia.com/tag/pasar-modal')

while True:
    try:
        time.sleep(4)
        driver.execute_script("arguments[0].click();", WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.CSS_SELECTOR, ".icon.icon-angle-right"))))
    except NoSuchElementException:
        break

doc = driver.page_source

from bs4 import BeautifulSoup as bs

html = doc
soup = bs(html, 'html.parser')

for word in soup.find_all('h2'):
    find_all_title = word.get_text()
    print(find_all_title)
Here is the result:
Penerbitan Obligasi Korporasi di Kuartal I Capai Rp 30 T
Waskita Karya Terbitkan Obligasi Rp 3,45 Triliun
Baru IPO, Direksi dan Komisaris Kioson Mengundurkan Diri
Rekor Baru IHSG Berpotensi Pecah Lagi Hari Ini
Kelas BPJS Kesehatan Dihapus Juli, Iuran Barunya Jadi Segini?
Pemerintahan Israel Akan Dibubarkan, Apa yang Terjadi?
Massa Geruduk Rumah Yusuf Mansur Terkait Investasi Batu Bara
Harga Batu Bara Terbang 6% Lebih!
Mau Cuan? Coba Cermati Saham Pilihan Berikut Ini
As you can see, it only retrieves a few of the titles.
Thank you!
Answer
Iterate over https://www.cnbcindonesia.com/tag/pasar-modal/$var?kanal=&tipe=
The website you want to scrape is paginated, so you need to iterate over the pages. You cannot just hit the main page (https://www.cnbcindonesia.com/tag/pasar-modal) and get all the data, because the rest of the articles live on the subsequent pages.
Replace $var with the page number and load that page as the URL you want to scrape.
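Here is a minimal sketch of that approach, keeping your Selenium + BeautifulSoup stack. It assumes the URL pattern above is correct, that the headlines sit in h2 tags as in your script, and that a page past the last one lists no headlines; that stop condition is an assumption, not something the site guarantees.

import time
from bs4 import BeautifulSoup as bs
from selenium import webdriver

driver = webdriver.Chrome("C:/Users/krish/Desktop/chromedriver_win32/chromedriver.exe")

# Assumption: the page number goes where $var sits in the paginated URL.
BASE_URL = 'https://www.cnbcindonesia.com/tag/pasar-modal/{}?kanal=&tipe='

all_titles = []
page = 1
while True:
    driver.get(BASE_URL.format(page))
    time.sleep(2)  # brief pause for the page to load, as in your original script

    soup = bs(driver.page_source, 'html.parser')
    titles = [h2.get_text(strip=True) for h2 in soup.find_all('h2')]

    # Assumption: an out-of-range page lists no article headlines, so stop there.
    if not titles:
        break

    all_titles.extend(titles)
    page += 1

driver.quit()

for title in all_titles:
    print(title)

If empty pages still contain unrelated h2 headings, the empty-list check will never fire; in that case you could instead read the last page number from the site's pagination links and loop with a plain for page in range(1, last_page + 1).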