I am new to web scraping. I want to scrape the data (comments and their respective dates) from this web page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The thread is paginated across many pages. This is what I am doing:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import json

AllEntries = []
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    noofforumpagesvodafone = 1000
    currentpage = 1
    page = browser.new_page()
    page.goto('https://search.donanimhaber.com/?q=vodafone&p=' + str(currentpage)
              + '&token=-1&hash=56BB9D1746DBCDA94D0B1E5825EFF47D'
              + '&order=date&in=all&type=both&scope=all&range=all', timeout=0)
    html = page.inner_html("div.results")
    soup = BeautifulSoup(html, 'html.parser')
    xx = [x.get('href') for x in soup.find_all('a')]
    xxi = 0
    time = []
    while xxi < 1:
        if xx[xxi][0] == "/":
            entry = []
            # page.goto('https://search.donanimhaber.com' + str(xx[xxi]), timeout=0)
            page.goto("https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938")
            html = page.inner_html("div.kl-icerik")
            soup = BeautifulSoup(html, 'html.parser')
            for table in soup.find_all('div', {'class': 'ki-cevapicerigi'}):
                for t in table.find_all('span', {'class': 'mButon info'}):
                    print(t.text)  # date of the comment
                for links in table.find_all('span', {'class': 'msg'}):
                    for link in links.find_all('td'):
                        print(link.text)  # comment text (table layout)
                    for linko in links.find_all('p'):
                        print(linko.text)  # comment text (paragraph layout)
        xxi += 1  # advance to the next result so the loop terminates
This code works only on the first page: it prints all comments and dates correctly, but not those from pages 2, 3, 4, and so on, which appear as you scroll to the bottom.
How can I do that? Thank you.
Answer
In your case, each page has its own link: the base link followed by the page number, with a hyphen (-) in between.
You can see this behaviour by clicking on the second page and comparing your base link with the link you have now: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2
(notice the -2 at the end)
One way to do it would be to build the URL in a for loop, iterating up to page 24, and scrape each of those pages individually.
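Here is a minimal sketch of that loop, reusing the selectors from your own code. The page count of 24 is what the thread shows today and will grow as new replies arrive, and whether page 1 also accepts a "-1" suffix is untested, so the base link is used for the first page:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

base_url = "https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for pageno in range(1, 25):
        # Page 1 is the base link itself; later pages append "-<n>".
        url = base_url if pageno == 1 else f"{base_url}-{pageno}"
        page.goto(url, timeout=0)
        html = page.inner_html("div.kl-icerik")
        soup = BeautifulSoup(html, "html.parser")
        for reply in soup.find_all("div", {"class": "ki-cevapicerigi"}):
            for date in reply.find_all("span", {"class": "mButon info"}):
                print(date.text)  # date of the comment
            for msg in reply.find_all("span", {"class": "msg"}):
                for cell in msg.find_all("td"):
                    print(cell.text)  # comment text (table layout)
                for para in msg.find_all("p"):
                    print(para.text)  # comment text (paragraph layout)
    browser.close()

If you want to avoid hard-coding 24, you could instead read the highest page number from the thread's pagination links before starting the loop.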