How to use Playwright and BeautifulSoup on a web page that has pagination?

I am new to web scraping. I want to scrape the data (comments and their respective dates) from this web page: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938 The topic is split across several pages. This is how I am doing it:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

SEARCH_URL = ('https://search.donanimhaber.com/?q=vodafone&p={page}&token=-1'
              '&hash=56BB9D1746DBCDA94D0B1E5825EFF47D&order=date&in=all&type=both&scope=all&range=all')
TOPIC_URL = "https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False, slow_mo=50)
    page = browser.new_page()
    currentpage = 1

    # Collect the result links from the search page, skipping anchors without an href.
    page.goto(SEARCH_URL.format(page=currentpage), timeout=0)
    soup = BeautifulSoup(page.inner_html("div.results"), 'html.parser')
    links = [a.get('href') for a in soup.find_all('a') if a.get('href')]

    # Only the first result is processed for now; the topic URL is hard-coded while testing.
    for link in links[:1]:
        if link.startswith("/"):
            # page.goto('https://search.donanimhaber.com' + link, timeout=0)
            page.goto(TOPIC_URL)
            soup = BeautifulSoup(page.inner_html("div.kl-icerik"), 'html.parser')

            for reply in soup.find_all('div', {'class': 'ki-cevapicerigi'}):
                # Info line of the reply
                for info in reply.find_all('span', {'class': 'mButon info'}):
                    print(info.text)
                # Body of the reply
                for msg in reply.find_all('span', {'class': 'msg'}):
                    for cell in msg.find_all('td'):
                        print(cell.text)
                    for para in msg.find_all('p'):
                        print(para.text)

    browser.close()

This code works only on the first page: it prints all the comments and their dates, but nothing from pages 2, 3, 4 and so on, which appear as we scroll to the bottom.

How can I do that? Thank you.

Answer

In your case, each page has its own link: the base link followed by a hyphen (-) and the page number.

You can see this by clicking on the second page and comparing your base link with the link you get: https://forum.donanimhaber.com/turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938-2

(Notice the -2 at the end.)

One way to do it would be to change your URL in a for loop, iterating up to 24, and scrape each of those pages individually.
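
A minimal sketch of that loop, reusing the selectors from the question. The names page_url, scrape_topic, BASE_URL and LAST_PAGE are made up for illustration, and the sketch assumes that page 1 is the bare base link, that pages 2 to 24 follow the -N pattern described above, and that the page layout is the same on every page. None of this has been checked against the live site.

```python
BASE_URL = (
    "https://forum.donanimhaber.com/"
    "turkcell-25-000-dk-12-000-sms-70-gb-internet-12-ay-320-tl-ana-konu--151777938"
)
LAST_PAGE = 24  # page count mentioned in the answer; adjust if the topic has grown


def page_url(base_url, page_number):
    """Return the URL of one topic page: page 1 is the base link, page N appends -N."""
    if page_number < 1:
        raise ValueError("page_number must be >= 1")
    return base_url if page_number == 1 else f"{base_url}-{page_number}"


def scrape_topic(base_url, last_page):
    """Visit every page of the topic and collect the text of each reply."""
    # Imported here so that page_url() stays usable without Playwright installed.
    from bs4 import BeautifulSoup
    from playwright.sync_api import sync_playwright

    entries = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        for number in range(1, last_page + 1):
            page.goto(page_url(base_url, number))
            soup = BeautifulSoup(page.inner_html("div.kl-icerik"), "html.parser")
            for reply in soup.find_all("div", {"class": "ki-cevapicerigi"}):
                # Same selectors as in the question: info line and message body.
                info = [s.get_text(strip=True)
                        for s in reply.find_all("span", {"class": "mButon info"})]
                text = [m.get_text(" ", strip=True)
                        for m in reply.find_all("span", {"class": "msg"})]
                entries.append({"page": number, "info": info, "text": text})
        browser.close()
    return entries
```

Calling scrape_topic(BASE_URL, LAST_PAGE) would then return one dictionary per reply. Collecting the results in a list instead of printing them makes it easy to dump everything to JSON afterwards.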

User contributions licensed under: CC BY-SA