I’m trying to scrape the job postings from the following public website:
https://newbraunfels.tedk12.com/hire/Index.aspx
I know there are a few similar questions on here, but I’ve followed all of them and still can’t figure it out, as my JavaScript/HTML skills are limited.
I can get the first page with no issues, but can’t seem to access the following three pages.
My best attempt is the following, but it still only returns the first page of listings:
import requests
from bs4 import BeautifulSoup

url = "https://newbraunfels.tedk12.com/hire/Index.aspx"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def load_page(soup, page_num):
    # Build the ASP.NET postback payload for the requested page
    payload = {
        "__EVENTTARGET": "",
        "__EVENTARGUMENT": "PageIndexNumber${}".format(page_num),
    }
    # Copy every form input (including hidden state fields) into the payload
    for inp in soup.select("input"):
        payload[inp["name"]] = inp.get("value")
    soup = BeautifulSoup(requests.post(url, data=payload).content, "lxml")
    return soup

# print job listings from first page:
for jobs in soup.select("table"):
    print(jobs.text)

# load second page
soup = load_page(soup, 2)
for jobs in soup.select("table"):
    print(jobs.text)
Thank you in advance.
Answer
An easier approach in this case might be to query each page directly using GET variables. The “StartIndex” variable should be a multiple of 50, as 50 results are shown on each page. Just increment it by 50 for each page of results you want to scrape.
..etc.
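The incrementing described above could be sketched roughly as follows. This is only an illustration: `BASE_URL` is a hypothetical placeholder (substitute the real results endpoint), the helper name `page_urls` is made up, and starting `StartIndex` at 0 is an assumption. Only the “StartIndex” parameter name and the page size of 50 come from the answer.

```python
from urllib.parse import urlencode

PAGE_SIZE = 50  # the answer states that 50 results show on each page

# Hypothetical placeholder: replace with the site's real results endpoint.
BASE_URL = "https://example.com/hire/search"


def page_urls(base_url, num_pages, page_size=PAGE_SIZE):
    """Return one URL per results page, stepping StartIndex by page_size.

    Assumes the first page corresponds to StartIndex=0.
    """
    urls = []
    for page in range(num_pages):
        query = urlencode({"StartIndex": page * page_size})
        urls.append(f"{base_url}?{query}")
    return urls


# Four pages in total: the first page plus the following three.
for url in page_urls(BASE_URL, 4):
    print(url)
```

Each generated URL can then be fetched with `requests.get(url)` in a loop, which avoids having to replay the form postback at all.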
The returned object is XML, so you will also need to load the document tree into Beautiful Soup so that you can target elements normally. See here for an example:
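A minimal sketch of that parsing step, assuming `lxml` is installed (Beautiful Soup's `"xml"` parser depends on it). The sample document and its `results`, `job`, and `title` tag names are invented for illustration; the real response will use its own element names, so inspect it before choosing what to select.

```python
from bs4 import BeautifulSoup

# Invented sample data: the real response's tag names will differ.
SAMPLE_XML = """<?xml version="1.0"?>
<results>
  <job><title>Teacher</title></job>
  <job><title>Bus Driver</title></job>
</results>"""


def job_titles(xml_text):
    """Parse an XML string and return the text of every <title> element."""
    # The "xml" feature selects lxml's XML parser, which keeps tag names
    # case-sensitive and does not apply HTML-specific fix-ups.
    soup = BeautifulSoup(xml_text, "xml")
    return [tag.get_text(strip=True) for tag in soup.find_all("title")]


print(job_titles(SAMPLE_XML))
```

In practice, `xml_text` would be the `.content` of each paged response rather than a hard-coded string.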