I am trying to scrape a website like this: https://seeksophie.com/options/1-5hr-basic-candle-workshop. From this website, I'd like to get all date schedules (for 1 year) for the activity; all of the dates on the website are rendered as span elements. It is important for me to get the notAllowed and flatpickr-disabled classes from those elements, because I will have to filter the available dates from all of them using those attributes. While I'm at it, I also have to get all the times available for a given date (help with that would be very much appreciated), but I think getting the spans is the first priority.
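For reference, the kind of filter I have in mind looks roughly like this (just a sketch; is_available is a hypothetical helper, and it assumes span is a BeautifulSoup Tag for one day cell):

def is_available(span):
    # Sketch only: treat a day span as available when it carries
    # neither of the disabling classes mentioned above.
    classes = span.get("class", [])
    return "notAllowed" not in classes and "flatpickr-disabled" not in classes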
My approach for this is to iteratively click the next-month button and collect all the spans along the way. Something like this:
def find_all_span(self, soup):
    new_soup = soup.__copy__()
    all_spans = []
    for i in range(12):
        days_container = new_soup.find_all("div", {"class": "dayContainer"})
        spans = days_container[2].find_all("span")
        all_spans.extend(spans)
        next_month_clicker = self.page_loader.driver.find_element_by_id(
            "js-placeholder-booking-form-accommodation-date")
        self.page_loader.driver.execute_script("arguments[0].click();", next_month_clicker)
        next_month_clicker = self.page_loader.driver.find_elements_by_class_name("flatpickr-next-month")
        self.page_loader.driver.execute_script("arguments[0].click();", next_month_clicker[2])
        page_response = self.page_loader.driver.page_source
        new_soup = BeautifulSoup(page_response, 'html.parser')
        for span in spans:
            print(span["aria-label"])
    return list(set(all_spans))
Note that soup is exactly the page response parsed by BeautifulSoup with the HTML parser. This only collects the spans for roughly one month, and the click doesn't change the page response, so I never get the spans for the following months. What can I do to solve this? Any other approach would also be okay.
Answer
Finally, after 3 hours :) I am not going to explain everything that is wrong in your script; instead, I will explain my code.
I have to execute all those JavaScript snippets because the website does not let me click the next-month button otherwise (i.e., if it works fine without executing those scripts, you can delete those JavaScript lines). You are using html.parser as the parser, but I am using lxml because it is faster than html.parser. Everything else is straightforward: just click the next-month button and scrape the spans from the page source. You can then do other things with these spans.
Here's the code:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # any Selenium WebDriver works here; Chrome is just an example

driver.get('https://seeksophie.com/options/1-5hr-basic-candle-workshop')
# Remove the fixed bottom booking bar so it does not block the click on the next-month button.
driver.execute_script("""document.querySelector("#js-booking-bottom-bar").remove()""")

# The "next month" arrow of the booking calendar.
n = driver.find_element_by_xpath(
    "/html/body/div[3]/div[4]/div[5]/div/div/div[2]/div/div[1]/div/div[2]/div[1]/span[2]")

all_spans = []
for i in range(12):
    # Parse the currently displayed month and collect its day spans.
    page = driver.page_source
    soup = BeautifulSoup(page, "lxml")
    all_spans.extend(soup.find_all("div", class_="dayContainer")[1].find_all("span"))
    try:
        # If the first-order bonus modal pops up, remove it and its backdrop,
        # otherwise it blocks the click on the next-month button.
        driver.execute_script("""document.querySelector("#js-modal-first-order-bonus").remove()""")
        driver.execute_script("""document.querySelector(".modal-backdrop").remove()""")
    except Exception:
        pass
    n.click()

print(all_spans)
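For example, to pull only the available dates out of all_spans, you could drop the spans that carry the notAllowed or flatpickr-disabled class and read the aria-label of the rest (a rough sketch, not tested against the live page):

# Rough sketch: keep only day spans without the disabling classes and
# de-duplicate them by their aria-label (the human-readable date).
available_dates = []
seen = set()
for span in all_spans:
    classes = span.get("class", [])
    if "notAllowed" in classes or "flatpickr-disabled" in classes:
        continue
    label = span.get("aria-label")
    if label and label not in seen:
        seen.add(label)
        available_dates.append(label)
print(available_dates)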
And finally, if this helps with your problem, don't forget to mark it as the answer.