I’m trying to create a bot that will download videos from a site named “Sdarot” using Selenium and Python 3.
Each video (or episode) on the site has its own page and URL. When you load an episode, you have to wait 30 seconds for it to “load”, and only then does the <video> tag appear in the HTML source.
The problem is that the request for the video is protected in some way (I don’t really understand how it works). When I simply wait for the video tag to appear and then download the video with the urllib library (see code below), I get the following error: urllib.error.HTTPError: HTTP Error 401: Unauthorized
I should note that when I open the video link in the Selenium driver, it opens completely fine and I can download it manually.
How can I download the videos automatically? Thanks in advance!
Code:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import urllib.request


def load(driver, url):
    driver.get(url)  # open the page in the browser
    try:
        # wait for the episode to "load"
        # if something is wrong and the episode doesn't load after 45 seconds,
        # the function will call itself again and try to load again.
        continue_btn = WebDriverWait(driver, 45).until(
            EC.element_to_be_clickable((By.ID, "proceed"))
        )
    except:
        load(url)


def save_video(driver, filename):
    video_element = driver.find_element_by_tag_name(
        "video")  # get the video element
    video_url = video_element.get_property('src')  # get the video url
    # trying to download the video
    urllib.request.urlretrieve(video_url, filename)
    # ERROR: "urllib.error.HTTPError: HTTP Error 401: Unauthorized"


def main():
    URL = r'https://www.sdarot.dev/watch/339-%D7%94%D7%A4%D7%99%D7%92-%D7%9E%D7%95%D7%AA-ha-pijamot/season/1/episode/23'
    DRIVER = webdriver.Chrome()
    load(DRIVER, URL)
    video_url = save_video(DRIVER, "video.mp4")


if __name__ == "__main__":
    main()
```
Answer
You are getting the Unauthorized error because the site uses cookies to store information about your session, specifically a cookie named Sdarot. I have used the requests library to download and save the video.
The main point is that the link opens fine in Selenium because the browser is the same HTTP client that loaded the page, so it already holds the cookie. When you call urllib, you are using a different HTTP client, so the server sees a brand-new request with no session attached. To overcome this, you have to behave like the browser by providing enough session information, which in this case is kept in cookies.
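To see the effect in isolation, here is a minimal sketch that uses a local stand-in server instead of the real site. The `CookieGate` handler, the `fetch_status` helper and the cookie value `abc123` are all made up for illustration, and the real site's check is certainly more involved. The sketch only shows that the same URL can answer 401 or 200 depending on whether the client sends the cookie.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class CookieGate(BaseHTTPRequestHandler):
    """Toy server (hypothetical): answers 200 only if a 'Sdarot' cookie is sent."""

    def do_GET(self):
        cookie_header = self.headers.get("Cookie", "")
        status = 200 if "Sdarot=" in cookie_header else 401
        self.send_response(status)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_status(url, cookie=None):
    """Return the HTTP status for url, optionally sending a Cookie header."""
    request = urllib.request.Request(url)
    if cookie:
        request.add_header("Cookie", cookie)
    # Ignore any system proxy settings, since the server is on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def demo():
    """Request the same URL without and with the cookie; return both statuses."""
    server = HTTPServer(("127.0.0.1", 0), CookieGate)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/video.mp4"
    try:
        return fetch_status(url), fetch_status(url, "Sdarot=abc123")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())  # (401, 200)
```

The first request plays the role of a plain urllib call, and the second plays the role of the browser that already has a session.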
Check how I have extracted the value of the Sdarot cookie and passed it to the requests.get method. You can do the same with urllib.
```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import requests


def load(driver, url):
    driver.get(url)  # open the page in the browser
    try:
        # wait for the episode to "load"
        # if something is wrong and the episode doesn't load after 45 seconds,
        # the function will call itself again and try to load again.
        continue_btn = WebDriverWait(driver, 45).until(
            EC.element_to_be_clickable((By.ID, "proceed"))
        )
        continue_btn.click()
    except TimeoutException:
        load(driver, url)  # corrected parameter error: pass the driver as well


def save_video(driver, filename):
    # get the video element (the find_element_by_* helpers are gone in
    # recent Selenium 4 releases, so use find_element with a By locator)
    video_element = driver.find_element(By.TAG_NAME, "video")
    video_url = video_element.get_property('src')  # get the video url

    # iterate over all the cookies and extract the one named Sdarot
    cookies = {}
    for entry in driver.get_cookies():
        if entry["name"] == 'Sdarot':
            cookies[entry["name"]] = entry["value"]

    # send the request with the session cookie attached
    r = requests.get(video_url, cookies=cookies, stream=True)
    r.raise_for_status()  # fail loudly on 401 instead of saving an error body

    # start download, writing the response in 1 MiB chunks
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 1024):
            if chunk:
                f.write(chunk)


def main():
    URL = r'https://www.sdarot.dev/watch/339-%D7%94%D7%A4%D7%99%D7%92-%D7%9E%D7%95%D7%AA-ha-pijamot/season/1/episode/23'
    DRIVER = webdriver.Chrome()
    load(DRIVER, URL)
    save_video(DRIVER, "video.mp4")


if __name__ == "__main__":
    main()
```
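For anyone who prefers to stay with urllib, the same idea might look roughly like the sketch below. The helper names (`cookie_header`, `build_request`, `save_video_urllib`) are my own, not part of any library, and I am assuming the Sdarot cookie alone is enough for the server. If it is not, pass more cookie names in `names`.

```python
import shutil
import urllib.request


def cookie_header(browser_cookies, names=("Sdarot",)):
    """Format selected Selenium cookies as a Cookie header value.

    browser_cookies is the list of dicts returned by driver.get_cookies().
    """
    return "; ".join(
        f"{c['name']}={c['value']}" for c in browser_cookies if c["name"] in names
    )


def build_request(video_url, browser_cookies):
    """Create a urllib request that carries the browser's session cookie."""
    request = urllib.request.Request(video_url)
    request.add_header("Cookie", cookie_header(browser_cookies))
    return request


def save_video_urllib(video_url, browser_cookies, filename):
    """Stream the response body to disk without loading it all into memory."""
    request = build_request(video_url, browser_cookies)
    with urllib.request.urlopen(request) as response:
        with open(filename, "wb") as f:
            shutil.copyfileobj(response, f)
```

Inside save_video you would then call `save_video_urllib(video_url, driver.get_cookies(), filename)` in place of the requests code.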