
How to use Selenium in Databricks, access and move downloaded files to mounted storage, and keep the Chrome and ChromeDriver versions in sync?

I’ve seen a couple of posts on using Selenium in Databricks that use %sh to install ChromeDriver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in Databricks. Even if I changed the download path to a mounted folder on Azure Blob Storage when instantiating Chrome, the file would not be placed there after downloading. There is also the problem of keeping the Chrome browser and ChromeDriver versions in sync automatically, without manually changing the version numbers.

The following links show people with the same problem but no clear answer:

https://forums.databricks.com/questions/19376/if-my-notebook-downloads-a-file-from-a-website-by.html

https://forums.databricks.com/questions/45388/selenium-in-databricks-with-add-experimental-optio.html

Is there a way to identify where the file gets downloaded in Azure Databricks when I do web automation using Selenium Python?

And some show people struggling to get Selenium to run properly at all: https://forums.databricks.com/questions/14814/selenium-in-databricks.html

And a “not in path” error: https://webcache.googleusercontent.com/search?q=cache:NrvVKo4LLdIJ:https://stackoverflow.com/questions/57904372/cannot-get-selenium-webdriver-to-work-in-azure-databricks+&cd=5&hl=en&ct=clnk&gl=us

Is there a clear guide to use Selenium on Databricks and manage downloaded files? And how can I keep the Chrome browser and ChromeDriver versions in sync automatically?


Answer

Here is a guide to installing Selenium, Chrome, and ChromeDriver, and to moving a file downloaded via Selenium to your mounted storage. Run each numbered step in its own notebook cell.

  1. Install Selenium.
%pip install selenium

  2. Do your imports.
import pickle as pkl
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

  3. Download the latest ChromeDriver to the DBFS root storage /tmp/. The curl command fetches the latest ChromeDriver release number and stores it in the version variable.
%sh
version=`curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE`
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip


  4. Unzip the file to a new folder in the DBFS root /tmp/. I tried a non-root path and it did not work.
%sh
unzip /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/

  5. Get the latest Chrome download and install it.
%sh
curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
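Mismatched major versions between Chrome and ChromeDriver are the usual cause of "session not created" errors, so a quick sanity check after installing both is worthwhile. A minimal sketch (these helper names are mine, not part of the guide):

```python
def major_version(version_string):
    """Return the major component of a dotted version string,
    e.g. "91.0.4472.101" -> "91"."""
    return version_string.strip().split(".")[0]

def versions_in_sync(chrome_version, driver_version):
    """Chrome and ChromeDriver are compatible when their major versions match."""
    return major_version(chrome_version) == major_version(driver_version)

# On the cluster you could feed these from the installed binaries, e.g. with
# subprocess: `google-chrome --version` prints "Google Chrome 91.0.4472.101"
# and `/tmp/chromedriver/chromedriver --version` prints "ChromeDriver 91.0.4472.19 (...)".
print(versions_in_sync("91.0.4472.101", "91.0.4472.19"))   # True
print(versions_in_sync("92.0.4515.107", "91.0.4472.19"))   # False
```

Because both the LATEST_RELEASE endpoint and the stable apt package track the current release, the two normally stay in sync when installed together; the check only guards against the rare window where one has updated before the other.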

** Steps 3–5 can be combined into one command. You can also wrap them in a shell script and use it as an init script for your clusters. This is especially useful for job clusters, which are transient, because init scripts run on all nodes (not just the driver) before the job starts. The script below also installs Selenium, so you can skip step 1. Paste it into one cell in a new notebook, run it, then point your cluster's init script setting to dbfs:/init/init_selenium.sh. Every time the cluster spins up, it will install Chrome, ChromeDriver, and Selenium on all nodes before your job begins to run.

%sh
# dbfs:/init/init_selenium.sh
# Quoting the heredoc delimiter ('EOF') keeps $ and backticks from being
# expanded now; they must expand when the init script runs on the cluster.
mkdir -p /dbfs/init
cat > /dbfs/init/init_selenium.sh <<'EOF'
#!/bin/sh
echo "Installing Chrome and ChromeDriver"
version=$(curl -sS https://chromedriver.storage.googleapis.com/LATEST_RELEASE)
wget -N https://chromedriver.storage.googleapis.com/${version}/chromedriver_linux64.zip -O /tmp/chromedriver_linux64.zip
unzip -o /tmp/chromedriver_linux64.zip -d /tmp/chromedriver/
curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb https://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
pip install selenium
EOF
cat /dbfs/init/init_selenium.sh
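Note that the cell writes to /dbfs/init/... while the cluster configuration references dbfs:/init/...: these are the same location, seen through the local FUSE mount versus the DBFS URI scheme. A small sketch of the translation (these helpers are mine, not a Databricks API):

```python
def dbfs_to_local(path):
    """Translate a dbfs:/ URI to the /dbfs FUSE path visible to %sh and Python,
    e.g. "dbfs:/init/init_selenium.sh" -> "/dbfs/init/init_selenium.sh"."""
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return path

def local_to_dbfs(path):
    """Translate a /dbfs FUSE path back to the dbfs:/ URI used in cluster configs."""
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return path

print(dbfs_to_local("dbfs:/init/init_selenium.sh"))   # /dbfs/init/init_selenium.sh
print(local_to_dbfs("/dbfs/init/init_selenium.sh"))   # dbfs:/init/init_selenium.sh
```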

  6. Configure your storage account. This example uses Azure Data Lake Storage Gen2.
service_principal_id = "YOUR_SP_ID"
service_principal_key = "YOUR_SP_KEY"
tenant_id = "YOUR_TENANT_ID"
directory = "https://login.microsoftonline.com/" + tenant_id + "/oauth2/token"
configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": service_principal_id,
           "fs.azure.account.oauth2.client.secret": service_principal_key,
           "fs.azure.account.oauth2.client.endpoint": directory,
           "fs.azure.createRemoteFileSystemDuringInitialization": "true"}

  7. Configure your mounting location and mount.
mount_point = "/mnt/container-data/"
mount_point_main = "/dbfs/mnt/container-data/"
container = "container-data"
storage_account = "adlsgen2"
storage = "abfss://"+ container +"@"+ storage_account + ".dfs.core.windows.net"
utils_folder = mount_point + "utils/selenium/"
raw_folder = mount_point + "raw/"

# dbutils.fs.mounts() reports mount points without a trailing slash,
# so strip it before comparing.
if not any(mount.mountPoint == mount_point.rstrip("/") for mount in dbutils.fs.mounts()):
  dbutils.fs.mount(
    source = storage,
    mount_point = mount_point,
    extra_configs = configs)
  print(mount_point + " has been mounted.")
else:
  print(mount_point + " was already mounted.")
print(f"Utils folder: {utils_folder}")
print(f"Raw folder: {raw_folder}")

  8. Create a method for instantiating the Chrome browser. I need to load a cookies file that I have placed in my utils folder, which points to /mnt/container-data/utils/selenium. Make sure you keep the --no-sandbox, --headless, and --disable-dev-shm-usage arguments.
def init_chrome_browser(download_path, chrome_driver_path, cookies_path, url):
    """
    Instantiates a Chrome browser.

    Parameters
    ----------
    download_path : str
        The download path to place files downloaded from this browser session.
    chrome_driver_path : str
        The path of the ChromeDriver executable.
    cookies_path : str
        The path of the cookie file to load in (.pkl file).
    url : str
        The URL address of the page to initially load.

    Returns
    -------
    Browser
        Returns the instantiated browser object.
    """
    
    options = Options()
    prefs = {'download.default_directory' : download_path}
    options.add_experimental_option('prefs', prefs)
    options.add_argument('--no-sandbox')
    options.add_argument('--headless')
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--start-maximized')
    options.add_argument('window-size=2560,1440')
    print(f"{datetime.now()}    Launching Chrome...")
    browser = webdriver.Chrome(service=Service(chrome_driver_path), options=options)
    print(f"{datetime.now()}    Chrome launched.")
    browser.get(url)
    print(f"{datetime.now()}    Loading cookies...")
    cookies = pkl.load(open(cookies_path, "rb"))
    for cookie in cookies:
        browser.add_cookie(cookie)
    browser.get(url)
    print(f"{datetime.now()}    Cookies loaded.")
    print(f"{datetime.now()}    Browser ready to use.")
    return browser
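The function above assumes a cookies.pkl already exists in the utils folder. One way to produce it, sketched here under the assumption that you have previously run an authenticated browser session (the helper names and save path are examples, not part of the original guide):

```python
import pickle as pkl

def save_cookies(cookies, cookies_path):
    """Persist a list of cookie dicts (as returned by browser.get_cookies())."""
    with open(cookies_path, "wb") as f:
        pkl.dump(cookies, f)

def load_cookies(cookies_path):
    """Load cookies previously saved with save_cookies."""
    with open(cookies_path, "rb") as f:
        return pkl.load(f)

# After logging in during a prior session, you might run something like:
#   save_cookies(browser.get_cookies(),
#                "/dbfs/mnt/container-data/utils/selenium/cookies.pkl")
```

Writing the pickle through the /dbfs FUSE path keeps it in mounted storage, so later cluster runs can load it without re-authenticating.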

  9. Instantiate the browser. Set the download location to the DBFS root file system /tmp/downloads. Make sure the cookies path is prefixed with /dbfs so the full path looks like /dbfs/mnt/...
browser = init_chrome_browser(
    download_path="/tmp/downloads",
    chrome_driver_path="/tmp/chromedriver/chromedriver",
    cookies_path="/dbfs"+ utils_folder + "cookies.pkl",
    url="YOUR_URL"
)

  10. Do your navigating and any downloads you need.

  11. OPTIONAL: Examine your download location. In this example I downloaded a CSV file, so I search the download folder until I find that file format.

import os
import os.path
for root, directories, filenames in os.walk('/tmp'):
    print(root)
    if any(".csv" in s for s in filenames):
        print(filenames)
        break
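Instead of eyeballing the os.walk output, you can resolve the newest matching file programmatically before the copy step. A minimal sketch (the helper name is mine):

```python
import glob
import os

def latest_download(download_dir, pattern="*.csv"):
    """Return the most recently modified file in download_dir matching
    pattern, or None if nothing matches."""
    candidates = glob.glob(os.path.join(download_dir, pattern))
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)
```

For example, `latest_download("/tmp/downloads")` would give you the path to pass to dbutils.fs.cp without hard-coding the file name.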

  12. Copy the file from the DBFS root /tmp to your mounted storage (/mnt/container-data/raw/). You can rename the file during this operation as well. Note that dbutils can only access the root file system via the file: prefix.
dbutils.fs.cp("file:/tmp/downloads/file1.csv", f"{raw_folder}file2.csv")
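One caveat: the copy can run before Chrome has finished writing the file. Chrome stages in-progress downloads as .crdownload files, so a small polling wait before the copy is safer. A sketch with hypothetical timeout values:

```python
import glob
import os
import time

def wait_for_download(download_dir, timeout=60, poll=1):
    """Block until download_dir contains at least one file and no in-progress
    .crdownload file remains. Returns True when the directory looks settled,
    False on timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = os.listdir(download_dir)
        in_progress = glob.glob(os.path.join(download_dir, "*.crdownload"))
        if files and not in_progress:
            return True
        time.sleep(poll)
    return False
```

You would then guard the copy with something like `if wait_for_download("/tmp/downloads"): dbutils.fs.cp(...)`.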

User contributions licensed under: CC BY-SA