Given a Spark dataframe with the following columns, I am trying to construct an incremental/running count for each id based on when the contents of the event column evaluate to True. Here, a new column called results would be created containing the incremental count. I’ve tried using window functions but am stumped at this point. Ideally, the solution would…
Tag: pyspark
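A minimal sketch of the running count with a window function, assuming hypothetical columns id, ts (for ordering), and a boolean event column: cast the boolean to an integer and take a cumulative sum over a window partitioned by id.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, True), (1, 2, False), (1, 3, True), (2, 1, True)],
    ["id", "ts", "event"],  # ts is an assumed ordering column
)

# Cumulative window per id, ordered by the (assumed) ts column.
w = Window.partitionBy("id").orderBy("ts").rowsBetween(
    Window.unboundedPreceding, Window.currentRow
)

# Running count of True events: cast the boolean to int and sum it.
df.withColumn("results", F.sum(F.col("event").cast("int")).over(w)).show()
```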
Combining WHEN and aggregation functions
I need to convert this PySpark SQL code sample into fully DataFrame-based code without SQL expressions. I tried, but got: TypeError: condition should be a Column. Obviously, it’s not working. What am I doing wrong? Any suggestion will be appreciated! Answer: Use isNull to check for null, not Python’s is None:
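A self-contained sketch of the isNull fix, with a hypothetical some_col column: writing `F.col("some_col") is None` is a Python identity test that returns a plain bool, which is exactly what triggers the TypeError.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (1, "x"), (2, None)], ["id", "some_col"])

# isNull() builds the Column expression that when() expects;
# `... is None` evaluates in Python and yields a bool, not a Column.
agg = df.groupBy("id").agg(
    F.count(F.when(F.col("some_col").isNull(), 1)).alias("null_count")
)
agg.show()
```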
Filter expected value from list in df column
I have a data frame with the following column: I want to return a column with a single value based on a conditional statement. I wrote the following function: When running the function on the column with df.withColumn(“col”, filter_func(“raw_col”)), I get the following error: col should be Column. What’s wrong here? What should I do? Answer: You can use the array_contains function, but…
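A sketch of the array_contains approach, assuming raw_col holds an array and using hypothetical values "a" / "found" / "not_found": a plain Python function returning a literal is not a Column expression, which is why withColumn rejects it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["a", "b"],), (["c"],)], ["raw_col"])

# array_contains builds a Column condition usable inside when()/otherwise().
df = df.withColumn(
    "col",
    F.when(F.array_contains(F.col("raw_col"), "a"), "found").otherwise("not_found"),
)
df.show()
```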
Databricks – How to pass accessToken to spark._sc._gateway.jvm.java.sql.DriverManager?
I would like to use Databricks to run some custom SQL via the function below. May I know how to add the “accessToken” as a property? It returns: Thanks! Answer: It doesn’t work because DriverManager doesn’t have a method that accepts the HashMap created from a Python dict; it has a method that accepts a Properties object. You can create an instance of…
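A sketch of building the Properties object through the Py4J gateway, assuming a Databricks notebook where `spark` is defined; `jdbc_url` and `access_token` are placeholders you would supply.

```python
# DriverManager.getConnection accepts (String url, java.util.Properties),
# not a dict/HashMap, so build a Properties instance via the JVM gateway.
gateway = spark._sc._gateway
props = gateway.jvm.java.util.Properties()
props.setProperty("accessToken", access_token)  # access_token: placeholder

driver_manager = gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, props)  # jdbc_url: placeholder
```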
How to use Selenium in Databricks, access and move downloaded files to mounted storage, and keep Chrome and ChromeDriver versions in sync?
I’ve seen a couple of posts on using Selenium in Databricks using %sh to install ChromeDriver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in Databricks. Even if I changed the download path when…
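One way this is commonly handled (a sketch, assuming ChromeDriver and Chrome are already installed via %sh, and /mnt/mystorage is a hypothetical mount): point Chrome’s download directory at the driver node’s local disk, then copy the finished file into DBFS with dbutils.

```python
from selenium import webdriver

# Downloads land on the driver node's local filesystem, not in DBFS,
# so use a local path and move the file afterwards.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_experimental_option(
    "prefs", {"download.default_directory": "/tmp/downloads"}
)

driver = webdriver.Chrome(options=options)
# ... navigate and trigger the download here ...

# Copy from the local disk into the mounted storage (file name is a placeholder).
dbutils.fs.cp("file:/tmp/downloads/report.csv", "dbfs:/mnt/mystorage/report.csv")
```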
pyspark – filter rows containing set of special characters
I have a data frame as follows: Now I want to find the total count of special characters present in each column, so I used the str.contains function to find them; it runs, but it does not find the special characters. Answer: You may want to use rlike instead of contains, which allows you to search for regular expressions…
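A self-contained sketch of the rlike approach, assuming “special character” means anything outside alphanumerics and spaces: rlike takes a regular expression, whereas contains only matches literal substrings.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc!", "1#2"), ("def", "3")], ["c1", "c2"])

# Assumed definition of "special": any character that is not alphanumeric
# or a space. Counts rows per column that contain at least one such character.
special = "[^a-zA-Z0-9 ]"
counts = df.select(
    *[F.count(F.when(F.col(c).rlike(special), c)).alias(c) for c in df.columns]
)
counts.show()
```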
Spread a List of Lists to a Spark DF with PySpark?
I’m currently struggling with the following issue. Let’s take the following list of lists: How can I create the following Spark DF out of it, with one row per element of each sublist? The only way I can get this done is by processing the list into another list with for-loops, which then basically already represents all the rows of my DF, and which is probably…
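A sketch that avoids building the rows by hand, using a hypothetical input list: a single comprehension wraps each sublist in a tuple so it becomes an array column, and explode then produces one row per element.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input list of lists.
data = [[1, 2, 3], [4, 5]]

# Each sublist becomes an array cell; explode flattens it to one row per element.
df = spark.createDataFrame([(xs,) for xs in data], ["values"])
df.select(F.explode("values").alias("value")).show()
```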
How can I generate the same UUID for multiple dataframes in spark?
I have a df that I read from a file. Then I give it a UUID column. Now I create a view, and then I create two new dataframes that take data from the view; both dataframes will use the original UUID column. All 3 dataframes end up with different UUIDs. Is there a way to keep them the same across each dataframe?
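A sketch of one common workaround: Spark’s uuid() expression is non-deterministic, so every action recomputes it; materialising the dataframe with cache() plus an action (or writing it out and reading it back) is a common way to freeze the values so derived dataframes reuse them. The file path and column names below are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# expr("uuid()") regenerates values on every recomputation of the plan;
# cache() + an action pins the computed UUIDs for downstream dataframes.
df = spark.read.csv("/path/to/file.csv", header=True)  # placeholder path
df = df.withColumn("uuid", F.expr("uuid()")).cache()
df.count()  # force materialisation so the UUIDs are computed once

df.createOrReplaceTempView("df_view")
df1 = spark.sql("SELECT uuid, col_a FROM df_view")  # col_a/col_b: placeholders
df2 = spark.sql("SELECT uuid, col_b FROM df_view")
```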
Script to get a file’s last modified date and file name in PySpark
I have a mount point location pointing to blob storage where we have multiple files. We need to find the last modified date for a file along with the file name. I am using the script below, and the list of files is as follows: Answer: If you’re using operating system-level commands to get file information, then…
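A minimal sketch using OS-level calls through the local /dbfs FUSE path; the mount name is a placeholder.

```python
import datetime
import os

# Assumed: the mount is reachable through the /dbfs fuse path on the driver.
mount = "/dbfs/mnt/blob_storage"  # placeholder mount name

for name in os.listdir(mount):
    path = os.path.join(mount, name)
    # getmtime returns the last-modified time as a Unix timestamp.
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    print(name, mtime)
```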
Importing count() data for use within Bokeh
I am trying to create a visualisation using the Bokeh package, which I have imported into the Databricks environment. I have transformed the data from a raw data frame into something resembling the following (albeit much larger): From there, I wish to create a line graph using Bokeh to show the number of papers released per month (for…
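A sketch of rendering the line graph in Databricks, assuming a hypothetical aggregated dataframe agg_df with a date-typed month column and a paper_count column (e.g. from a groupBy("month").count()): Bokeh plots local data, so convert to pandas first, then render through displayHTML.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# agg_df, month and paper_count are assumed names for the aggregated data.
pdf = agg_df.toPandas().sort_values("month")

p = figure(title="Papers released per month", x_axis_type="datetime",
           x_axis_label="month", y_axis_label="papers")
p.line(pdf["month"], pdf["paper_count"], line_width=2)

# Databricks has no native Bokeh output backend; render via displayHTML.
displayHTML(file_html(p, CDN, "papers_per_month"))
```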