Given a Spark dataframe with the following columns, I am trying to construct an incremental/running count for each id based on when the contents of the event column evaluate to True. Here, a new column called results would be created containing the incremental count. I’ve tried using window functions but am stumped at this point. Ideally, the solution would…
Tag: pyspark
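A minimal sketch of the running count with a window function, assuming hypothetical columns id, ts (for ordering), and a boolean event column: cast the boolean to an integer and take a cumulative sum over a window partitioned by id.

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, True), (1, 2, False), (1, 3, True), (2, 1, True)],
    ["id", "ts", "event"],  # ts is an assumed ordering column
)

# Cumulative window per id, ordered by the (assumed) ts column.
w = Window.partitionBy("id").orderBy("ts").rowsBetween(
    Window.unboundedPreceding, Window.currentRow
)

# Running count of True events: cast the boolean to int and sum it.
df.withColumn("results", F.sum(F.col("event").cast("int")).over(w)).show()
```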
Combining WHEN and aggregation functions
I need to convert this PySpark SQL code sample into fully DataFrame-based code without SQL expressions. I tried, but got: TypeError: condition should be a Column. Obviously, it’s not working. What am I doing wrong? Any suggestion will be appreciated! Answer: Use isNull to check for null, not Python’s is None:
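A self-contained sketch of the isNull fix, with a hypothetical some_col column: writing `F.col("some_col") is None` is a Python identity test that returns a plain bool, which is exactly what triggers the TypeError.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, None), (1, "x"), (2, None)], ["id", "some_col"])

# isNull() builds the Column expression that when() expects;
# `... is None` evaluates in Python and yields a bool, not a Column.
agg = df.groupBy("id").agg(
    F.count(F.when(F.col("some_col").isNull(), 1)).alias("null_count")
)
agg.show()
```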
Filter expected value from list in df column
I have a data frame with the following column: I want to return a column with a single value based on a conditional statement. I wrote the following function: When running the function on the column with df.withColumn(“col”, filter_func(“raw_col”)), I get the following error: col should be Column. What’s wrong here? What should I do? Answer: You can use the array_contains function, but…
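A sketch of the array_contains approach, assuming raw_col holds an array and using hypothetical values "a" / "found" / "not_found": a plain Python function returning a literal is not a Column expression, which is why withColumn rejects it.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(["a", "b"],), (["c"],)], ["raw_col"])

# array_contains builds a Column condition usable inside when()/otherwise().
df = df.withColumn(
    "col",
    F.when(F.array_contains(F.col("raw_col"), "a"), "found").otherwise("not_found"),
)
df.show()
```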
Databricks – How to pass accessToken to spark._sc._gateway.jvm.java.sql.DriverManager?
I would like to use Databricks to run some custom SQL via the function below. May I know how to add the “accessToken” as a property? It returns: Thanks! Answer: It doesn’t work because DriverManager doesn’t have a method that accepts the HashMap created from a Python dict; it has a method that accepts a Properties object. You can create an instance of…
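A sketch of building the Properties object through the Py4J gateway, assuming a Databricks notebook where `spark` is defined; `jdbc_url` and `access_token` are placeholders you would supply.

```python
# DriverManager.getConnection accepts (String url, java.util.Properties),
# not a dict/HashMap, so build a Properties instance via the JVM gateway.
gateway = spark._sc._gateway
props = gateway.jvm.java.util.Properties()
props.setProperty("accessToken", access_token)  # access_token: placeholder

driver_manager = gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, props)  # jdbc_url: placeholder
```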
How to use Selenium in Databricks, access and move downloaded files to mounted storage, and keep Chrome and ChromeDriver versions in sync?
I’ve seen a couple of posts on using Selenium in Databricks using %sh to install ChromeDriver and Chrome. This works fine for me, but I had a lot of trouble when I needed to download a file. The file would download, but I could not find it in the filesystem in Databricks. Even if I changed the download path when…
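One way this is commonly handled (a sketch, assuming ChromeDriver and Chrome are already installed via %sh, and /mnt/mystorage is a hypothetical mount): point Chrome’s download directory at the driver node’s local disk, then copy the finished file into DBFS with dbutils.

```python
from selenium import webdriver

# Downloads land on the driver node's local filesystem, not in DBFS,
# so use a local path and move the file afterwards.
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_experimental_option(
    "prefs", {"download.default_directory": "/tmp/downloads"}
)

driver = webdriver.Chrome(options=options)
# ... navigate and trigger the download here ...

# Copy from the local disk into the mounted storage (file name is a placeholder).
dbutils.fs.cp("file:/tmp/downloads/report.csv", "dbfs:/mnt/mystorage/report.csv")
```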
pyspark – filter rows containing set of special characters
I have a data frame as follows: Now I want to find the total count of special characters present in each column, so I used the str.contains function to find them; it runs, but it does not find the special characters. Answer: You may want to use rlike instead of contains, which allows you to search for regular expressions…
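A self-contained sketch of the rlike approach, assuming “special character” means anything outside alphanumerics and spaces: rlike takes a regular expression, whereas contains only matches literal substrings.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc!", "1#2"), ("def", "3")], ["c1", "c2"])

# Assumed definition of "special": any character that is not alphanumeric
# or a space. Counts rows per column that contain at least one such character.
special = "[^a-zA-Z0-9 ]"
counts = df.select(
    *[F.count(F.when(F.col(c).rlike(special), c)).alias(c) for c in df.columns]
)
counts.show()
```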
Spread a List of Lists to a Spark DF with PySpark?
I’m currently struggling with the following issue. Let’s take the following list of lists: How can I create the following Spark DF out of it, with one row per element of each sublist? The only way I can get this done is by processing the list into another list with for-loops, which then basically already represents all the rows of my DF, and which is probably…
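A sketch that avoids building the rows by hand, using a hypothetical input list: a single comprehension wraps each sublist in a tuple so it becomes an array column, and explode then produces one row per element.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input list of lists.
data = [[1, 2, 3], [4, 5]]

# Each sublist becomes an array cell; explode flattens it to one row per element.
df = spark.createDataFrame([(xs,) for xs in data], ["values"])
df.select(F.explode("values").alias("value")).show()
```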
How can I generate the same UUID for multiple dataframes in spark?
I have a df that I read from a file. Then I give it a UUID column. Now I create a view, and then I create two new dataframes that take data from the view; both dataframes will use the original UUID column. All 3 dataframes end up with different UUIDs. Is there a way to keep them the same across each dataframe?
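A sketch of one common workaround: Spark’s uuid() expression is non-deterministic, so every action recomputes it; materialising the dataframe with cache() plus an action (or writing it out and reading it back) is a common way to freeze the values so derived dataframes reuse them. The file path and column names below are placeholders.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# expr("uuid()") regenerates values on every recomputation of the plan;
# cache() + an action pins the computed UUIDs for downstream dataframes.
df = spark.read.csv("/path/to/file.csv", header=True)  # placeholder path
df = df.withColumn("uuid", F.expr("uuid()")).cache()
df.count()  # force materialisation so the UUIDs are computed once

df.createOrReplaceTempView("df_view")
df1 = spark.sql("SELECT uuid, col_a FROM df_view")  # col_a/col_b: placeholders
df2 = spark.sql("SELECT uuid, col_b FROM df_view")
```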
Script to get a file’s last modified date and file name in PySpark
I have a mount point location pointing to blob storage where we have multiple files. We need to find the last modified date for a file along with the file name. I am using the script below, and the list of files is as follows: Answer: If you’re using operating system-level commands to get file information, then…
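A minimal sketch using OS-level calls through the local /dbfs FUSE path; the mount name is a placeholder.

```python
import datetime
import os

# Assumed: the mount is reachable through the /dbfs fuse path on the driver.
mount = "/dbfs/mnt/blob_storage"  # placeholder mount name

for name in os.listdir(mount):
    path = os.path.join(mount, name)
    # getmtime returns the last-modified time as a Unix timestamp.
    mtime = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    print(name, mtime)
```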
Importing count() data for use within Bokeh
I am trying to create a visualisation using the Bokeh package, which I have imported into the Databricks environment. I have transformed the data from a raw data frame into something resembling the following (albeit much larger): From there, I wish to create a line graph using Bokeh to show the number of papers released per month (for…
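A sketch of rendering the line graph in Databricks, assuming a hypothetical aggregated dataframe agg_df with a date-typed month column and a paper_count column (e.g. from a groupBy("month").count()): Bokeh plots local data, so convert to pandas first, then render through displayHTML.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# agg_df, month and paper_count are assumed names for the aggregated data.
pdf = agg_df.toPandas().sort_values("month")

p = figure(title="Papers released per month", x_axis_type="datetime",
           x_axis_label="month", y_axis_label="papers")
p.line(pdf["month"], pdf["paper_count"], line_width=2)

# Databricks has no native Bokeh output backend; render via displayHTML.
displayHTML(file_html(p, CDN, "papers_per_month"))
```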