I have a folder with some files inside, named in this way: I would like to get a list of all the files whose names contain a date in the format ‘YYYYMMDDHHMMSS’. So in this example I would like to get all the files except file_4_20200109999999, because the date in it does not exist. Expected output: list = [file_1_20200101235900,
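A minimal sketch of one way to do this, assuming the filenames end with a 14-digit timestamp as in the example (the exact file names below are made up); datetime.strptime rejects impossible values such as 20200109999999:

```python
import re
from datetime import datetime

filenames = [
    "file_1_20200101235900",
    "file_2_20200102120000",
    "file_3_20200103080000",
    "file_4_20200109999999",  # invalid: hour 99 does not exist
]

def has_valid_timestamp(name):
    # Pull out a 14-digit candidate at the end of the name.
    match = re.search(r"(\d{14})$", name)
    if not match:
        return False
    try:
        # strptime raises ValueError for impossible dates/times.
        datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        return True
    except ValueError:
        return False

valid_files = [name for name in filenames if has_valid_timestamp(name)]
print(valid_files)  # file_4_... is dropped
```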
Tag: pyspark
PySpark Data Visualization from String Values in Columns
I have a table which has the information shown below. From a PySpark dataframe I need to perform a data visualization by plotting the number of completed studies each month in a given year. I am of the opinion that each “completed” entry (taken from the status column) will be matched against each of the months of the
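A rough sketch of the aggregation step, assuming hypothetical column names status and completion_date; the small monthly result can then be collected or converted with toPandas() for plotting:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per study, with a status and a completion date.
df = spark.createDataFrame(
    [("s1", "Completed", "2021-01-15"),
     ("s2", "Completed", "2021-01-20"),
     ("s3", "Ongoing",   "2021-02-03"),
     ("s4", "Completed", "2021-03-11")],
    ["study_id", "status", "completion_date"],
)

monthly = (
    df.filter(F.col("status") == "Completed")
      .withColumn("year", F.year(F.to_date("completion_date")))
      .withColumn("month", F.month(F.to_date("completion_date")))
      .filter(F.col("year") == 2021)
      .groupBy("month")
      .count()
      .orderBy("month")
)

# monthly.toPandas() gives a small result that can be plotted with matplotlib.
monthly.show()
```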
Using .withColumn on all remaining columns in DF
I want to anonymize or replace almost all columns in a PySpark dataframe, except a few. I know it’s possible to do something like: However, doing this for all columns is a tedious process. I would rather do something along the lines of this: This does however not seem to work. Are there other workarounds that
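One common workaround is to build a single select with a list comprehension instead of chaining .withColumn per column; the dataframe, the kept columns and the placeholder value below are all assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe; "id" and "country" are the columns to leave untouched.
df = spark.createDataFrame(
    [(1, "NL", "Alice", "a@example.com")],
    ["id", "country", "name", "email"],
)

keep_cols = ["id", "country"]

# Replace every column not in keep_cols with a constant placeholder,
# building one select instead of one withColumn call per column.
anonymized = df.select(
    [F.col(c) if c in keep_cols else F.lit("***").alias(c) for c in df.columns]
)
anonymized.show()
```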
Python: How to move files in a structured folder based on year/month/date format?
Currently I have a Spark job that reads the file, creates a dataframe, does some transformations and then moves those records into a “year/month/date” folder structure. I am achieving this by: I want to achieve the same in a Pythonic way. So, in the end it should look like: Answer Based on your question, instead of using partitionBy you can also modify
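A small sketch of the pure-Python side, assuming a year/month/day target layout and made-up source and destination paths; pathlib.Path.mkdir plus shutil.move does the work:

```python
import shutil
from datetime import datetime
from pathlib import Path

src_dir = Path("/tmp/incoming")      # assumed source folder
dest_root = Path("/tmp/archive")     # assumed destination root

for src_file in src_dir.glob("*.csv"):
    # Here the date is taken from the file's modification time; in practice it
    # could come from the record contents or the file name instead.
    dt = datetime.fromtimestamp(src_file.stat().st_mtime)
    target_dir = dest_root / f"{dt:%Y}" / f"{dt:%m}" / f"{dt:%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src_file), str(target_dir / src_file.name))
```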
Filter pyspark DataFrame by string match
I would like to check for a substring match between the comments and keyword columns and find whether any one of the keywords is present in that particular row. input expected output Answer The most efficient approach here is to loop; you can use set intersection: Output: Used input: With a minor variation you could check for a substring match (“activ” would match “activateds”): Output:
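A hedged PySpark sketch of the same idea, with made-up data and column names, using a Python UDF that does a set-intersection check per row:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: a free-text comment and a comma-separated keyword list per row.
df = spark.createDataFrame(
    [("the app was activated today", "activated,login"),
     ("payment failed twice",        "refund,chargeback")],
    ["comments", "keyword"],
)

@F.udf("boolean")
def any_keyword_match(comments, keywords):
    # Exact word match via set intersection; for substring matching
    # ("activ" matching "activated") use: any(k in comments.lower() for k in kws)
    kws = {k.strip().lower() for k in keywords.split(",")}
    words = set(comments.lower().split())
    return bool(kws & words)

df.withColumn("match", any_keyword_match("comments", "keyword")).show(truncate=False)
```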
Most efficient way of applying a function based on condition
Suppose we have a master dictionary master_dict = {“a”: df1, “b”: df2, “c”: df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys. What is the best way to get the below code to work when the length of
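Since the original code is not shown, only a generic sketch is possible; the stand-in values for df1/df2/df3, the contents of condition_list and the body of func below are all placeholder assumptions:

```python
# Hypothetical stand-ins; in the question df1/df2/df3 are dataframes.
df1, df2, df3 = "df1", "df2", "df3"
master_dict = {"a": df1, "b": df2, "c": df3}
condition_list = ["a", "c"]

def func(d, key):
    # Pretend transformation: returns the original dict plus a derived key.
    new = dict(d)
    new[f"{key}_derived"] = d[key]
    return new

# Apply func once per condition, feeding the updated dict back in each time
# so keys added by earlier iterations remain available to later ones.
for key in condition_list:
    master_dict = func(master_dict, key)

print(sorted(master_dict))  # ['a', 'a_derived', 'b', 'c', 'c_derived']
```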
How do I correctly add worker nodes to my cluster?
I am trying to create a cluster on Google Cloud with the following parameters: 1 master and 7 worker nodes, each of them with 1 vCPU. The master node should get the full SSD capacity and the worker nodes should get equal shares of the standard disk capacity. This is my code: This is my error: Updated attempt: I don’t follow what I
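Without the original code and error, only a general sketch is possible. The snippet below uses the google-cloud-dataproc Python client; the project ID, cluster name, region, machine types and disk sizes are all placeholder assumptions:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # assumed region
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",     # assumed project
    "cluster_name": "my-cluster",   # assumed name
    "config": {
        # 1 master on SSD, 7 workers sharing standard disks, 1 vCPU each.
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-1",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 7,
            "machine_type_uri": "n1-standard-1",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result())
```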
Pyspark: regex search with text in a list withColumn
I am new to Spark and I am having a silly “what’s-the-best-approach” issue. Basically, I have a map (dict) that I would like to loop over. During each iteration, I want to search through a column in a Spark dataframe using an rlike regex and assign the key of the dict to a new column using withColumn. The data sample is shown
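A minimal sketch of the loop-over-dict approach, assuming a hypothetical dict mapping labels to regex patterns and a text column named description; chaining F.when keeps everything in a single new column instead of overwriting it on every iteration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("error: connection timed out",), ("user login ok",), ("disk almost full",)],
    ["description"],
)

# Hypothetical mapping of label -> regex pattern.
patterns = {
    "network": "timed out|unreachable",
    "auth": "login|password",
    "storage": "disk|quota",
}

# Build one chained when(...) expression across all patterns.
label_col = F.lit(None)
for key, pattern in patterns.items():
    label_col = F.when(F.col("description").rlike(pattern), F.lit(key)).otherwise(label_col)

df.withColumn("label", label_col).show(truncate=False)
```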
How to select rows from list in PySpark
Suppose we have two dataframes df1 and df2 where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = [‘a’, ‘b’, ‘c’]. How do you join the two dataframes using the common_cols list within a sql command? The code below
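One way to sketch this, with made-up data: build the USING clause from common_cols and interpolate it into the spark.sql string, so the join columns never need to be hard-coded:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 2, 3, "p", "q", "r")], ["a", "b", "c", "p", "q", "r"])
df2 = spark.createDataFrame([("d", "e", "f", 1, 2, 3)], ["d", "e", "f", "a", "b", "c"])
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

common_cols = ["a", "b", "c"]

# Build the join condition from the list rather than hard-coding it.
using_clause = ", ".join(common_cols)
joined = spark.sql(f"SELECT * FROM df1 JOIN df2 USING ({using_clause})")
joined.show()
```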
How to convert JSON data inside a spark dataframe into new columns
I have a Spark dataframe like: I want to convert the JSON (string) to new columns. I don’t want to manually specify the keys from the JSON, as there are more than 100 keys. Answer
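A sketch of one common approach: infer the schema from a sample value with F.schema_of_json, then expand with from_json and a star select. The column name json_col and the sample payload below are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1", '{"name": "alice", "age": 31, "city": "NYC"}')],
    ["id", "json_col"],
)

# Infer the JSON schema from one sample row instead of typing 100+ keys by hand.
sample = df.select("json_col").first()[0]
schema = F.schema_of_json(F.lit(sample))

df_expanded = (
    df.withColumn("parsed", F.from_json("json_col", schema))
      .select("id", "parsed.*")
)
df_expanded.show()
```

Note that the inferred schema only covers keys present in the sampled row; if different rows carry different keys, reading the column with spark.read.json over all rows gives a more complete schema.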