I have a folder with some files inside, named in this way: I would like to get a list of all the files whose names contain a date in the format ‘YYYYMMDDHHMMSS’. So in this example I would like to get all the files except file_4_20200109999999, because the date in it does not exist. Expected output: list = [file_1_20200101235900,
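A minimal sketch of one way to do this, assuming the filenames end with a 14-digit timestamp as in the example (the exact file names below are made up); datetime.strptime rejects impossible values such as 20200109999999:

```python
import re
from datetime import datetime

filenames = [
    "file_1_20200101235900",
    "file_2_20200102120000",
    "file_3_20200103080000",
    "file_4_20200109999999",  # invalid: hour 99 does not exist
]

def has_valid_timestamp(name):
    # Pull out a 14-digit candidate at the end of the name.
    match = re.search(r"(\d{14})$", name)
    if not match:
        return False
    try:
        # strptime raises ValueError for impossible dates/times.
        datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
        return True
    except ValueError:
        return False

valid_files = [name for name in filenames if has_valid_timestamp(name)]
print(valid_files)  # file_4_... is dropped
```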
Tag: pyspark
PySpark Data Visualization from String Values in Columns
I have a table which has the information shown below. From a PySpark dataframe I need to perform a data visualization by plotting the number of completed studies each month in a given year. I am of the opinion that each “completed” entry (taken from the status column) will be matched against each of the months of the
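A rough sketch of the aggregation step, assuming hypothetical column names status and completion_date; the small monthly result can then be collected or converted with toPandas() for plotting:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one row per study, with a status and a completion date.
df = spark.createDataFrame(
    [("s1", "Completed", "2021-01-15"),
     ("s2", "Completed", "2021-01-20"),
     ("s3", "Ongoing",   "2021-02-03"),
     ("s4", "Completed", "2021-03-11")],
    ["study_id", "status", "completion_date"],
)

monthly = (
    df.filter(F.col("status") == "Completed")
      .withColumn("year", F.year(F.to_date("completion_date")))
      .withColumn("month", F.month(F.to_date("completion_date")))
      .filter(F.col("year") == 2021)
      .groupBy("month")
      .count()
      .orderBy("month")
)

# monthly.toPandas() gives a small result that can be plotted with matplotlib.
monthly.show()
```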
Using .withColumn on all remaining columns in DF
I want to anonymize or replace almost all columns in a PySpark dataframe, except a few. I know it’s possible to do something like: However, doing this for all columns is a tedious process. I would rather do something along the lines of this: This does however not seem to work. Are there other workarounds that
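One common workaround is to build a single select with a list comprehension instead of chaining .withColumn per column; the dataframe, the kept columns and the placeholder value below are all assumptions for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe; "id" and "country" are the columns to leave untouched.
df = spark.createDataFrame(
    [(1, "NL", "Alice", "a@example.com")],
    ["id", "country", "name", "email"],
)

keep_cols = ["id", "country"]

# Replace every column not in keep_cols with a constant placeholder,
# building one select instead of one withColumn call per column.
anonymized = df.select(
    [F.col(c) if c in keep_cols else F.lit("***").alias(c) for c in df.columns]
)
anonymized.show()
```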
Python: How to move files in a structured folder based on year/month/date format?
Currently I have a Spark job that reads the file, creates a dataframe, does some transformations and then moves those records into a “year/month/date” folder structure. I am achieving this by: I want to achieve the same in a Pythonic way. So, in the end it should look like: Answer Based on your question, instead of using partitionBy you can also modify
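A small sketch of the pure-Python side, assuming a year/month/day target layout and made-up source and destination paths; pathlib.Path.mkdir plus shutil.move does the work:

```python
import shutil
from datetime import datetime
from pathlib import Path

src_dir = Path("/tmp/incoming")      # assumed source folder
dest_root = Path("/tmp/archive")     # assumed destination root

for src_file in src_dir.glob("*.csv"):
    # Here the date is taken from the file's modification time; in practice it
    # could come from the record contents or the file name instead.
    dt = datetime.fromtimestamp(src_file.stat().st_mtime)
    target_dir = dest_root / f"{dt:%Y}" / f"{dt:%m}" / f"{dt:%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src_file), str(target_dir / src_file.name))
```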
Filter pyspark DataFrame by string match
I would like to check for a substring match between the comments and keyword columns and find whether any one of the keywords is present in that particular row. input expected output Answer The most efficient approach here is to loop; you can use set intersection: Output: Used input: With a minor variation you could check for a substring match (“activ” would match “activateds”): Output:
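A hedged PySpark sketch of the same idea, with made-up data and column names, using a Python UDF that does a set-intersection check per row:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: a free-text comment and a comma-separated keyword list per row.
df = spark.createDataFrame(
    [("the app was activated today", "activated,login"),
     ("payment failed twice",        "refund,chargeback")],
    ["comments", "keyword"],
)

@F.udf("boolean")
def any_keyword_match(comments, keywords):
    # Exact word match via set intersection; for substring matching
    # ("activ" matching "activated") use: any(k in comments.lower() for k in kws)
    kws = {k.strip().lower() for k in keywords.split(",")}
    words = set(comments.lower().split())
    return bool(kws & words)

df.withColumn("match", any_keyword_match("comments", "keyword")).show(truncate=False)
```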
Most efficient way of applying a function based on condition
Suppose we have a master dictionary master_dict = {“a”: df1, “b”: df2, “c”: df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys. What is the best way to get the below code to work when the length of
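Since the original code is not shown, only a generic sketch is possible; the stand-in values for df1/df2/df3, the contents of condition_list and the body of func below are all placeholder assumptions:

```python
# Hypothetical stand-ins; in the question df1/df2/df3 are dataframes.
df1, df2, df3 = "df1", "df2", "df3"
master_dict = {"a": df1, "b": df2, "c": df3}
condition_list = ["a", "c"]

def func(d, key):
    # Pretend transformation: returns the original dict plus a derived key.
    new = dict(d)
    new[f"{key}_derived"] = d[key]
    return new

# Apply func once per condition, feeding the updated dict back in each time
# so keys added by earlier iterations remain available to later ones.
for key in condition_list:
    master_dict = func(master_dict, key)

print(sorted(master_dict))  # ['a', 'a_derived', 'b', 'c', 'c_derived']
```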
How do I correctly add worker nodes to my cluster?
I am trying to create a cluster on Google Cloud with the following parameters: 1 master and 7 worker nodes, each of them with 1 vCPU. The master node should get the full SSD capacity and the worker nodes should get equal shares of the standard disk capacity. This is my code: This is my error: Updated attempt: I don’t follow what I
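Without the original code and error, only a general sketch is possible. The snippet below uses the google-cloud-dataproc Python client; the project ID, cluster name, region, machine types and disk sizes are all placeholder assumptions:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # assumed region
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": "my-project",     # assumed project
    "cluster_name": "my-cluster",   # assumed name
    "config": {
        # 1 master on SSD, 7 workers sharing standard disks, 1 vCPU each.
        "master_config": {
            "num_instances": 1,
            "machine_type_uri": "n1-standard-1",
            "disk_config": {"boot_disk_type": "pd-ssd", "boot_disk_size_gb": 500},
        },
        "worker_config": {
            "num_instances": 7,
            "machine_type_uri": "n1-standard-1",
            "disk_config": {"boot_disk_type": "pd-standard", "boot_disk_size_gb": 100},
        },
    },
}

operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(operation.result())
```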
Pyspark: regex search with text in a list withColumn
I am new to Spark and I am having a silly “what’s-the-best-approach” issue. Basically, I have a map (dict) that I would like to loop over. During each iteration, I want to search through a column in a Spark dataframe using an rlike regex and assign the key of the dict to a new column using withColumn. The data sample is shown
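A minimal sketch of the loop-over-dict approach, assuming a hypothetical dict mapping labels to regex patterns and a text column named description; chaining F.when keeps everything in a single new column instead of overwriting it on every iteration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("error: connection timed out",), ("user login ok",), ("disk almost full",)],
    ["description"],
)

# Hypothetical mapping of label -> regex pattern.
patterns = {
    "network": "timed out|unreachable",
    "auth": "login|password",
    "storage": "disk|quota",
}

# Build one chained when(...) expression across all patterns.
label_col = F.lit(None)
for key, pattern in patterns.items():
    label_col = F.when(F.col("description").rlike(pattern), F.lit(key)).otherwise(label_col)

df.withColumn("label", label_col).show(truncate=False)
```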
How to select rows from list in PySpark
Suppose we have two dataframes df1 and df2 where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = [‘a’, ‘b’, ‘c’]. How do you join the two dataframes using the common_cols list within a sql command? The code below
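One way to sketch this, with made-up data: build the USING clause from common_cols and interpolate it into the spark.sql string, so the join columns never need to be hard-coded:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, 2, 3, "p", "q", "r")], ["a", "b", "c", "p", "q", "r"])
df2 = spark.createDataFrame([("d", "e", "f", 1, 2, 3)], ["d", "e", "f", "a", "b", "c"])
df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

common_cols = ["a", "b", "c"]

# Build the join condition from the list rather than hard-coding it.
using_clause = ", ".join(common_cols)
joined = spark.sql(f"SELECT * FROM df1 JOIN df2 USING ({using_clause})")
joined.show()
```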
How to convert JSON data inside a spark dataframe into new columns
I have a Spark dataframe like: I want to convert the JSON (string) to new columns. I don’t want to manually specify the keys from the JSON, as there are more than 100 keys. Answer
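A sketch of one common approach: infer the schema from a sample value with F.schema_of_json, then expand with from_json and a star select. The column name json_col and the sample payload below are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("1", '{"name": "alice", "age": 31, "city": "NYC"}')],
    ["id", "json_col"],
)

# Infer the JSON schema from one sample row instead of typing 100+ keys by hand.
sample = df.select("json_col").first()[0]
schema = F.schema_of_json(F.lit(sample))

df_expanded = (
    df.withColumn("parsed", F.from_json("json_col", schema))
      .select("id", "parsed.*")
)
df_expanded.show()
```

Note that the inferred schema only covers keys present in the sampled row; if different rows carry different keys, reading the column with spark.read.json over all rows gives a more complete schema.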