
Tag: pyspark

Get the list of files inside a folder that have a date inside their name

I have a folder with some files inside, named in this way: I would like to get a list containing all the files that have a date in the format ‘YYYYMMDDHHMMSS’ in their name. So in this example I would like to get all the files except file_4_20200109999999, because the date in it does not exist. Expected output: list =[file_1_20200101235900,
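A minimal sketch (not taken from the post) of one way to do this: scan the folder for names that contain a 14-digit block and keep only those where the block parses as a real timestamp. The folder path and helper name are assumptions for illustration.

import os
import re
from datetime import datetime

def files_with_valid_date(folder):
    # Return file names whose 14-digit block is a valid YYYYMMDDHHMMSS date
    valid = []
    for name in os.listdir(folder):
        match = re.search(r"\d{14}", name)
        if not match:
            continue
        try:
            # strptime rejects impossible dates such as hour 99
            datetime.strptime(match.group(0), "%Y%m%d%H%M%S")
            valid.append(name)
        except ValueError:
            pass
    return valid

# e.g. files_with_valid_date("/path/to/folder") keeps file_1_20200101235900
# and drops file_4_20200109999999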

Filter pyspark DataFrame by string match

I would like to check for a substring match between the comments and keyword columns and find whether any of the keywords is present in that particular row. Input Expected output Answer The most efficient approach here is to loop; you can use set intersection: Output: Used input: With a minor variation you could check for a substring match (“activ” would match “activateds”): Output:
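The excerpt does not show the answer's actual code, but a minimal sketch of the set-intersection idea in PySpark (the column names "comments" and "keywords" and the sample rows are assumptions) could use arrays_overlap between the comment's words and the keyword array:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("the service was activated quickly", ["activated", "cancel"]),
     ("no issues reported", ["refund"])],
    ["comments", "keywords"],
)

# True when the comment's words and the keyword array share at least one element
df = df.withColumn(
    "match",
    F.arrays_overlap(F.split(F.lower(F.col("comments")), r"\s+"), F.col("keywords")),
)
df.show(truncate=False)

For the substring variant mentioned above (“activ” matching “activateds”), a per-keyword contains check would replace the whole-word intersection.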

Most efficient way of applying a function based on condition

Suppose we have a master dictionary master_dict = {“a”: df1, “b”: df2, “c”: df3}. Now suppose we have a list called condition_list. Suppose func is a function that returns a new dictionary that has the original keys of master_dict along with potentially new keys. What is the best way to get the below code to work when the length of
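Since the excerpt cuts off before the code, here is only a hypothetical sketch of the setup it describes: apply func once per entry in condition_list and merge each returned dictionary back into master_dict. All names come from the question as placeholders; the actual logic of func is not shown.

def apply_conditions(master_dict, condition_list, func):
    # func returns a dict with master_dict's original keys plus possibly new ones
    for condition in condition_list:
        master_dict = {**master_dict, **func(master_dict, condition)}
    return master_dict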

Pyspark: regex search with text in a list withColumn

I am new to Spark and am having a silly “what’s-the-best-approach” issue. Basically, I have a map (dict) that I would like to loop over. During each iteration, I want to search through a column in a Spark dataframe using an rlike regex and assign the key of the dict to a new column using withColumn. The data sample is shown
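A minimal sketch of that pattern (the column name, dict contents, and labels are assumptions): loop over the dict and combine rlike with when/otherwise, so each matching row gets the dict key and unmatched rows keep their previous value.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("error: disk full",), ("login ok",)], ["message"])

patterns = {"storage": r"disk|quota", "auth": r"login|password"}

# Start with a null label, then overwrite it for each pattern that matches
df = df.withColumn("label", F.lit(None).cast("string"))
for key, pattern in patterns.items():
    df = df.withColumn(
        "label",
        F.when(F.col("message").rlike(pattern), F.lit(key)).otherwise(F.col("label")),
    )
df.show(truncate=False)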

How to select rows from list in PySpark

Suppose we have two dataframes df1 and df2, where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = [‘a’, ‘b’, ‘c’]. How do you join the two dataframes using the common_cols list within a SQL command? The code below
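The post's own code is cut off here, but a minimal sketch of one way to do it (sample data invented for illustration): register both dataframes as temp views and build the ON clause from common_cols before passing it to spark.sql.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame([(1, 2, 3, 4, 5, 6)], ["a", "b", "c", "p", "q", "r"])
df2 = spark.createDataFrame([(7, 8, 9, 1, 2, 3)], ["d", "e", "f", "a", "b", "c"])
common_cols = ["a", "b", "c"]

df1.createOrReplaceTempView("df1")
df2.createOrReplaceTempView("df2")

# Build "df1.a = df2.a AND df1.b = df2.b AND df1.c = df2.c" from the list
on_clause = " AND ".join(f"df1.{c} = df2.{c}" for c in common_cols)
joined = spark.sql(f"SELECT df1.*, df2.d, df2.e, df2.f FROM df1 JOIN df2 ON {on_clause}")
joined.show()

If SQL is not required, df1.join(df2, on=common_cols) does the same thing through the DataFrame API without duplicating the join columns.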
