
Tag: pyspark

PySpark – Combine a list of filtering conditions

For starters, let me define a sample dataframe and import the sql functions: This returns the following dataframe: Now let's say I have a list of filtering conditions, for example a list detailing that columns A and B shall each be equal to 1. I can combine these two conditions as follows an…
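The excerpt is truncated, but a minimal sketch of the idea, assuming a toy dataframe with columns A and B and a logical AND between the conditions, might look like this:

```python
from functools import reduce
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample dataframe with columns A and B
df = spark.createDataFrame([(1, 1), (1, 2), (2, 1)], ["A", "B"])

# A list of filtering conditions: A == 1 and B == 1
conditions = [F.col("A") == 1, F.col("B") == 1]

# Combine the conditions with a logical AND using reduce
combined = reduce(lambda a, b: a & b, conditions)

df.filter(combined).show()
```

Swapping `&` for `|` in the reduce would combine the same list with a logical OR instead.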

I want to groupby id and count the unique grade and return max

I have this data and am trying to solve the following questions. DataFrame_from_Scratch = spark.createDataFrame(values, columns) DataFrame_from_Scratch.show() Group by id and count the unique grades; what is the maximum? Group by id and date; how many unique dates are there? Answer Your implementation for the 1st question is…
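The answer itself is cut off, but a minimal sketch of the two aggregations, assuming hypothetical `id`, `date` and `grade` columns, might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data with id, date and grade columns
values = [(1, "2021-01-01", "A"), (1, "2021-01-01", "B"), (2, "2021-01-02", "A")]
columns = ["id", "date", "grade"]
DataFrame_from_Scratch = spark.createDataFrame(values, columns)

# 1) Count distinct grades per id, then take the maximum of those counts
grade_counts = DataFrame_from_Scratch.groupBy("id").agg(
    F.countDistinct("grade").alias("n_grades")
)
grade_counts.agg(F.max("n_grades").alias("max_n_grades")).show()

# 2) Count how many distinct dates each id has
DataFrame_from_Scratch.groupBy("id").agg(
    F.countDistinct("date").alias("n_dates")
).show()
```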

using map/reduce on lists of lists

I have a very large list of lists, and I want to use map/reduce techniques (in Python/PySpark), in an efficient way, to calculate the PageRank of the network made of the elements in the list of lists, where sharing a list means a link between them. I have no clue how to deal with the elements in the lists becau…
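The post is truncated before any answer, but one possible sketch, assuming a small hypothetical list of lists and the classic RDD-based PageRank iteration, might look like this:

```python
from itertools import combinations
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Hypothetical list of lists; elements that share a list are linked
lists = [["a", "b", "c"], ["b", "d"], ["c", "d", "e"]]

# Build undirected edges from every pair of elements that co-occur in a list
edges = (
    sc.parallelize(lists)
      .flatMap(lambda xs: combinations(sorted(set(xs)), 2))
      .distinct()
      .flatMap(lambda e: [e, (e[1], e[0])])  # add both directions
)

links = edges.groupByKey().mapValues(list).cache()
ranks = links.mapValues(lambda _: 1.0)

# A few PageRank iterations with damping factor 0.85
for _ in range(10):
    contribs = links.join(ranks).flatMap(
        lambda kv: [(dst, kv[1][1] / len(kv[1][0])) for dst in kv[1][0]]
    )
    ranks = contribs.reduceByKey(lambda a, b: a + b).mapValues(
        lambda r: 0.15 + 0.85 * r
    )

print(ranks.collect())
```

For a genuinely large network, GraphFrames' built-in `pageRank` would likely be preferable to hand-rolled iterations, but the sketch shows the map/reduce structure the question asks about.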

Caching a PySpark Dataframe

Suppose we have a PySpark dataframe df with ~10M rows, and let the columns be [col_a, col_b]. Which would be faster: or Would caching df_test make sense here? Answer It won't make much difference. It is just one loop where you can skip cache, like below. Here Spark is loading the data once in memory. If you …
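The two alternatives being compared are lost in the truncation, but a minimal sketch of the general pattern, assuming a hypothetical 10M-row dataframe that is reused by more than one action, might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with columns col_a and col_b
df = spark.range(0, 10_000_000).select(
    (F.col("id") % 100).alias("col_a"),
    (F.col("id") % 7).alias("col_b"),
)

# Without cache(), each action below recomputes df from scratch;
# with cache(), the first action materialises it in memory and the
# second action reuses the cached data.
df.cache()
print(df.filter(F.col("col_a") == 1).count())
print(df.groupBy("col_b").count().collect())
df.unpersist()
```

As the answer notes, if the dataframe is only consumed once there is nothing to reuse, so caching buys little; it pays off when several actions read the same dataframe.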