Tag: apache-spark

PySpark – Selecting all rows within each group

I have a dataframe similar to the one below. From the above dataframe, I would like to keep all rows up to the most recent sale relative to the date, so that each row ends up with a unique date. For the above example, the output would look like: Can you please guide me on how to get to this result?

How to select rows from list in PySpark

Suppose we have two dataframes df1 and df2 where df1 has columns [a, b, c, p, q, r] and df2 has columns [d, e, f, a, b, c]. Suppose the common columns are stored in a list common_cols = ['a', 'b', 'c']. How do you join the two dataframes using the common_cols list within a …