I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark DataFrame, and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past numberDays. I’ve tried several possibilities, but nothing seems to work. Here is my most recent attempt:
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(F.min(F.col("collection_date"))),
    )
    >= numberDays
)
But I’ve also tried:
df_new = df.withColumn("days", df.select("collection_date") - df.select("collection_date").min())
and
df_new = df.withColumn("days", df.select("collection_date") - df.select(F.min("collection_date")))
There are probably a few others, but I can’t seem to get this to work, although I’m sure there’s an incredibly simple answer.
Answer
I found a solution that I don’t really care for, but it appears to work.
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(df.agg(F.min(df["collection_date"])).collect()[0][0]),
    )
    >= numberDays
)
I don’t think it’s particularly good practice to put a collect() operation in the middle of the code, but this works. If anyone has a more “Sparky” solution, please let me know.
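For readability, the same collect()-based approach can be split into two steps, first pulling the minimum date back to the driver as a plain Python value and then filtering against it (a sketch using the same column name and numberDays variable, not a different technique):

# Bring the minimum date back to the driver as a scalar, then compare against it.
min_date = df.agg(F.min("collection_date")).collect()[0][0]
df = df.filter(F.datediff(F.col("collection_date"), F.lit(min_date)) >= numberDays)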
EDIT 3/21/2022: Here is a more Spark-y way of doing this:
df = (
    df
    .sort(F.col("collection_date").asc())
    .filter(
        F.datediff(
            F.col("collection_date"),
            F.lit(df.select(F.min("collection_date")).first()["min(collection_date)"]),
        )
        >= numberDays
    )
)
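For completeness, one way to avoid running an action (collect() or first()) just to get the minimum is to compute it with a window spanning the whole frame and filter against that column. The following is only a sketch using the same column name and numberDays variable, not part of the original answer; note that Spark will warn that an empty partitionBy() pulls all rows into a single partition:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One window over the entire DataFrame (no partitioning, so Spark emits a performance warning).
w = Window.partitionBy()
df = (
    df
    .withColumn("min_collection_date", F.min("collection_date").over(w))
    .filter(F.datediff(F.col("collection_date"), F.col("min_collection_date")) >= numberDays)
    .drop("min_collection_date")
)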