Calculate difference between date column entries and date minimum in PySpark

I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark DataFrame, and one of its columns consists of dates. I want to compute the difference between each date in that column and the minimum date in the column, so that I can filter to the past numberDays. I’ve tried several possibilities, but nothing seems to work. Here is my most recent attempt:

df = df.filter(
        F.datediff(
            F.col("collection_date"),
            F.lit(F.min(F.col("collection_date")))
        ) >= numberDays
    )

But I’ve also tried:

df_new = df.withColumn("days", df.select("collection_date") - df.select("collection_date").min())

and

df_new = df.withColumn("days", df.select("collection_date") - df.select(F.min("collection_date")))

There are probably a few others I’ve tried as well, but I can’t seem to get this to work, although I’m sure there’s an incredibly simple answer.
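
For anyone who wants to run the snippets on this page end to end, a tiny illustrative DataFrame is enough. The column name matches the question; the sample dates, the numberDays value, and the local SparkSession are just assumptions for the sketch:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data only; the real DataFrame comes from elsewhere.
df = spark.createDataFrame(
    [("2022-01-01",), ("2022-01-15",), ("2022-03-01",)],
    ["collection_date"],
).withColumn("collection_date", F.to_date("collection_date"))

numberDays = 30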


Answer

I found a solution that I don’t really care for, but it appears to work.

df = df.filter(
        F.datediff(
            F.col("collection_date"),
            # Pull the global minimum back to the driver as a Python date,
            # then embed it in the comparison as a literal.
            F.lit(df.agg(F.min(df["collection_date"])).collect()[0][0])
        ) >= numberDays
    )

I don’t think it’s particularly good practice to put a collect() operation in the middle of the code, but this works. If anyone has a more “Sparky” solution, please let me know.

EDIT 3/21/2022: Here is a more Spark-y way of doing this:

df = (
        df
        .sort(F.col("collection_date").asc())
        .filter(
            F.datediff(
                F.col("collection_date"),
                # first() fetches only the single aggregated row; the value is
                # addressed by its auto-generated column name "min(collection_date)".
                F.lit(df.select(F.min("collection_date")).first()["min(collection_date)"])
            ) >= numberDays
        )
    )
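
For completeness, here is one way to keep the whole computation on the executors, with no collect() or first() at all: compute the minimum as a window aggregate over the entire DataFrame and compare it against each row. This is just a sketch of the idea, not part of the original answer, and the helper column name min_collection_date is made up:

from pyspark.sql import functions as F, Window

# Unpartitioned window: the aggregate is evaluated over the whole DataFrame.
# (Spark warns that this moves all rows to a single partition for this step.)
w = Window.partitionBy()

df = (
    df
    .withColumn("min_collection_date", F.min("collection_date").over(w))
    .filter(F.datediff(F.col("collection_date"), F.col("min_collection_date")) >= numberDays)
    .drop("min_collection_date")
)

The trade-off is the usual one for unpartitioned windows: the aggregation funnels everything through a single partition, so for a simple global minimum the agg/collect version above is often just as reasonable.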