I feel like this is a stupid question, but I cannot seem to figure it out, so here goes. I have a PySpark DataFrame, and one of the columns consists of dates. I want to compute the difference between each date in this column and the minimum date in the column, for the purpose of filtering to the past numberDays. I’ve tried several possibilities, but nothing seems to work. Here is my most recent attempt:
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(F.min(F.col("collection_date"))),
    )
    >= numberDays
)
But I’ve also tried:
df_new = df.withColumn("days", df.select("collection_date") - df.select("collection_date").min())
and
df_new = df.withColumn("days", df.select("collection_date") - df.select(F.min("collection_date")))
There are probably a few others, but I can’t seem to get this to work, although I’m sure there’s an incredibly simple answer.
Answer
I found a solution that I don’t really care for, but it appears to work.
df = df.filter(
    F.datediff(
        F.col("collection_date"),
        F.lit(df.agg(F.min(df["collection_date"])).collect()[0][0]),
    )
    >= numberDays
)
I don’t think it’s particularly good practice to put a collect() operation in the middle of the code, but this works. If anyone has a more “Sparky” solution, please let me know.
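For readability, the same collect()-based approach can be split into two steps, first pulling the minimum date back to the driver as a plain Python value and then filtering against it (a sketch using the same column name and numberDays variable, not a different technique):

# Bring the minimum date back to the driver as a scalar, then compare against it.
min_date = df.agg(F.min("collection_date")).collect()[0][0]
df = df.filter(F.datediff(F.col("collection_date"), F.lit(min_date)) >= numberDays)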
EDIT 3/21/2022: Here is a more Spark-y way of doing this:
df = (
    df
    .sort(F.col("collection_date").asc())
    .filter(
        F.datediff(
            F.col("collection_date"),
            F.lit(df.select(F.min("collection_date")).first()["min(collection_date)"]),
        )
        >= numberDays
    )
)
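For completeness, one way to avoid running an action (collect() or first()) just to get the minimum is to compute it with a window spanning the whole frame and filter against that column. The following is only a sketch using the same column name and numberDays variable, not part of the original answer; note that Spark will warn that an empty partitionBy() pulls all rows into a single partition:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One window over the entire DataFrame (no partitioning, so Spark emits a performance warning).
w = Window.partitionBy()
df = (
    df
    .withColumn("min_collection_date", F.min("collection_date").over(w))
    .filter(F.datediff(F.col("collection_date"), F.col("min_collection_date")) >= numberDays)
    .drop("min_collection_date")
)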