Filter PySpark dataframe column with None value

Question

I'm trying to filter a PySpark dataframe that has None as a row value. I can filter correctly with a string value, but the same filter with None fails, even though there are definitely values in each category. What's going on?

Accepted Answer

You can use Column.isNull / Column.isNotNull:

    from pyspark.sql.functions import col

    df.where(col("dt_mvmt").isNull())

    df.where(col("dt_mvmt").isNotNull())

If you simply want to drop NULL values, you can use na.drop with the subset argument:

    df.na.drop(subset=["dt_mvmt"])

Equality-based comparisons with NULL won't work because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:

    sqlContext.sql("SELECT NULL = NULL").show()
    ## +-------------+
    ## |(NULL = NULL)|
    ## +-------------+
    ## |         null|
    ## +-------------+

    sqlContext.sql("SELECT NULL != NULL").show()
    ## +-------------------+
    ## |(NOT (NULL = NULL))|
    ## +-------------------+
    ## |               null|
    ## +-------------------+

The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
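The NULL comparison rule described above is general SQL behaviour rather than something specific to Spark. As a rough sketch, it can be reproduced without a Spark session using Python's standard-library sqlite3 module. Note that this is SQLite, not Spark, and the `movements` table and its sample rows are hypothetical, made up purely for illustration; only the column name `dt_mvmt` comes from the question.

```python
import sqlite3

# In-memory SQLite database. The table name and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movements (id INTEGER, dt_mvmt TEXT)")
conn.executemany(
    "INSERT INTO movements VALUES (?, ?)",
    [(1, "2016-03-27"), (2, None), (3, "2016-03-28"), (4, None)],
)

# Comparing NULL with anything, even another NULL, yields NULL
# (surfaced in Python as None), never true or false.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]
null_ne = conn.execute("SELECT NULL != NULL").fetchone()[0]

# WHERE keeps only rows whose predicate is true, so "= NULL" matches nothing.
eq_rows = conn.execute(
    "SELECT id FROM movements WHERE dt_mvmt = NULL"
).fetchall()

# IS NULL / IS NOT NULL are the predicates that actually test for NULL.
is_null_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM movements WHERE dt_mvmt IS NULL ORDER BY id"
    )
]
not_null_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM movements WHERE dt_mvmt IS NOT NULL ORDER BY id"
    )
]

print(null_eq, null_ne, eq_rows, is_null_ids, not_null_ids)
```

The equality filter silently returns no rows instead of raising an error, which is why the original filter looks as if it "fails" even though NULL rows are present.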

Answer