I’m trying to filter a PySpark DataFrame that has None as a row value:
```python
df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]
```
and I can filter correctly with a string value:
```python
df[df.dt_mvmt == '2016-03-31']
# some results here
```
but this fails:
```python
df[df.dt_mvmt == None].count()
## 0

df[df.dt_mvmt != None].count()
## 0
```
But there are definitely values in each category. What’s going on?
Answer
You can use Column.isNull / Column.isNotNull:
```python
from pyspark.sql.functions import col

df.where(col("dt_mvmt").isNull())

df.where(col("dt_mvmt").isNotNull())
```
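If you prefer SQL expression strings, the same predicates work inside filter / where (the two are aliases); a minimal sketch, reusing the df from the question:

```python
# SQL-expression form of the same null checks.
df.filter("dt_mvmt IS NULL").count()
df.filter("dt_mvmt IS NOT NULL").count()
```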
If you want to simply drop NULL values, you can use na.drop with the subset argument:
```python
df.na.drop(subset=["dt_mvmt"])
```
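Without subset, na.drop removes any row containing a null in any column, so the subset argument is how you scope it to dt_mvmt. A quick sketch of the difference, assuming a SparkSession named spark is available (toy data is hypothetical, mirroring the question’s column):

```python
from pyspark.sql import Row

toy = spark.createDataFrame([
    Row(dt_mvmt='2016-03-27'),
    Row(dt_mvmt=None),
    Row(dt_mvmt='2016-03-31'),
])

toy.count()                              # 3
toy.na.drop(subset=["dt_mvmt"]).count()  # 2: the null row is gone
```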
Equality-based comparisons with NULL won’t work, because in SQL NULL is undefined, so any attempt to compare it with another value returns NULL:
```python
sqlContext.sql("SELECT NULL = NULL").show()
## +-------------+
## |(NULL = NULL)|
## +-------------+
## |         null|
## +-------------+

sqlContext.sql("SELECT NULL != NULL").show()
## +-------------------+
## |(NOT (NULL = NULL))|
## +-------------------+
## |               null|
## +-------------------+
```
The only valid way to compare a value with NULL is IS / IS NOT, which are equivalent to the isNull / isNotNull method calls.
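One related detail: Spark SQL also provides a null-safe equality operator, <=>, which treats two NULLs as equal, and newer Spark versions (2.3+) expose the same behavior on the Column API as eqNullSafe. A minimal sketch, reusing df and sqlContext from above:

```python
# Null-safe equality: NULL <=> NULL evaluates to true rather than NULL.
sqlContext.sql("SELECT NULL <=> NULL").show()

# Column API equivalent (Spark 2.3+); counts the null rows, like isNull().
df.where(df.dt_mvmt.eqNullSafe(None)).count()
```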