
How to filter multiple rows based on rows and columns condition in pyspark

I want to filter rows based on the "value" column, with a different range per channel. For example, I want to keep velocity rows from the channel_name column where value >= 1 and value <= 5, and Temp rows from the channel_name column where value >= 0 and value <= 2. Below is my PySpark DataFrame.

start_timestamp channel_name value
2020-11-02 08:51:50 velocity 1
2020-11-02 09:14:29 Temp 0
2020-11-02 09:18:32 velocity 0
2020-11-02 09:32:42 velocity 4
2020-11-03 13:06:03 Temp 2
2020-11-03 13:10:01 Temp 1
2020-11-03 13:54:38 Temp 5
2020-11-03 14:46:25 velocity 5
2020-11-03 14:57:31 Kilometer 6
2020-11-03 15:07:07 Kilometer 7

Expected DF:

start_timestamp channel_name value
2020-11-02 08:51:50 velocity 1
2020-11-02 09:32:42 velocity 4
2020-11-03 14:46:25 velocity 5
2020-11-02 09:14:29 Temp 0
2020-11-03 13:06:03 Temp 2
2020-11-03 13:10:01 Temp 1

I tried this for the velocity channel_name alone and it works fine.

df1=df.filter((df.channel_name ==  "velocity") & (df.interpreted_value >= 1 )  & (df.interpreted_value <= 5))

But I don't know how to do this for multiple channel_name values such as velocity and Temp. Below is the code I tried; please let me know whether this is the right way to do it, or how else I can do it.

df1=df.filter(((df.channel_name ==  "velocity") & (df.interpreted_value >= 1 )  & 
                  (df.interpreted_value <= 5)) &
              ((df.channel_name ==  "Temp") & (df.interpreted_value >= 0 )  & 
                  (df.interpreted_value <= 2)))

Answer

You need to use an or (|) operator instead of an and (&) operator when you combine the two clauses:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session (e.g. in the pyspark shell) or create one.
spark = SparkSession.builder.getOrCreate()

# Sample data from the question.
df = spark.createDataFrame([
  ("2020-11-02 08:51:50", "velocity", 1),
  ("2020-11-02 09:14:29", "Temp", 0),
  ("2020-11-02 09:18:32", "velocity", 0),
  ("2020-11-02 09:32:42", "velocity", 4),
  ("2020-11-03 13:06:03", "Temp", 2),
  ("2020-11-03 13:10:01", "Temp", 1),
  ("2020-11-03 13:54:38", "Temp", 5),
  ("2020-11-03 14:46:25", "velocity", 5),
  ("2020-11-03 14:57:31", "Kilometer", 6),
  ("2020-11-03 15:07:07", "Kilometer", 7)],
  ["start_timestamp", "channel_name", "value"]
).withColumn("start_timestamp", F.to_timestamp("start_timestamp"))

df_filtered = df.filter(
    ((df.channel_name == "velocity") & (df.value >= 1) & (df.value <= 5))
    |  # or instead of and
    ((df.channel_name == "Temp") & (df.value >= 0) & (df.value <= 2))
)
df_filtered.show()

Output:

+-------------------+------------+-----+
|    start_timestamp|channel_name|value|
+-------------------+------------+-----+
|2020-11-02 08:51:50|    velocity|    1|
|2020-11-02 09:14:29|        Temp|    0|
|2020-11-02 09:32:42|    velocity|    4|
|2020-11-03 13:06:03|        Temp|    2|
|2020-11-03 13:10:01|        Temp|    1|
|2020-11-03 14:46:25|    velocity|    5|
+-------------------+------------+-----+
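
If more channels are added later, writing each clause by hand gets tedious. One possible approach, since DataFrame.filter also accepts a SQL expression string, is to generate that string from a dict of ranges. This is only a sketch: the helper name build_range_filter and the dict layout are made up for illustration and are not part of PySpark, and the values are interpolated directly into the string, so it assumes trusted input.

```python
# Hypothetical helper (not a PySpark API): build a SQL filter expression
# from per-channel inclusive ranges.
def build_range_filter(ranges, name_col="channel_name", value_col="value"):
    clauses = [
        f"({name_col} = '{name}' AND {value_col} BETWEEN {low} AND {high})"
        for name, (low, high) in ranges.items()
    ]
    # A row is kept if it matches any one channel's range, so join with OR.
    return " OR ".join(clauses)


# Ranges taken from the question: velocity in [1, 5], Temp in [0, 2].
ranges = {"velocity": (1, 5), "Temp": (0, 2)}
condition = build_range_filter(ranges)
print(condition)
```

With the DataFrame from the answer, the resulting string could then be passed as df.filter(condition). BETWEEN is inclusive on both ends, which matches the >= and <= comparisons above.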

The filter you currently apply will return nothing: combining the two clauses with and requires channel_name to equal "velocity" and "Temp" at the same time, which no row can satisfy. With or, only one of the clauses needs to be true for the row to be included in the result dataframe.
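
The difference can be checked without Spark. The sketch below applies the same two clauses to the sample rows from the question using plain Python; the function names are made up for illustration.

```python
# Sample rows from the question: (start_timestamp, channel_name, value).
rows = [
    ("2020-11-02 08:51:50", "velocity", 1),
    ("2020-11-02 09:14:29", "Temp", 0),
    ("2020-11-02 09:18:32", "velocity", 0),
    ("2020-11-02 09:32:42", "velocity", 4),
    ("2020-11-03 13:06:03", "Temp", 2),
    ("2020-11-03 13:10:01", "Temp", 1),
    ("2020-11-03 13:54:38", "Temp", 5),
    ("2020-11-03 14:46:25", "velocity", 5),
    ("2020-11-03 14:57:31", "Kilometer", 6),
    ("2020-11-03 15:07:07", "Kilometer", 7),
]


def velocity_clause(name, value):
    # Illustrative stand-in for the first filter clause.
    return name == "velocity" and 1 <= value <= 5


def temp_clause(name, value):
    # Illustrative stand-in for the second filter clause.
    return name == "Temp" and 0 <= value <= 2


# Both clauses must hold: impossible, since a row has one channel_name.
with_and = [r for r in rows if velocity_clause(r[1], r[2]) and temp_clause(r[1], r[2])]

# Either clause may hold: keeps the rows from the expected DF.
with_or = [r for r in rows if velocity_clause(r[1], r[2]) or temp_clause(r[1], r[2])]

print(len(with_and))  # 0
print(len(with_or))   # 6
```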

User contributions licensed under: CC BY-SA