I want to filter multiple rows based on the "value" column. For example, I want to keep the rows where channel_name is velocity and value >= 1 and value <= 5, and also the rows where channel_name is Temp and value >= 0 and value <= 2. Below is my PySpark DF.
start_timestamp | channel_name | value |
---|---|---|
2020-11-02 08:51:50 | velocity | 1 |
2020-11-02 09:14:29 | Temp | 0 |
2020-11-02 09:18:32 | velocity | 0 |
2020-11-02 09:32:42 | velocity | 4 |
2020-11-03 13:06:03 | Temp | 2 |
2020-11-03 13:10:01 | Temp | 1 |
2020-11-03 13:54:38 | Temp | 5 |
2020-11-03 14:46:25 | velocity | 5 |
2020-11-03 14:57:31 | Kilometer | 6 |
2020-11-03 15:07:07 | Kilometer | 7 |
Expected DF:
start_timestamp | channel_name | value |
---|---|---|
2020-11-02 08:51:50 | velocity | 1 |
2020-11-02 09:32:42 | velocity | 4 |
2020-11-03 14:46:25 | velocity | 5 |
2020-11-02 09:14:29 | Temp | 0 |
2020-11-03 13:06:03 | Temp | 2 |
2020-11-03 13:10:01 | Temp | 1 |
I tried this for channel_name velocity and it works fine:
df1=df.filter((df.channel_name == "velocity") & (df.interpreted_value >= 1 ) & (df.interpreted_value <= 5))
But I don't know how to do this for multiple channel_name values such as velocity and Temp. Below is my code. Please let me know whether this is the right way to do it, or how I can do it otherwise.
df1=df.filter(((df.channel_name == "velocity") & (df.interpreted_value >= 1 ) & (df.interpreted_value <= 5)) & ((df.channel_name == "Temp") & (df.interpreted_value >= 0 ) & (df.interpreted_value <= 2)))
Answer
You need to use an or (|) instead of an and (&) operator when you combine the clauses:
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    df = spark.createDataFrame([
        ("2020-11-02 08:51:50", "velocity", 1),
        ("2020-11-02 09:14:29", "Temp", 0),
        ("2020-11-02 09:18:32", "velocity", 0),
        ("2020-11-02 09:32:42", "velocity", 4),
        ("2020-11-03 13:06:03", "Temp", 2),
        ("2020-11-03 13:10:01", "Temp", 1),
        ("2020-11-03 13:54:38", "Temp", 5),
        ("2020-11-03 14:46:25", "velocity", 5),
        ("2020-11-03 14:57:31", "Kilometer", 6),
        ("2020-11-03 15:07:07", "Kilometer", 7)],
        ["start_timestamp", "channel_name", "value"]
    ).withColumn("start_timestamp", F.to_timestamp("start_timestamp"))

    df_filtered = df.filter(
        ((df.channel_name == "velocity") & (df.value >= 1) & (df.value <= 5))
        |  # or instead of and
        ((df.channel_name == "Temp") & (df.value >= 0) & (df.value <= 2))
    )
    df_filtered.show()
Output:
    +-------------------+------------+-----+
    |    start_timestamp|channel_name|value|
    +-------------------+------------+-----+
    |2020-11-02 08:51:50|    velocity|    1|
    |2020-11-02 09:14:29|        Temp|    0|
    |2020-11-02 09:32:42|    velocity|    4|
    |2020-11-03 13:06:03|        Temp|    2|
    |2020-11-03 13:10:01|        Temp|    1|
    |2020-11-03 14:46:25|    velocity|    5|
    +-------------------+------------+-----+
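If more channels get their own ranges later, chaining the clauses by hand gets error-prone. As a rough sketch, the condition could instead be generated from a dictionary. The helper name `build_condition` and the `ranges` dictionary are made up for illustration. The only PySpark behaviour this relies on is that `DataFrame.filter` also accepts a SQL expression string, and that SQL `BETWEEN` includes both bounds.

```python
# Hypothetical helper, not part of PySpark: builds a SQL condition string
# from a {channel_name: (low, high)} mapping.
def build_condition(ranges):
    clauses = [
        f"(channel_name = '{name}' AND value BETWEEN {low} AND {high})"
        for name, (low, high) in ranges.items()
    ]
    # A row is kept if it satisfies any one clause, hence OR.
    return " OR ".join(clauses)


# Bounds taken from the question; the dictionary itself is an assumption.
ranges = {"velocity": (1, 5), "Temp": (0, 2)}
condition = build_condition(ranges)
print(condition)

# With the DataFrame from the answer, this would be used as:
# df_filtered = df.filter(condition)
```

This plain string formatting does not escape quotes, so it is only suitable for channel names you control yourself.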
The filter you currently apply will return nothing, because it requires channel_name to equal "velocity" and, at the same time, to equal "Temp". No single row can satisfy both checks. With or, only one of the clauses needs to be true for the row to be included in the result dataframe.
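The difference can be checked without Spark. The following is a minimal plain-Python sketch over the same rows as in the question; the function names `in_velocity_range` and `in_temp_range` are invented for this illustration.

```python
# Same rows as in the question, as (start_timestamp, channel_name, value) tuples.
rows = [
    ("2020-11-02 08:51:50", "velocity", 1),
    ("2020-11-02 09:14:29", "Temp", 0),
    ("2020-11-02 09:18:32", "velocity", 0),
    ("2020-11-02 09:32:42", "velocity", 4),
    ("2020-11-03 13:06:03", "Temp", 2),
    ("2020-11-03 13:10:01", "Temp", 1),
    ("2020-11-03 13:54:38", "Temp", 5),
    ("2020-11-03 14:46:25", "velocity", 5),
    ("2020-11-03 14:57:31", "Kilometer", 6),
    ("2020-11-03 15:07:07", "Kilometer", 7),
]


def in_velocity_range(row):
    # velocity rows with 1 <= value <= 5
    return row[1] == "velocity" and 1 <= row[2] <= 5


def in_temp_range(row):
    # Temp rows with 0 <= value <= 2
    return row[1] == "Temp" and 0 <= row[2] <= 2


# Combining with "and": a row would have to be velocity and Temp at once.
with_and = [r for r in rows if in_velocity_range(r) and in_temp_range(r)]
# Combining with "or": a row only has to match one of the two clauses.
with_or = [r for r in rows if in_velocity_range(r) or in_temp_range(r)]

print(len(with_and))  # 0
print(len(with_or))   # 6
```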