How to filter multiple rows based on row and column conditions in PySpark

Question

I want to filter rows based on the "value" column, using a different range for each channel. For example, I want to keep the velocity rows from the channel_name column where value >= 1 and value <= 5, and the Temp rows from the channel_name column where value >= 0 and value <= 2. Below is my PySpark DataFrame:

start_timestamp	channel_name	value
2020-11-02 08:51:50	velocity	1
2020-11-02 09:14:29	Temp	0
2020-11-02 09:18:32	velocity	0
2020-11-02 09:32:42	velocity	4

Accepted Answer

You need to use an or (|) instead of an and (&) operator when you combine the two clauses:

    import pyspark.sql.functions as F
    from pyspark.sql import SparkSession

    # Reuse the active session, or create one when running as a standalone script.
    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [
            ("2020-11-02 08:51:50", "velocity", 1),
            ("2020-11-02 09:14:29", "Temp", 0),
            ("2020-11-02 09:18:32", "velocity", 0),
            ("2020-11-02 09:32:42", "velocity", 4),
            ("2020-11-03 13:06:03", "Temp", 2),
            ("2020-11-03 13:10:01", "Temp", 1),
            ("2020-11-03 13:54:38", "Temp", 5),
            ("2020-11-03 14:46:25", "velocity", 5),
            ("2020-11-03 14:57:31", "Kilometer", 6),
            ("2020-11-03 15:07:07", "Kilometer", 7),
        ],
        ["start_timestamp", "channel_name", "value"],
    ).withColumn("start_timestamp", F.to_timestamp("start_timestamp"))

    # Keep a row if it satisfies either the velocity clause or the Temp clause.
    df_filtered = df.filter(
        ((df.channel_name == "velocity") & (df.value >= 1) & (df.value <= 5))
        |  # or instead of and
        ((df.channel_name == "Temp") & (df.value >= 0) & (df.value <= 2))
    )

    df_filtered.show()

Output:

    +-------------------+------------+-----+
    |    start_timestamp|channel_name|value|
    +-------------------+------------+-----+
    |2020-11-02 08:51:50|    velocity|    1|
    |2020-11-02 09:14:29|        Temp|    0|
    |2020-11-02 09:32:42|    velocity|    4|
    |2020-11-03 13:06:03|        Temp|    2|
    |2020-11-03 13:10:01|        Temp|    1|
    |2020-11-03 14:46:25|    velocity|    5|
    +-------------------+------------+-----+

The filter you currently apply returns nothing, because it first requires channel_name to equal one specific string and then requires it to equal a different specific string, which no single row can satisfy. With or, only one of the clauses needs to be true for the row to be included in the result DataFrame.
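To see why and returns nothing while or returns six rows, here is a minimal plain-Python sketch of the same logic, with no Spark involved. The rows and ranges are copied from the example above; the helper names velocity_clause and temp_clause are hypothetical and exist only for this illustration.

```python
# Same rows as the example DataFrame: (start_timestamp, channel_name, value).
rows = [
    ("2020-11-02 08:51:50", "velocity", 1),
    ("2020-11-02 09:14:29", "Temp", 0),
    ("2020-11-02 09:18:32", "velocity", 0),
    ("2020-11-02 09:32:42", "velocity", 4),
    ("2020-11-03 13:06:03", "Temp", 2),
    ("2020-11-03 13:10:01", "Temp", 1),
    ("2020-11-03 13:54:38", "Temp", 5),
    ("2020-11-03 14:46:25", "velocity", 5),
    ("2020-11-03 14:57:31", "Kilometer", 6),
    ("2020-11-03 15:07:07", "Kilometer", 7),
]


def velocity_clause(name, value):
    # Hypothetical helper: velocity rows with 1 <= value <= 5.
    return name == "velocity" and 1 <= value <= 5


def temp_clause(name, value):
    # Hypothetical helper: Temp rows with 0 <= value <= 2.
    return name == "Temp" and 0 <= value <= 2


# Combined with "and": a row would need to be velocity and Temp at once.
kept_with_and = [
    r for r in rows if velocity_clause(r[1], r[2]) and temp_clause(r[1], r[2])
]

# Combined with "or": a row passes if either clause holds.
kept_with_or = [
    r for r in rows if velocity_clause(r[1], r[2]) or temp_clause(r[1], r[2])
]

print(len(kept_with_and))  # 0
print(len(kept_with_or))   # 6
```

The six rows kept by the or version are the same six rows shown in the Spark output above; the Kilometer rows and the out-of-range velocity and Temp rows are dropped.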

start_timestamp	channel_name	value
2020-11-02 08:51:50	velocity	1
2020-11-02 09:14:29	Temp	0
2020-11-02 09:18:32	velocity	0
2020-11-02 09:32:42	velocity	4
2020-11-03 13:06:03	Temp	2
2020-11-03 13:10:01	Temp	1
2020-11-03 13:54:38	Temp	5
2020-11-03 14:46:25	velocity	5
2020-11-03 14:57:31	Kilometer	6
2020-11-03 15:07:07	Kilometer	7

