I’m totally new to PySpark. Since PySpark doesn’t have the loc feature, how can I write this logic? I tried specifying conditions but couldn’t get the desired result. Any help would be greatly appreciated!
df['Total'] = (df['level1'] + df['level2'] + df['level3'] + df['level4']) / df['Number']
df.loc[df['level4'] > 0, 'Total'] += 4
df.loc[((df['level3'] > 0) & (df['Total'] < 1)), 'Total'] += 3
df.loc[((df['level2'] > 0) & (df['Total'] < 1)), 'Total'] += 2
df.loc[((df['level1'] > 0) & (df['Total'] < 1)), 'Total'] += 1
Answer
For data like the following
from pyspark.sql import functions as func

data_ls = [
    (1, 1, 1, 1, 10),
    (5, 5, 5, 5, 10)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['level1', 'level2', 'level3', 'level4', 'number'])

# +------+------+------+------+------+
# |level1|level2|level3|level4|number|
# +------+------+------+------+------+
# |     1|     1|     1|     1|    10|
# |     5|     5|     5|     5|    10|
# +------+------+------+------+------+
You’re actually updating the total column in each statement, not in an if-then-else way: each .loc line sees the values left behind by the previous one. Your code can be replicated (as is) in PySpark using multiple withColumn() calls with when(), like the following.
data_sdf. \
    withColumn('total',
               (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')). \
    withColumn('total',
               func.when(func.col('level4') > 0, func.col('total') + 4).otherwise(func.col('total'))). \
    withColumn('total',
               func.when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).otherwise(func.col('total'))). \
    withColumn('total',
               func.when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).otherwise(func.col('total'))). \
    withColumn('total',
               func.when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).otherwise(func.col('total'))). \
    show()

# +------+------+------+------+------+-----+
# |level1|level2|level3|level4|number|total|
# +------+------+------+------+------+-----+
# |     1|     1|     1|     1|    10|  4.4|
# |     5|     5|     5|     5|    10|  6.0|
# +------+------+------+------+------+-----+
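As a cross-check, here is a minimal, self-contained pandas sketch (the sample frame below is assumed, mirroring the two rows above) showing that the original sequential .loc logic produces the same totals on this data:

```python
import pandas as pd

df = pd.DataFrame({
    'level1': [1, 5], 'level2': [1, 5], 'level3': [1, 5],
    'level4': [1, 5], 'Number': [10, 10],
})

# base ratio, then sequential in-place bumps; each .loc condition
# is evaluated against the already-updated 'Total' column
df['Total'] = (df['level1'] + df['level2'] + df['level3'] + df['level4']) / df['Number']
df.loc[df['level4'] > 0, 'Total'] += 4
df.loc[(df['level3'] > 0) & (df['Total'] < 1), 'Total'] += 3
df.loc[(df['level2'] > 0) & (df['Total'] < 1), 'Total'] += 2
df.loc[(df['level1'] > 0) & (df['Total'] < 1), 'Total'] += 1

print(df['Total'].tolist())
```

On this data only the level4 condition fires (after adding 4, Total is no longer below 1), so the result matches the PySpark output of 4.4 and 6.0.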
We can merge all the withColumn() calls with when() into a single withColumn() with multiple chained when() statements.
data_sdf. \
    withColumn('total',
               (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')). \
    withColumn('total',
               func.when(func.col('level4') > 0, func.col('total') + 4).
               when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).
               when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).
               when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).
               otherwise(func.col('total'))
               ). \
    show()

# +------+------+------+------+------+-----+
# |level1|level2|level3|level4|number|total|
# +------+------+------+------+------+-----+
# |     1|     1|     1|     1|    10|  4.4|
# |     5|     5|     5|     5|    10|  6.0|
# +------+------+------+------+------+-----+
It’s like numpy.where and SQL’s CASE statement: conditions are evaluated top-down, the first match wins, and otherwise() supplies the fallback.
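To illustrate that analogy, here is a hedged pandas/NumPy sketch (the sample data below is assumed) that mimics the single chained-when() version with numpy.select, which likewise evaluates its conditions top-down and takes the first match:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'level1': [1, 5], 'level2': [1, 5], 'level3': [1, 5],
    'level4': [1, 5], 'number': [10, 10],
})

total = (df['level1'] + df['level2'] + df['level3'] + df['level4']) / df['number']

# np.select: first condition that is True per row picks the
# corresponding choice; `default` plays the role of otherwise()
df['total'] = np.select(
    [
        df['level4'] > 0,
        (df['level3'] > 0) & (total < 1),
        (df['level2'] > 0) & (total < 1),
        (df['level1'] > 0) & (total < 1),
    ],
    [total + 4, total + 3, total + 2, total + 1],
    default=total,
)

print(df['total'].tolist())
```

Note that, like the single-withColumn() version, all conditions here read the original total, not the updated one.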