I’m totally new to Pyspark, as Pyspark doesn’t have loc feature how can we write this logic. I tried by specifying conditions but couldn’t get the desirable result, any help would be greatly appreciated!
JavaScript
x
6
1
df['Total'] = (df['level1']+df['level2']+df['level3']+df['level4'])/df['Number']
2
df.loc[df['level4'] > 0, 'Total'] += 4
3
df.loc[((df['level3'] > 0) & (df['Total'] < 1)), 'Total'] += 3
4
df.loc[((df['level2'] > 0) & (df['Total'] < 1)), 'Total'] += 2
5
df.loc[((df['level1'] > 0) & (df['Total'] < 1)), 'Total'] += 1
6
Advertisement
Answer
For a data like the following
JavaScript
1
15
15
1
data_ls = [
2
(1, 1, 1, 1, 10),
3
(5, 5, 5, 5, 10)
4
]
5
6
data_sdf = spark.sparkContext.parallelize(data_ls).
7
toDF(['level1', 'level2', 'level3', 'level4', 'number'])
8
9
# +------+------+------+------+------+
10
# |level1|level2|level3|level4|number|
11
# +------+------+------+------+------+
12
# | 1| 1| 1| 1| 10|
13
# | 5| 5| 5| 5| 10|
14
# +------+------+------+------+------+
15
You’re actually updating total
column in each statement, not in an if-then-else way. Your code can be replicated (as is) in pyspark using multiple withColumn()
with when()
like the following.
JavaScript
1
15
15
1
data_sdf.
2
withColumn('total', (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')).
3
withColumn('total', func.when(func.col('level4') > 0, func.col('total') + 4).otherwise(func.col('total'))).
4
withColumn('total', func.when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).otherwise(func.col('total'))).
5
withColumn('total', func.when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).otherwise(func.col('total'))).
6
withColumn('total', func.when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).otherwise(func.col('total'))).
7
show()
8
9
# +------+------+------+------+------+-----+
10
# |level1|level2|level3|level4|number|total|
11
# +------+------+------+------+------+-----+
12
# | 1| 1| 1| 1| 10| 4.4|
13
# | 5| 5| 5| 5| 10| 6.0|
14
# +------+------+------+------+------+-----+
15
We can merge all the withColumn()
with when()
into a single withColumn()
with multiple when()
statements.
JavaScript
1
18
18
1
data_sdf.
2
withColumn('total', (func.col('level1') + func.col('level2') + func.col('level3') + func.col('level4')) / func.col('number')).
3
withColumn('total',
4
func.when(func.col('level4') > 0, func.col('total') + 4).
5
when((func.col('level3') > 0) & (func.col('total') < 1), func.col('total') + 3).
6
when((func.col('level2') > 0) & (func.col('total') < 1), func.col('total') + 2).
7
when((func.col('level1') > 0) & (func.col('total') < 1), func.col('total') + 1).
8
otherwise(func.col('total'))
9
).
10
show()
11
12
# +------+------+------+------+------+-----+
13
# |level1|level2|level3|level4|number|total|
14
# +------+------+------+------+------+-----+
15
# | 1| 1| 1| 1| 10| 4.4|
16
# | 5| 5| 5| 5| 10| 6.0|
17
# +------+------+------+------+------+-----+
18
It’s like numpy.where
and SQL’s case
statements.