I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this:
SD= (reduce(sqrt((add, (abs(col(x)-col("mean"))**2 for x in df.columns[3:])) / n))).alias("SD")
dfS = df.withColumn("SD",SD)
dfS.select("stddev").show()
but I got the following error:
AttributeError: 'builtin_function_or_method' object has no attribute '_get_object_id'
Answer
Your code is mixed up (in its current state it won't even raise the exception you describe in the question). sqrt should be placed outside the reduce call, so that it is applied once to the sum of squared deviations:
from pyspark.sql.functions import col, sqrt
from operator import add
from functools import reduce

# Only the first three columns are named; the rest are auto-named _4, _5, _6.
df = spark.createDataFrame([("_", "_", 2, 1, 2, 3)], ("_1", "_2", "mean"))
cols = df.columns[3:]

# Sum the squared deviations with reduce, divide by n - 1, then take the root.
sd = sqrt(
    reduce(add, ((col(x) - col("mean")) ** 2 for x in cols)) / (len(cols) - 1)
)

sd
# Column<b'SQRT((((POWER((_4 - mean), 2) + POWER((_5 - mean), 2)) + POWER((_6 - mean), 2)) / 2))'>


df.withColumn("sd", sd).show()
# +---+---+----+---+---+---+---+
# | _1| _2|mean| _4| _5| _6| sd|
# +---+---+----+---+---+---+---+
# |  _|  _|   2|  1|  2|  3|1.0|
# +---+---+----+---+---+---+---+
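Note that the expression above divides by len(cols) - 1, which gives the sample standard deviation, while the attempt in the question divided by n, which would give the population standard deviation. The plain-Python sketch below is only an illustration of the same reduce-then-sqrt pattern on a single row, so the arithmetic can be checked without a Spark session. The helper name row_sd and its ddof parameter are hypothetical and not part of PySpark.

```python
from functools import reduce
from math import isclose, sqrt
from operator import add
import statistics


def row_sd(values, mean, ddof=1):
    """Standard deviation of one row, given its precomputed mean.

    ddof=1 divides by n - 1 (sample SD); ddof=0 divides by n (population SD).
    row_sd and ddof are illustrative names, not a PySpark API.
    """
    # Fold the squared deviations into a single total, mirroring what
    # reduce(add, ...) does with the Column expressions.
    total = reduce(add, ((v - mean) ** 2 for v in values))
    # Apply the square root once, after dividing the total.
    return sqrt(total / (len(values) - ddof))


row = [1, 2, 3]  # the _4, _5, _6 values from the example row
sample_sd = row_sd(row, mean=2)
population_sd = row_sd(row, mean=2, ddof=0)

# Cross-check against the standard library.
assert isclose(sample_sd, statistics.stdev(row))
assert isclose(population_sd, statistics.pstdev(row))
```

If the population standard deviation is what you need, dividing by len(cols) instead of len(cols) - 1 in the Spark expression should be the corresponding change.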