Tag: apache-spark

Median and quantile values in PySpark

In my dataframe I have an age column. The total number of rows is approximately 77 billion. I want to calculate the quantile values of that column using PySpark. I have some code, but the computation time is huge (maybe my approach is very inefficient). Is there a good way to improve this? Dataframe example: What I have do…

Spark Calculate Standard deviation row wise

I need to calculate the standard deviation row-wise, assuming that I already have a column with the calculated mean per row. I tried this, but I got the following error: Answer: Your code is completely mixed up (in its current state it won’t even cause the exception you described in the question). sqrt should be pla…

No FileSystem for scheme: s3 with pyspark

I’m trying to read a txt file from S3 with Spark, but I’m getting this error: This is my code: This is the full traceback: How can I fix this? Answer: If you are using a local machine, you can use boto3 (do not forget to set up your AWS S3 credentials). Another clean solution if you are using an AW…