I have a CSV file which is comma separated. One of the columns contains data that is itself comma separated. Each row in that column has a different number of words, hence a different number of commas. When I access it or perform any operation like filtering (after splitting the data), it throws errors in PySpark. How shall I handle this?
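The usual fix is to let a real CSV parser handle the quoting instead of splitting each line on raw commas. A minimal sketch, assuming the multi-word column is quoted in the file (that quoting is an assumption, since the excerpt does not show the data): Python's `csv` module treats a quoted field as a single column even when it contains commas, and the same per-line parse can be applied inside an RDD `map`.

```python
import csv
import io

# A sample line whose third column contains embedded commas,
# protected by quotes (hypothetical data, for illustration only).
line = 'id1,2021-01-01,"red,green,blue",42'

# csv.reader respects the quotes, so the quoted field stays one column.
row = next(csv.reader(io.StringIO(line)))
# row -> ['id1', '2021-01-01', 'red,green,blue', '42']

# Now the inner field can be split safely, however many words it holds.
words = row[2].split(',')
# words -> ['red', 'green', 'blue']
```

In PySpark the same idea would look like `rdd.map(lambda l: next(csv.reader([l])))`, or the parsing can be avoided entirely with `spark.read.csv(path, quote='"')`, which understands quoted fields natively.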
Tag: rdd
Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python
I want to share this particular Apache Spark with Python solution because the documentation for it is quite poor. I wanted to calculate the average value of the K/V pairs (stored in a pairwise RDD), by key. Here is what the sample data looks like: Now, the following code sequence is a less-than-optimal way to do it, but it does work.
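The standard pattern for a per-key average (not necessarily the author's exact code, which is cut off in this excerpt) is to tag each value with a count, reduce by key, then divide. A plain-Python sketch of the same logic the PySpark chain `rdd.mapValues(lambda v: (v, 1)).reduceByKey(...).mapValues(...)` performs, using made-up sample pairs:

```python
# Sample (K, V) pairs, stand-ins for the author's data (hypothetical values).
pairs = [("a", 2.0), ("b", 4.0), ("a", 4.0), ("b", 8.0), ("a", 6.0)]

# Step 1: mapValues(lambda v: (v, 1)) -- pair each value with a count of 1.
tagged = [(k, (v, 1)) for k, v in pairs]

# Step 2: reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
# -- sum the values and the counts per key.
totals = {}
for k, (v, c) in tagged:
    s, n = totals.get(k, (0.0, 0))
    totals[k] = (s + v, n + c)

# Step 3: mapValues(lambda sc: sc[0] / sc[1]) -- divide sum by count.
averages = {k: s / n for k, (s, n) in totals.items()}
# averages -> {"a": 4.0, "b": 6.0}
```

Tagging with a count before reducing matters because a plain `mean` of partial averages would be wrong when keys have different numbers of values; carrying `(sum, count)` through the reduce keeps the computation associative, which is what `reduceByKey` requires.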