
Calculating the averages for each KEY in a Pairwise (K,V) RDD in Spark with Python

I want to share this Apache Spark with Python solution because the documentation for it is quite poor.

I wanted to calculate the average value of K/V pairs (stored in a Pairwise RDD), by KEY. Here is what the sample data looks like:

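A minimal sketch of that shape, assuming date-string keys, float values, and an RDD named rdd1 (the specific keys, values, and names here are illustrative assumptions):

# Assumes a running SparkContext named `sc` (e.g. from the pyspark shell).
# Hypothetical (key, value) pairs: date strings mapped to numeric readings.
rdd1 = sc.parallelize([
    (u'2013-10-09', 7.60),
    (u'2013-10-09', 1.25),
    (u'2013-10-10', 9.32),
    (u'2013-10-10', 0.55),
])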

Now, the following code sequence is a less-than-optimal way to do it, but it does work. It is what I was doing before I figured out a better solution. It's not terrible, but as you'll see in the answer section, there is a more concise, efficient way.

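A sketch of one such two-step sequence, continuing from the rdd1 above; the exact calls and intermediate names are my assumptions:

import operator

# Count the values per key on the driver, then broadcast the counts,
# e.g. {u'2013-10-09': 2, u'2013-10-10': 2}.
countsByKey = sc.broadcast(rdd1.countByKey())

# Sum the values per key (the numerators).
sums = rdd1.reduceByKey(operator.add)

# Divide each per-key SUM by its COUNT to get the averages.
averages = sums.map(lambda kv: (kv[0], kv[1] / countsByKey.value[kv[0]]))
print(averages.collect())

The drawback of this shape is that it makes two passes over the data (one for countByKey() and one for reduceByKey()) and pulls all of the counts back to the driver.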


Answer

A much better way to do this is with the rdd.aggregateByKey() method. Because this method is so poorly documented in the Apache Spark with Python documentation (which is why I wrote this Q&A), until recently I had been using the code sequence above. But again, it's less efficient, so avoid doing it that way unless necessary.

Here’s how to do the same using the rdd.aggregateByKey() method (recommended):

By KEY, simultaneously calculate the SUM (the numerator for the average we want to compute) and the COUNT (the denominator for the average we want to compute):

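A sketch of that step, again assuming the rdd1 above (aTuple and sumCount are illustrative names):

# The zero value for each key: a (runningSum, runningCount) tuple.
aTuple = (0, 0)

sumCount = rdd1.aggregateByKey(
    aTuple,
    lambda a, b: (a[0] + b, a[1] + 1),        # within each partition, fold one value into the tuple
    lambda a, b: (a[0] + b[0], a[1] + b[1]),  # across partitions, merge two running tuples
)

Unlike the first approach, this computes the sum and the count in a single pass over the data, with no driver-side countByKey().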

Here is what each a and b pair above means (so you can visualize what's happening):

In the first lambda (the within-partition reduction step):
a is a TUPLE holding (runningSum, runningCount).
b is a SCALAR holding the next value in the partition.

In the second lambda (the cross-partition reduction step):
a is a TUPLE holding (runningSum, runningCount).
b is another TUPLE holding the (sum, count) from the next partition.

Finally, calculate the average for each KEY, and collect the results:

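A sketch of that final step, continuing from the sumCount above (averages is an illustrative name):

# Divide each per-key (SUM, COUNT) tuple to produce the average.
averages = sumCount.mapValues(lambda v: v[0] / v[1])
print(averages.collect())
# e.g. [(u'2013-10-09', 4.425), (u'2013-10-10', 4.935)] for the sample data above.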

I hope this question and answer on aggregateByKey() helps.

User contributions licensed under: CC BY-SA