I have a very large list of lists, and I want to use map/reduce techniques (in Python/PySpark) to efficiently calculate the PageRank of the network formed by the elements of those lists, where sharing a list means a link between two elements. I have no idea how to deal with the elements inside the lists, because naively considering all the possible pairs seems prohibitively expensive.
Suppose this is the data (the real list is very large, and some of the inner lists contain hundreds of elements):
data = [[n1, n2], [n1, n3, n4, n5], [n2, n5, n7]]
For example, an edge list like the following would be much easier to work with than what I have:
n1 n2
n1 n3
n1 n4
n1 n5
n3 n4
n3 n5
n4 n5
n2 n5
n2 n7
n5 n7
Ultimately, I want to use the MapReduce technique so that I know how to handle situations like this in the future.
Answer
At first I thought of using a map followed by a reduce to remove duplicate pairs, but the solution below using itertools also works well:
import itertools

data = [["n1", "n2"], ["n1", "n3", "n4", "n5"], ["n2", "n5", "n7"]]

# sc is the SparkContext (available by default in the PySpark shell)
rd = sc.parallelize(data)

# emit every unordered pair of elements that share a list
rd = rd.flatMap(lambda x: itertools.combinations(x, 2))

rd.collect()
# output
Out[60]:
[('n1', 'n2'),
 ('n1', 'n3'),
 ('n1', 'n4'),
 ('n1', 'n5'),
 ('n3', 'n4'),
 ('n3', 'n5'),
 ('n4', 'n5'),
 ('n2', 'n5'),
 ('n2', 'n7'),
 ('n5', 'n7')]
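Note that if the same two elements co-occur in more than one list, combinations will emit that pair more than once; a .distinct() call on the RDD removes those duplicates.

From there you can feed the pairs into the usual PageRank iteration expressed with map/reduce operations. The following is a minimal, illustrative sketch, assuming sc is the SparkContext from the PySpark shell, treating each pair as an undirected edge, and using an assumed damping factor of 0.85 with a fixed 10 iterations:

import itertools

data = [["n1", "n2"], ["n1", "n3", "n4", "n5"], ["n2", "n5", "n7"]]

# unique unordered pairs of elements that share a list
pairs = (sc.parallelize(data)
           .flatMap(lambda x: itertools.combinations(x, 2))
           .distinct())

# treat each pair as an undirected edge: emit both directions,
# then build an adjacency list per node
links = (pairs.flatMap(lambda e: [(e[0], e[1]), (e[1], e[0])])
              .groupByKey()
              .mapValues(list)
              .cache())

# start every node with rank 1.0
ranks = links.mapValues(lambda _: 1.0)

for _ in range(10):  # fixed iteration count, chosen for illustration
    # map step: each node sends rank / degree to every neighbour
    contribs = links.join(ranks).flatMap(
        lambda kv: [(nbr, kv[1][1] / len(kv[1][0])) for nbr in kv[1][0]])
    # reduce step: sum the contributions and apply the damping factor
    ranks = contribs.reduceByKey(lambda a, b: a + b) \
                    .mapValues(lambda s: 0.15 + 0.85 * s)

print(ranks.collect())

This mirrors the structure of the PageRank example that ships with Spark. For a very large graph you would typically avoid groupByKey on a skewed dataset and pre-partition the adjacency RDD, or use a dedicated graph library such as GraphFrames instead of hand-rolled iterations.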