Calculate the minimum distance to destinations for each origin in pyspark

Question

I have a list of origins and destinations along with their geo coordinates. I need to calculate the minimum distance from each origin to the destinations. Below is my code: I got an error like the one below: My question is: there seems to be something wrong with withColumn('Distance', haversine_vector(F.col('Origin_Geo'), F.col('Destination_Geo'))), but I do not know why. (I'm new to pyspark.) I have

Accepted Answer

You are applying the haversine function to a column, whereas it should be applied to a tuple or an array. If you want to use this library, you need to create a UDF and install the haversine package on all your Spark nodes.

    from haversine import haversine
    from pyspark.sql import functions as F, types as T

    haversine_udf = F.udf(haversine, T.FloatType())

    df.withColumn(
        "Distance", haversine_udf(F.col("Origin_Geo"), F.col("Destination_Geo"))
    ).groupBy("Origin").agg(F.min("Distance").alias("Min_Distance")).show()

If you cannot install the package on every node, you can simply use a pure-Python version of the function (cf. Haversine Formula in Python (Bearing and Distance between two GPS points)). The result depends heavily on the radius of the earth you choose.

    from math import radians, cos, sin, asin, sqrt
    from pyspark.sql import functions as F, types as T

    @F.udf(T.FloatType())
    def haversine_udf(point1, point2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1 = point1
        lon2, lat2 = point2
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
        c = 2 * asin(sqrt(a))
        r = 6372.8  # Radius of earth in kilometers. Use 3956 for miles
        return c * r

    df.withColumn(
        "Distance", haversine_udf(F.col("Origin_Geo"), F.col("Destination_Geo"))
    ).groupBy("Origin").agg(F.min("Distance").alias("Min_Distance")).show()

    +------+------------+
    |Origin|Min_Distance|
    +------+------------+
    |     B|   351.08905|
    |     A|   392.32755|
    +------+------------+
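It can help to check the distance logic in plain Python before wrapping it in a UDF. The following is only a sketch, not part of the original answer: the names haversine_km and min_distance_per_origin and the sample rows are hypothetical, and the points are assumed to be in (lon, lat) order, as in the hand-written UDF. Note that the haversine package itself expects (lat, lon) tuples, so the order of the values in Origin_Geo and Destination_Geo matters.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6372.8  # same radius as the hand-written UDF in the answer


def haversine_km(point1, point2):
    """Great-circle distance in km between two (lon, lat) points in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [*point1, *point2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def min_distance_per_origin(rows):
    """Mimic groupBy("Origin").agg(min("Distance")) on plain Python rows.

    rows: iterable of (origin, origin_geo, destination_geo) tuples.
    Returns a dict mapping each origin to its smallest distance in km.
    """
    result = {}
    for origin, origin_geo, destination_geo in rows:
        distance = haversine_km(origin_geo, destination_geo)
        if origin not in result or distance < result[origin]:
            result[origin] = distance
    return result


# Hypothetical sample data, one row per (origin, destination) pair, (lon, lat) order.
rows = [
    ("A", (0.0, 0.0), (0.0, 1.0)),   # 1 degree of latitude away
    ("A", (0.0, 0.0), (0.0, 2.0)),   # 2 degrees of latitude away
    ("B", (10.0, 0.0), (13.0, 0.0)),  # 3 degrees of longitude along the equator
]
print(min_distance_per_origin(rows))
```

Along a meridian or along the equator the great-circle distance is just the radius times the angle in radians, which gives an easy value to compare against.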

Answer