
Calculate the minimum distance to destinations for each origin in PySpark

I have a list of origins and destinations along with their geo coordinates. I need to calculate the minimum distance for each origin to the destinations.

Below is my code:

[code block not preserved]

I get the following error:

[error message not preserved]

My questions are:

  1. Something seems to be wrong with withColumn('Distance', haversine_vector(F.col('Origin_Geo'), F.col('Destination_Geo'))), but I do not know what. (I'm new to PySpark.)

  2. I have long lists of origins and destinations (over 30K each), so a cross join generates a huge number of origin-destination combinations. Is there a more efficient way to get the minimum distance?

Thanks a lot in advance.


Answer

You are applying the haversine function to Spark Column objects, whereas it expects plain Python values: a tuple or an array of coordinates.

If you want to use this library, you need to wrap it in a UDF and install the haversine package on all your Spark nodes.
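A minimal sketch of such a UDF, not necessarily the exact code from the original answer. It assumes Origin_Geo and Destination_Geo hold [lat, lon] pairs, that the cross-joined frame has an Origin column identifying each origin (a hypothetical name), and that the haversine package is importable on every executor. The helper names make_haversine_udf and add_min_distance are made up for illustration.

```python
def make_haversine_udf():
    """Build a Spark UDF wrapping haversine.haversine (kilometres by default)."""
    # Imported lazily so the module still loads where pyspark is not installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    def _distance_km(origin, destination):
        # Imported inside the UDF body so the import runs on each executor.
        from haversine import haversine
        # Spark passes array columns as lists; haversine expects (lat, lon) pairs.
        return float(haversine(tuple(origin), tuple(destination)))

    return F.udf(_distance_km, DoubleType())


def add_min_distance(pairs_df):
    """Return the minimum distance per origin.

    pairs_df is assumed to be the cross join of origins and destinations,
    with columns Origin (hypothetical), Origin_Geo and Destination_Geo.
    """
    from pyspark.sql import functions as F

    haversine_udf = make_haversine_udf()
    with_distance = pairs_df.withColumn(
        "Distance",
        haversine_udf(F.col("Origin_Geo"), F.col("Destination_Geo")),
    )
    # Keep only the smallest distance for each origin.
    return with_distance.groupBy("Origin").agg(
        F.min("Distance").alias("Min_Distance")
    )
```

The UDF is called once per row, so each call receives one origin pair and one destination pair rather than whole columns, which is why the scalar haversine function is used here instead of haversine_vector.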


If you cannot install the package on every node, you can write the function yourself instead (cf. "Haversine Formula in Python (Bearing and Distance between two GPS points)"). Note that the result depends heavily on the Earth radius you choose.
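One possible self-contained version of the formula is sketched below; it is an illustration, not the code from the original answer. The radius constant (the mean Earth radius of about 6371 km) is an assumption, and the names haversine_km and haversine_km_col are hypothetical. The second function expresses the same formula with Spark column functions, and it assumes latitude and longitude are available as separate numeric columns in degrees.

```python
import math

# Assumed mean Earth radius in kilometres; other conventions shift every result.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def haversine_km_col(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Same formula built from Spark Column expressions (no Python UDF)."""
    # Imported lazily so the plain-Python function works without pyspark.
    from pyspark.sql import functions as F

    dphi = F.radians(lat2 - lat1)
    dlambda = F.radians(lon2 - lon1)
    a = (F.sin(dphi / 2) ** 2
         + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2))
         * F.sin(dlambda / 2) ** 2)
    return F.asin(F.sqrt(a)) * (2 * radius_km)
```

The plain-Python function can be wrapped in a UDF in the usual way. The column version stays inside Spark's own expression engine, which generally avoids the cost of shipping every row to a Python worker.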

User contributions licensed under: CC BY-SA