
Tag: apache-spark

Spark-submit: undefined function parse_url

The function parse_url always works fine when we work with spark-sql through the sql-client (via the Thrift server), IPython, or pyspark-shell, but it doesn't work through spark-submit mode. The error is: So, we are using a workaround here: Please, any help with this issue? Answer Spark >= 2.0 Same as below, but use SparkSession with Hive support enabled: Spark < 2.0 parse_url is

How to sort by value efficiently in PySpark?

I want to sort my K,V tuples by V, i.e. by the value. I know that takeOrdered is good for this if you know how many you need: Using takeOrdered: Using Lambda I've checked out the question here, which suggests the latter. I find it hard to believe that takeOrdered is so succinct and yet it requires the same number
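The excerpt above contrasts taking the top N pairs with sorting every pair. As a rough plain-Python analogy (not Spark code, and not a claim about how Spark implements takeOrdered), the sketch below shows the two approaches on an in-memory list. The sample data and the helper names `top_n_by_value` and `sort_all_by_value` are hypothetical, introduced only for illustration.

```python
import heapq

# Hypothetical sample data: (key, value) pairs standing in for an RDD's contents.
pairs = [("a", 3), ("b", 10), ("c", 1), ("d", 7), ("e", 5)]


def top_n_by_value(kv_pairs, n):
    """Return the n pairs with the largest values without sorting everything."""
    return heapq.nlargest(n, kv_pairs, key=lambda kv: kv[1])


def sort_all_by_value(kv_pairs):
    """Return every pair, sorted by value in descending order."""
    return sorted(kv_pairs, key=lambda kv: kv[1], reverse=True)


top_two = top_n_by_value(pairs, 2)
fully_sorted = sort_all_by_value(pairs)
```

In PySpark itself, the corresponding calls would be along the lines of `rdd.takeOrdered(n, key=lambda kv: -kv[1])` when only the top N pairs are needed, and `rdd.sortBy(lambda kv: kv[1], ascending=False)` for a full sort, assuming `rdd` holds (key, value) tuples with numeric values.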

Remove duplicates from a dataframe in PySpark

I'm messing around with dataframes in pyspark 1.4 locally and am having issues getting the dropDuplicates method to work. It keeps returning the error: "AttributeError: 'list' object has no attribute 'dropDuplicates'". Not quite sure why as I seem to be following the syntax in the latest documentation. Answer It is not an import problem. You simply call .dropDuplicates() on a
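The answer text is cut off here, but the error message itself shows that the method was looked up on a plain Python list rather than on a DataFrame. The sketch below reproduces that mechanism without Spark, under the assumption that the list came from something like collecting rows to the driver. The `rows` data is hypothetical.

```python
# Hypothetical rows standing in for data that is already a plain Python list.
rows = [("alice", 1), ("bob", 2), ("alice", 1)]

try:
    # A list has no dropDuplicates method, so this attribute lookup fails.
    rows.dropDuplicates()
except AttributeError as exc:
    error_message = str(exc)

# Plain-Python de-duplication that keeps the first occurrence of each row.
unique_rows = list(dict.fromkeys(rows))
```

If the data is still a DataFrame, calling `dropDuplicates()` on the DataFrame itself, before any step that turns it into a list, avoids the error.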
